Title: | Extracts Sentiment and Sentiment-Derived Plot Arcs from Text
---|---
Description: | Extracts sentiment and sentiment-derived plot arcs from text using a variety of sentiment dictionaries conveniently packaged for consumption by R users. Implemented dictionaries include "syuzhet" (default), developed in the Nebraska Literary Lab; "afinn", developed by Finn Årup Nielsen; "bing", developed by Minqing Hu and Bing Liu; and "nrc", developed by Saif M. Mohammad and Peter D. Turney. Applicable references are available in README.md and in the documentation for the "get_sentiment" function. The package also provides a hack for implementing Stanford's coreNLP sentiment parser, as well as several methods for plot arc normalization.
Authors: | Matthew Jockers [aut, cre]
Maintainer: | Matthew Jockers <[email protected]>
License: | GPL-3
Version: | 1.0.7
Built: | 2024-11-08 04:45:37 UTC
Source: | https://github.com/mjockers/syuzhet
Converts input values into a standardized set of filtered and reverse transformed values for easy plotting and/or comparison.
get_dct_transform( raw_values, low_pass_size = 5, x_reverse_len = 100, scale_vals = FALSE, scale_range = FALSE )
Argument | Description
---|---
raw_values | the raw sentiment values calculated for each sentence
low_pass_size | the number of components to retain in the low-pass filtering. Default = 5.
x_reverse_len | the number of values to return via decimation. Default = 100.
scale_vals | logical; whether to normalize the values using the scale function. Default = FALSE. If TRUE, values are centered by subtracting their mean and scaled by dividing by their standard deviation. See ?scale.
scale_range | logical; whether to scale the values from -1 to +1. Default = FALSE. If TRUE, the lowest value in the vector is set to -1, the highest to +1, and all values in between are scaled accordingly.
The transformed values
s_v <- get_sentences("I begin this story with a neutral statement. Now I add a statement about how much I despise cats. I am allergic to them. I hate them. Basically this is a very silly test. But I do love dogs!") raw_values <- get_sentiment(s_v, method = "syuzhet") dct_vals <- get_dct_transform(raw_values) plot(dct_vals, type="l", ylim=c(-0.1,.1))
s_v <- get_sentences("I begin this story with a neutral statement. Now I add a statement about how much I despise cats. I am allergic to them. I hate them. Basically this is a very silly test. But I do love dogs!") raw_values <- get_sentiment(s_v, method = "syuzhet") dct_vals <- get_dct_transform(raw_values) plot(dct_vals, type="l", ylim=c(-0.1,.1))
Calls the NRC sentiment dictionary to calculate the presence of eight different emotions and their corresponding valence in a text.
get_nrc_sentiment( char_v, cl = NULL, language = "english", lowercase = TRUE, lexicon = NULL )
Argument | Description
---|---
char_v | a character vector
cl | optional; a cluster object for parallel analysis
language | a string
lowercase | logical; should tokens be converted to lowercase? Default = TRUE.
lexicon | a custom lexicon provided by the user, formatted as a data frame containing two columns labeled "word" and "sentiment". The "sentiment" column must indicate either the valence of the word (using the term "positive" or "negative") or its emotional category, using one of the following terms: "anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", "trust". For example, the English word "abandon" may appear in your lexicon twice, first with an emotional category of "fear" and again with a valence of "negative". Not all words necessarily need to have a valence indicator. See the example section below.
A data frame in which each row represents a sentence from the original text. The columns include one for each emotion type as well as positive and negative valence. The ten columns are: "anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", "trust", "negative", "positive".
Saif Mohammad and Peter Turney. "Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon." In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, June 2010, LA, California. See: http://saifmohammad.com/WebPages/lexicons.html
my_lexicon <- data.frame(
  word = c("love", "love", "hate", "hate"),
  sentiment = c("positive", "joy", "negative", "anger")
)
my_example_text <- "I am in love with R programming. I hate writing code in C."
s_v <- get_sentences(my_example_text)
get_nrc_sentiment(s_v, lexicon = my_lexicon)
Access the NRC dictionary to compute emotion types and valence for a set of words in the input vector.
get_nrc_values(word_vector, language = "english", lexicon = NULL)
Argument | Description
---|---
word_vector | a character vector
language | a string
lexicon | a data frame with at least the columns "word", "sentiment" and "value". If NULL, the package's internal NRC data are used.
A vector of values for the emotions and valence detected in the input vector.
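A minimal sketch, assuming the input is a vector of individual word tokens such as those produced by get_tokens():

```r
library(syuzhet)

# Tokenize a sentence, then tally NRC emotion and valence counts
words <- get_tokens("I love the sunshine but I fear the coming storm")
get_nrc_values(words)
```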
Chunks a vector of sentiment values into percentage-based segments (100 by default) and calculates the mean of each.
get_percentage_values(raw_values, bins = 100)
Argument | Description
---|---
raw_values | raw sentiment values
bins | the number of bins into which to split the input vector. Default = 100.
A vector of mean values from each chunk
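A minimal sketch using synthetic sentiment values; the rnorm() input is an illustrative stand-in for real sentence-level scores, chosen so the series is comfortably longer than the number of bins:

```r
library(syuzhet)

# Stand-in for sentence-level sentiment scores from a 500-sentence text
set.seed(1)
raw_values <- rnorm(500)

# Collapse the trajectory into 10 equal percentage-based bins of means
get_percentage_values(raw_values, bins = 10)
```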
Assigns sentiment values to words based on a preloaded dictionary. The default is the "syuzhet" dictionary.
get_sent_values(char_v, method = "syuzhet", lexicon = NULL)
Argument | Description
---|---
char_v | a string
method | a string indicating which sentiment dictionary to use
lexicon | a data frame with at least two columns named "word" and "value". Works with the "nrc" or "custom" method. If using the "custom" method, you must supply a custom lexicon as a data frame with the aforementioned columns.
A single numerical value (positive or negative) based on the assessed sentiment in the string
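A minimal sketch using the default "syuzhet" dictionary; the comment describing the return value as the net sentiment of the matched words is an inference from the description above, not from the package documentation:

```r
library(syuzhet)

# Returns one number per string: the net sentiment of the matched words
get_sent_values("I love this wonderful, happy story")
get_sent_values("I hate this dreadful, gloomy story")
```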
Parses a string into a vector of sentences.
get_sentences(text_of_file, fix_curly_quotes = TRUE, as_vector = TRUE)
Argument | Description
---|---
text_of_file | a text string
fix_curly_quotes | logical. If TRUE, curly quotes are converted to their ASCII equivalents before sentence parsing.
as_vector | logical. If TRUE, the result is returned as a character vector of sentences; if FALSE, as a list.
A Character Vector of Sentences
(x <- c(
  paste0(
    "Mr. Brown comes! He says hello. i give him coffee. i will ",
    "go at 5 p. m. eastern time. Or somewhere in between!go there"
  ),
  paste0(
    "Marvin K. Mooney Will You Please Go Now!",
    "The time has come.",
    "The time has come. The time is now. Just go. Go. GO!",
    "I don't care how."
  )
))
get_sentences(x)
get_sentences(x, as_vector = FALSE)
Iterates over a vector of strings and returns sentiment values based on a user-supplied method. The default method, "syuzhet", is a custom sentiment dictionary developed in the Nebraska Literary Lab. The default dictionary should be better tuned to fiction, as the terms were extracted from a collection of 165,000 human-coded sentences taken from a small corpus of contemporary novels. At the time of this release, Syuzhet will only work with languages that use Latin character sets. This effectively means that "Arabic", "Bengali", "Chinese_simplified", "Chinese_traditional", "Greek", "Gujarati", "Hebrew", "Hindi", "Japanese", "Marathi", "Persian", "Russian", "Tamil", "Telugu", "Thai", "Ukrainian", "Urdu", and "Yiddish" are not supported, even though these languages are part of the extended NRC dictionary.
get_sentiment( char_v, method = "syuzhet", path_to_tagger = NULL, cl = NULL, language = "english", lexicon = NULL, regex = "[^A-Za-z']+", lowercase = TRUE )
Argument | Description
---|---
char_v | a vector of strings for evaluation
method | a string indicating which sentiment method to use. Options include "syuzhet", "bing", "afinn", "nrc", and "stanford". See the references for more detail on each method.
path_to_tagger | local path to the Stanford CoreNLP installation (used only with the "stanford" method)
cl | optional; a cluster object for parallel sentiment analysis
language | a string. Only used by the "nrc" method.
lexicon | a data frame with at least two columns labeled "word" and "value"
regex | a regular expression for splitting words. Default is "[^A-Za-z']+".
lowercase | logical; should tokens be converted to lowercase? Default = TRUE.
Return value is a numeric vector of sentiment values, one value for each input sentence.
Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA. See: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
Saif Mohammad and Peter Turney. "Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon." In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, June 2010, LA, California. See: http://saifmohammad.com/WebPages/lexicons.html
Finn Årup Nielsen. "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs", Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts':Big things come in small packages 718 in CEUR Workshop Proceedings : 93-98. 2011 May. http://arxiv.org/abs/1103.2903. See: http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010
Manning, Christopher D., Surdeanu, Mihai, Bauer, John, Finkel, Jenny, Bethard, Steven J., and McClosky, David. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60. See: http://nlp.stanford.edu/software/corenlp.shtml
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank." Conference on Empirical Methods in Natural Language Processing (EMNLP 2013). See: http://nlp.stanford.edu/sentiment/
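A short illustrative sketch comparing three of the bundled dictionaries on the same sentences:

```r
library(syuzhet)

s_v <- get_sentences("I love when it rains. I hate when it pours. The sky is gray.")

# One numeric value per sentence, per method
get_sentiment(s_v, method = "syuzhet")
get_sentiment(s_v, method = "bing")
get_sentiment(s_v, method = "afinn")
```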
Get the sentiment dictionaries used in syuzhet.
get_sentiment_dictionary(dictionary = "syuzhet", language = "english")
Argument | Description
---|---
dictionary | a string indicating which sentiment dictionary to return. Options include "syuzhet", "bing", "afinn", and "nrc".
language | a string indicating which language to use when requesting the NRC dictionary in a language other than English
get_sentiment_dictionary()
get_sentiment_dictionary('bing')
get_sentiment_dictionary('afinn')
get_sentiment_dictionary('nrc', language = "spanish")
Call the Stanford Sentiment tagger with a vector of strings. The Stanford tagger automatically detects sentence boundaries and treats each sentence as a distinct instance to measure. As a result, the vector that gets returned will not be the same length as the input vector.
get_stanford_sentiment(text_vector, path_to_stanford_tagger)
Argument | Description
---|---
text_vector | a vector of strings
path_to_stanford_tagger | a local file path indicating where the CoreNLP package is installed
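Because this function depends on a local Java installation of Stanford CoreNLP, the sketch below is left unevaluated; the path is a placeholder, not a real install location:

```r
library(syuzhet)

s_v <- get_sentences("I love when it rains. I hate when it pours.")

## Not run: requires a local Stanford CoreNLP installation
# tagger_path <- "/path/to/stanford-corenlp-full-2014-01-04"  # placeholder path
# get_stanford_sentiment(s_v, tagger_path)
```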
Loads a file as a single text string.
get_text_as_string(path_to_file)
Argument | Description
---|---
path_to_file | file path
A character vector of length 1 containing the text of the file in the path_to_file argument.
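A minimal sketch; it assumes the sample text used in the package vignette ("portrait.txt") is present in your installation:

```r
library(syuzhet)

# Locate the sample text shipped in the package's extdata directory
path <- system.file("extdata", "portrait.txt", package = "syuzhet")
novel <- get_text_as_string(path)
nchar(novel)  # the entire file as a single character string
```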
Parses a string into a vector of word tokens.
get_tokens(text_of_file, pattern = "\\W", lowercase = TRUE)
Argument | Description
---|---
text_of_file | a text string
pattern | a regular expression for token breaking
lowercase | logical; should tokens be converted to lowercase? Default = TRUE.
A Character Vector of Words
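A minimal sketch using the default token-breaking pattern:

```r
library(syuzhet)

# Split on non-word characters and lowercase the result (both defaults)
get_tokens("The Quick Brown Fox -- it jumped!", pattern = "\\W")
```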
Please Note: This function is maintained for legacy purposes. Users should consider using get_dct_transform() instead. Converts input values into a standardized set of filtered and reverse transformed values for easy plotting and/or comparison.
get_transformed_values( raw_values, low_pass_size = 2, x_reverse_len = 100, padding_factor = 2, scale_vals = FALSE, scale_range = FALSE )
Argument | Description
---|---
raw_values | the raw sentiment values calculated for each sentence
low_pass_size | the number of components to retain in the low-pass filtering. Default = 2, per the function signature.
x_reverse_len | the number of values to return. Default = 100.
padding_factor | the amount of zero-value padding to add to raw_values, as a factor of the size of raw_values. Default = 2.
scale_vals | logical; whether to normalize the values using the scale function. Default = FALSE. If TRUE, values are centered by subtracting their mean and scaled by dividing by their standard deviation. See ?scale.
scale_range | logical; whether to scale the values from -1 to +1. Default = FALSE. If TRUE, the lowest value in the vector is set to -1, the highest to +1, and all values in between are scaled accordingly.
The transformed values
s_v <- get_sentences("I begin this story with a neutral statement. Now I add a statement about how much I despise cats. I am allergic to them. Basically this is a very silly test.") raw_values <- get_sentiment(s_v, method = "bing") get_transformed_values(raw_values)
s_v <- get_sentences("I begin this story with a neutral statement. Now I add a statement about how much I despise cats. I am allergic to them. Basically this is a very silly test.") raw_values <- get_sentiment(s_v, method = "bing") get_transformed_values(raw_values)
This function calculates the "emotional entropy" of a string based on the amount of conflicting valence. Emotional entropy is a measure of unpredictability and surprise based on the consistency or inconsistency of the emotional language in a given string. A string with conflicting emotional language may be said to express a "mixed message."
mixed_messages(string, remove_neutral = TRUE)
Argument | Description
---|---
string | a string of words
remove_neutral | logical; whether to remove words with neutral valence before computing the emotional entropy of the string. Default = TRUE.
A vector containing two named values.
text_v <- "That's the love and the hate of it" mixed_messages(text_v) # [1] 1.0 0.5 = high (1.0, 0.5) entropy mixed_messages(text_v, TRUE) # Example of a predictable message i.e. no surprise text_v <- "I absolutley love, love, love it." mixed_messages(text_v) # [1] 0 0 = low entropy e.g. totally consistent emotion, i.e. no surprise mixed_messages(text_v, FALSE) # A more realistic example with a lot of mixed emotion. text_v <- "I loved the way he looked at me but I hated that he was no longer my lover" mixed_messages(text_v) # [1] 0.91829583 0.05101644 pretty high entropy. mixed_messages(text_v, FALSE) # A more realistic example without a lot of mixed emotion. text_v <- "I loved the way he looked at me and I was happy that he was my lover." mixed_messages(text_v) # [1] 0 0 low entropy, no surprise. mixed_messages(text_v, FALSE) # An urealistic example with a lot of mixed emotion. text_v <- "I loved, hated and despised the way he looked at me and I was happy as hell that he was my white hot lover." mixed_messages(text_v) mixed_messages(text_v, FALSE)
text_v <- "That's the love and the hate of it" mixed_messages(text_v) # [1] 1.0 0.5 = high (1.0, 0.5) entropy mixed_messages(text_v, TRUE) # Example of a predictable message i.e. no surprise text_v <- "I absolutley love, love, love it." mixed_messages(text_v) # [1] 0 0 = low entropy e.g. totally consistent emotion, i.e. no surprise mixed_messages(text_v, FALSE) # A more realistic example with a lot of mixed emotion. text_v <- "I loved the way he looked at me but I hated that he was no longer my lover" mixed_messages(text_v) # [1] 0.91829583 0.05101644 pretty high entropy. mixed_messages(text_v, FALSE) # A more realistic example without a lot of mixed emotion. text_v <- "I loved the way he looked at me and I was happy that he was my lover." mixed_messages(text_v) # [1] 0 0 low entropy, no surprise. mixed_messages(text_v, FALSE) # An urealistic example with a lot of mixed emotion. text_v <- "I loved, hated and despised the way he looked at me and I was happy as hell that he was my white hot lover." mixed_messages(text_v) mixed_messages(text_v, FALSE)
Rescales transformed values to a range from -1 to +1.
rescale(x)
Argument | Description
---|---
x | a vector of values
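A minimal sketch; the expected output in the comment assumes a linear min-max mapping onto the interval from -1 to +1, consistent with the description above:

```r
library(syuzhet)

# The minimum maps to -1, the maximum to +1, values in between linearly
rescale(c(2, 4, 6, 8, 10))
# expected: -1.0 -0.5  0.0  0.5  1.0 (assuming linear min-max scaling)
```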
Rescales input values to two scales (0 to 1 and -1 to 1) on the y-axis and also creates a scaled vector of x axis values from 0 to 1. This function is useful for plotting and plot comparison.
rescale_x_2(v)
Argument | Description
---|---
v | a vector of values
A list of three vectors (x, y, z). x is a vector of values from 0 to 1 equal in length to the input vector v. y is a scaled (from 0 to 1) vector of the input values equal in length to the input vector v. z is a scaled (from -1 to +1) vector of the input values equal in length to the input vector v.
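A minimal sketch; the sine curve is an illustrative stand-in for a transformed sentiment trajectory:

```r
library(syuzhet)

# Stand-in for transformed sentiment values
vals <- sin(seq(0, 2 * pi, length.out = 50))

scaled <- rescale_x_2(vals)
str(scaled)  # list with x (0 to 1), y (0 to 1), z (-1 to +1)

# Plot the -1 to +1 scaling against normalized narrative time
plot(scaled$x, scaled$z, type = "l")
```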
A simple function for comparing three smoothers
simple_plot( raw_values, title = "Syuzhet Plot", legend_pos = "top", lps = 10, window = 0.1 )
Argument | Description
---|---
raw_values | the raw sentiment values calculated for each sentence
title | a title for the resulting image
legend_pos | position of the legend
lps | size of the low-pass filter, i.e. the number of low-frequency components to retain
window | size of the rolling window for the rolling mean, expressed as a percentage of the input length
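A closing illustrative sketch; the synthetic series is a stand-in for real sentence-level sentiment values and is long enough for the default rolling window (10 percent of the input):

```r
library(syuzhet)

# Simulate a noisy sentence-level sentiment trajectory
set.seed(42)
raw_values <- sin(seq(0, 3 * pi, length.out = 300)) + rnorm(300, sd = 0.5)

# Draw the three smoothed versions of the trajectory in one figure
simple_plot(raw_values, title = "Smoother Comparison", lps = 10, window = 0.1)
```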