sentiment analysis with tidytext in r
TidyText is an incredibly effective and approachable package in R for text mining that I stumbled across when flicking through some of the rstudio::conf 2017 materials a few days ago.
There's loads of information available about the TidyText package, along with its underlying philosophy, but this post focuses on an implementation of one small aspect of the package's power: the ability to analyse and plot the sentiment of words in documents.
my test data
I'm using a small dataset for this walkthrough that consists of three court judgments: two judgments of the UK Supreme Court and one from its predecessor, the Appellate Committee of the House of Lords:
- Donoghue v Stevenson - a classic case on the tort of negligence
- Evans v Attorney General - the so-called "black spider memos" case
- R (Miller) v SoS for BREXIT - the recent judgment of the Supreme Court addressing the question of whether the government can trigger Article 50 without express Parliamentary authorisation.
The subject matter of the dataset isn't really that important. My purpose here is to use the tools TidyText makes available to chart the emotional attributes of the words used in these judgments over the course of each document.
get the data ready for analysis
First off, we need to get our environment ready and set our working directory:
# Environment
library(tidyverse)
library(tidytext)
library(stringr)

# Data - setwd to the folder that contains the data you want to work with
setwd("~/Documents/R/TidyText")
Next, we're going to get the text of the files (in my case, the three judgments) into a data frame:
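# One row per file; `text` is a list-column holding each file's lines
case_words <- tibble(file = c("evans.txt", "miller.txt", "donoghue.txt")) %>%
  mutate(text = map(file, read_lines))
This gives us a tibble with one row per file: a file column holding the file name and a text list-column holding the lines of that file. We now need to unnest that tibble so that each line of text gets its own row, alongside the document it came from and a line number.
case_words <- case_words %>%
  unnest(text) %>%
  # line_number runs consecutively across all three files
  mutate(line_number = 1:n(),
         file = basename(file))

# Fix the order in which the files appear in plots
# (the levels must match the file names, including the .txt extension)
case_words$file <- forcats::fct_relevel(case_words$file,
                                        c("evans.txt", "miller.txt", "donoghue.txt"))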
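To see what the unnesting step does in isolation, here is a minimal sketch on made-up data. The `toy` tibble and its file names are hypothetical stand-ins for the result of reading the real judgments; only the shape of the data matters.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical stand-in for the result of map(file, read_lines):
# one row per file, with that file's lines held in a list-column.
toy <- tibble(
  file = c("a.txt", "b.txt"),
  text = list(c("first line of a", "second line of a"),
              c("only line of b"))
)

# unnest() expands the list-column so that each line gets its own row;
# row_number() then numbers the lines consecutively across both files.
toy_lines <- toy %>%
  unnest(text) %>%
  mutate(line_number = row_number())

toy_lines
```

The two-row tibble becomes a three-row tibble, with the file name repeated for every line that came from that file.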
Once the text has been tokenised (the final step in this section), we get a tibble that looks like this:
# A tibble: 96,318 × 3
        file line_number      word
      <fctr>       <int>     <chr>
1  evans.txt           1      lord
2  evans.txt           1 neuberger
3  evans.txt           1      with
4  evans.txt           1      whom
5  evans.txt           1      lord
6  evans.txt           1      kerr
7  evans.txt           1       and
8  evans.txt           1      lord
9  evans.txt           1      reed
10 evans.txt           1     agree
# ... with 96,308 more rows
You can check the state of your table at this point by running,
The last thing we need to do before we're ready to begin computing the sentiment of the data is to tokenise the words in our tibble:
case_words <- case_words %>%
  unnest_tokens(word, text)
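As a small illustration of what tokenising does, the sketch below runs `unnest_tokens()` on two invented sentences (the `toy` tibble is hypothetical, not taken from the judgments). With its default settings the function splits the text into single words, lowercases them and strips punctuation.

```r
library(tibble)
library(tidytext)

# Two invented lines of text, standing in for lines of a judgment
toy <- tibble(
  line_number = 1:2,
  text = c("Lord Neuberger agrees.", "The appeal is DISMISSED!")
)

# One row per word; the other columns (here line_number) are carried along
toy_words <- unnest_tokens(toy, word, text)

toy_words$word
```

Each word keeps the line number of the line it came from, which is what lets us track sentiment by position later on.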
sentiment analysis of the text
We are ready to start analysing the sentiment of the data. TidyText is armed with three different sentiment dictionaries: afinn, nrc and Bing. The first thing we're going to do is get a bird's-eye view of the different sentiment profiles of each judgment using the nrc dictionary and plot the results using ggplot:
case_words %>%
  # keep only the words that appear in the nrc dictionary
  inner_join(get_sentiments("nrc"), by = "word") %>%
  # count the words for each sentiment in chunks of 20 lines
  group_by(index = line_number %/% 20, file, sentiment) %>%
  summarize(n = n()) %>%
  ggplot(aes(x = index, y = n, fill = file)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  facet_wrap(~ sentiment, ncol = 3)
The x-axis of each graph represents the position within each document from beginning to end, in chunks of 20 lines; the y-axis shows how many words in each chunk are associated with the sentiment under analysis.
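The chunking relies on R's integer division operator, `%/%`. A minimal base R sketch, using a handful of made-up line numbers, shows how lines are assigned to chunks:

```r
# Hypothetical line numbers, chosen to sit either side of the chunk boundaries
line_number <- c(1, 19, 20, 39, 40, 61)

# Integer division by 20 drops the remainder, so lines 1-19 fall in chunk 0,
# lines 20-39 in chunk 1, lines 40-59 in chunk 2, and so on
index <- line_number %/% 20

index
# 0 0 1 1 2 3
```

Because line numbering starts at 1 rather than 0, the first chunk covers 19 lines and every later chunk covers 20.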
We can get a closer look at the emotional fluctuation by plotting an analysis using the afinn and Bing dictionaries:
case_words %>%
  # left joins keep every word; words missing from a dictionary get NA
  left_join(get_sentiments("bing"), by = "word") %>%
  left_join(get_sentiments("afinn"), by = "word") %>%
  group_by(index = line_number %/% 20, file) %>%
  # the afinn column is called `score` in the tidytext version used here;
  # more recent releases name it `value`
  summarize(afinn = mean(score, na.rm = TRUE),
            bing = sum(sentiment == "positive", na.rm = TRUE) -
              sum(sentiment == "negative", na.rm = TRUE)) %>%
  gather(lexicon, lexicon_score, afinn, bing) %>%
  ggplot(aes(x = index, y = lexicon_score, colour = file)) +
  geom_smooth(stat = "identity") +
  facet_wrap(~ lexicon, scales = "free_y") +
  scale_x_continuous("Location in judgment", breaks = NULL) +
  scale_y_continuous("Lexicon Score")
The Bing analysis (pictured right) appears to provide a slightly more stable view. Instances moving above the zero line indicate positive emotion; instances moving below it indicate negative emotion.
For example, if we take the judgment in R (Miller), we can see that the first half of the judgment is broadly positive, but the line dips suddenly around the middle. It indicates that the second half of the judgment text is slightly more negative than the first half, but rises to its peak of positivity just before the end of the text.
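To make the two scores concrete, here is a minimal sketch of the same summary on invented values. The `toy` tibble is hypothetical: it mimics words that have already been joined to both dictionaries, with NA where a word is missing from one of them.

```r
library(tibble)
library(dplyr)

# Hypothetical joined data: three words in chunk 0 and three in chunk 1
toy <- tibble(
  index     = c(0, 0, 0, 1, 1, 1),
  sentiment = c("positive", "negative", "positive", "negative", "negative", NA),
  score     = c(2, -3, 1, -2, NA, NA)
)

toy_scores <- toy %>%
  group_by(index) %>%
  summarize(
    # afinn: average numeric score of the scored words in the chunk
    afinn = mean(score, na.rm = TRUE),
    # bing: number of positive words minus number of negative words
    bing = sum(sentiment == "positive", na.rm = TRUE) -
      sum(sentiment == "negative", na.rm = TRUE)
  )

toy_scores
```

Chunk 0 has two positive words and one negative word, so its Bing score is 1 and it sits above the zero line; chunk 1 has no positive words and two negative ones, so its Bing score is -2 and it sits below.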
The text of the judgment in Donoghue is considerably more negative. The curve sharply dips as the judgment opens, turns more positive towards the middle of the document, takes a negative turn once more and resolves to a more positive state towards the conclusion.