Text-mining

A first run with Hierarchical Dirichlet Process

I've been experimenting with Latent Dirichlet Allocation (LDA) for a while in R, but I was looking for a topic model algorithm that does not require the number of topics (k) to be defined a priori, before the algorithm is applied to the text data I wanted to work with.

As a relative newcomer to topic modelling, I hadn't even heard of David Blei or Chong Wang, both of whom I now know to be pioneers in modern topic modelling. Quite by chance, I stumbled on Wang and Blei's C++ implementation of the Hierarchical Dirichlet Process (HDP), a topic model in which the data determine the number of topics.

Getting HDP to work required quite a bit of wrangling, and I couldn't find any walkthroughs suitable for novices like me, so I thought it would be worth writing up how I managed to get it working with a small sample set of text data.

What follows is far from a perfect (or even good) demonstration of how to apply HDP to text data, but these steps did work for me. Here goes...


Sample Data

It goes without saying that the very first step is to assemble a corpus of text data to which we'll apply the HDP algorithm. In my case, I used ten English judgments (all recent decisions of the Criminal Division of the Court of Appeal) in .txt format. Save these into a folder.


Getting the Sample Data ready for HDP

Before we even go near Wang & Blei's algorithm, we need to prepare the sample data in a particular way. 

The algorithm requires the data to be in LDA-C format, which looks like this:

 [M] [term_1]:[count] [term_2]:[count] ...[term_N]:[count]
where [M] is the number of unique terms in the document, each [term_n] is an integer index into the vocabulary, and the [count] associated with each term is the number of times that term appears in the document.
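For example (an invented illustration rather than real output from my corpus), a document in which the terms with vocabulary indices 0, 6 and 12 appear twice, once and three times respectively would be represented as:

 3 0:2 6:1 12:3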

This presents the first problem, because our data are words in .txt files. Fortunately, there's an excellent Python program called text2ldac that comes to our rescue: it takes the .txt data and outputs the files we need, in the form we need them.

Clone text2ldac from its git repository.

Once you've pulled down text2ldac, you're ready to process your text files. To do this, go to the command line and run the following command (adjusting the example to suit your own directories and filenames):

$ python text2ldac.py --stopwords stopwords.txt /Users/danielhoadley/Documents/Topic_Model/text2ldac/input

All that's happening here is that we're running text2ldac.py, using the --stopwords flag to pass in our stop words (which, in my case, live in a file named stopwords.txt) and then passing in the directory that contains our .txt files.

This will output three files: a .dat file (e.g. input.dat), which is the all-important LDA-C formatted input for the HDP algorithm; a .vocab file, which lists every word in the corpus, one word per line (the position of a word in this file is the term index used in the .dat file); and a .dmap file, which lists the input .txt documents.


Time to run HDP

Now that we have our data in the format required by the HDP algorithm, we're ready to run it.

For convenience, I recommend copying the three files generated by text2ldac into the folder you're going to run HDP from, but you can leave them wherever you like. 

Go to the folder containing the HDP program files and run the following command (again, adjusting to your own folders and filenames):

$ ./hdp --algorithm train --data /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat --directory train_dir

Let's unpack this a bit: 

1. ./hdp invokes the HDP program

2. The --algorithm flag sets the algorithm to be applied, namely train

3. The --data flag points to the .dat file produced by text2ldac

4. --directory train_dir is telling HDP to place the output files in a directory called train_dir

You'll know you've successfully executed the program if the prompt begins printing something that looks like this:

Program starts with following parameters:
algorithm:= train
data_path:= /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat
directory:= trainer
max_iter= 1000
save_lag= 100
init_topics = 0
random_seed = 1488746763
gamma_a = 1.00
gamma_b = 1.00
alpha_a = 1.00
alpha_b = 1.00
eta = 0.50
#restricted_scans = 5
split-merge = no
sampling hyperparam = no

reading data from /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat

number of docs: 9
number of terms : 5795
number of total words : 35865

starting with 7 topics 

iter = 00000, #topics = 0008, #tables = 0076, gamma = 1.00000, alpha = 1.00000, likelihood = -305223.54210
iter = 00001, #topics = 0008, #tables = 0079, gamma = 1.00000, alpha = 1.00000, likelihood = -301582.68017
iter = 00002, #topics = 0008, #tables = 0079, gamma = 1.00000, alpha = 1.00000, likelihood = -300273.98808

I'm not going to go into all of these parameters here, but the main one to note is max_iter, which sets the number of times the algorithm passes over the data.

Note also that the algorithm has decided for itself how many topics it's going to start with (in the above example, 7).

The algorithm will produce a bunch of .dat files in the output directory. The one we're really interested in has a name like mode-topics.dat, rather than something like 00300-topics.dat (the latter having been produced on the 300th iteration of the algorithm's run).


Printing the topics determined by HDP

Wang & Blei very helpfully provide an R script, print.topics.r, which you can use to turn the results of the algorithm into human-readable form. This is helpful because the raw output generated by the algorithm looks like this:

00001 00001 00007 00007 00007
00011 00000 00000 00000 00000

The key thing at this stage is to keep hold of the two files you'll need as input for the R script: the mode-topics.dat file (or similarly named file) generated by HDP and the .vocab file generated by text2ldac.

Go back to the command line and navigate to the folder that contains print.topics.r. First, you'll need to make the R script executable, so run:

$ sudo chmod +x print.topics.r

Then run

$ ./print.topics.r /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/trainer/mode-topics.dat /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.vocab topics.dat 4

1. ./print.topics.r runs the R script

2. The first argument is the path to the mode-topics.dat file produced by the HDP algorithm

3. The second argument is the path to the .vocab file produced by text2ldac

4. The third argument is the name of the file you want to output the human-readable result to, e.g. topics.dat

5. Finally, the fourth argument, which isn't mandatory, is the number of terms per topic you wish to output - the default is 5.


The output

If everything has worked as it ought to have done, you'll be able to open the topics.dat file (in RStudio, for example) and see the top terms for each topic.
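If you'd rather load it as a data frame, something along these lines should do the trick (just a sketch: it assumes topics.dat is whitespace-separated and sitting in your working directory, so adjust read.table's arguments if the layout differs):

# Load the human-readable topics file produced by print.topics.r
topics <- read.table("topics.dat", header = FALSE, stringsAsFactors = FALSE)
topics # inspect it - you should see the top terms for each topic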

The output is by no means perfect the first time around. Better results will probably depend on hitting the source data with a raft of stop words and tweaking HDP's many parameters, but it's a good start.


Sentiment in Case Law


For the past few months, I've been exploring various methods of unlocking interesting data from case law. This post discusses the ways in which data science techniques can be used to analyse the sentiment of the text of judgments.

The focus in this post is mainly technical and describes the steps I've taken using a statistical programming language, called R, to extract an "emotional" profile of three cases.

I have yet to arrive at any firm hypothesis of how this sort of technique could be used to draw conclusions that would be of use in court or during case preparation, but I hope some of the initial results canvassed below are of interest to a few.


TidyText is an incredibly effective and approachable text-mining package for R that I stumbled across when flicking through some of the rstudio::conf 2017 materials a few days ago.

There's loads of information available about the TidyText package, along with its underlying philosophy, but this post focuses on an implementation of one small aspect of the package's power: the ability to analyse and plot the sentiment of words in documents.

MY TEST DATA

I'm using a small dataset for this walkthrough, consisting of three court judgments: two judgments of the UK Supreme Court and one from its predecessor, the Appellate Committee of the House of Lords.

The subject matter of the dataset isn't really that important. My purpose here is to use the tools TidyText makes available to chart the emotional attributes of the words used in these judgments over the course of each document. 

GET THE DATA READY FOR ANALYSIS

First off, we need to get our environment ready and set our working directory:

# Environment
library(tidyverse)
library(tidytext)
library(stringr)

# Data - setwd to the folder that contains the data you want to work with
setwd("~/Documents/R/TidyText")

Next, we're going to get the text of the files (in my case, the three judgments) into a data frame:

case_words <- data_frame(file = paste0(c("evans.txt", "miller.txt", "donoghue.txt"))) %>%
  mutate(text = map(file, read_lines))

This gives us a tibble with one row per judgment: a file column holding the file name and a text list-column holding the lines of each document. We now need to unnest that tibble so that we have the document, line numbers and words as columns.

case_words <- case_words %>%
  unnest() %>%
  mutate(line_number = 1:n(),
         file = str_sub(basename(file), 1))

case_words$file <- forcats::fct_relevel(case_words$file, c("evans.txt", "miller.txt", "donoghue.txt"))

We get a tibble that looks like this:

# A tibble: 96,318 × 3
   file      line_number word
   <fctr>          <int> <chr>
 1 evans.txt           1 lord
 2 evans.txt           1 neuberger
 3 evans.txt           1 with
 4 evans.txt           1 whom
 5 evans.txt           1 lord
 6 evans.txt           1 kerr
 7 evans.txt           1 and
 8 evans.txt           1 lord
 9 evans.txt           1 reed
10 evans.txt           1 agree
# ... with 96,308 more rows

You can check the state of your table at this point by running:

head(case_words)

The last thing we need to do before we're ready to begin computing the sentiment of the data is to tokenise the words in our tibble:

case_words <- case_words %>%
  unnest_tokens(word, text)

SENTIMENT ANALYSIS OF THE JUDGMENTS

We are ready to start analysing the sentiment of the data. TidyText comes armed with three sentiment dictionaries: AFINN, NRC and Bing. The first thing we're going to do is get a bird's-eye view of the different sentiment profiles of each judgment using the NRC dictionary and plot the results with ggplot:

case_words %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(index = line_number %/% 20, file, sentiment) %>%
  summarize(n = n()) %>%
  ggplot(aes(x = index, y = n, fill = file)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  facet_wrap(~ sentiment, ncol = 3)

A bird's-eye view of the ten NRC sentiment profiles for each judgment

The x-axis of each graph represents the position within each document from beginning to end; the y-axis quantifies the intensity of the sentiment under analysis.
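The ten facets come from the NRC lexicon itself, which tags each word with one or more of eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise and trust) plus positive and negative. As a quick aside, you can check the categories and how many words fall under each one yourself:

# Count the words in each nrc sentiment category
get_sentiments("nrc") %>%
  count(sentiment)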

We can get a closer look at the emotional fluctuation by plotting an analysis using the AFINN and Bing dictionaries:

case_words %>%
  left_join(get_sentiments("bing")) %>%
  left_join(get_sentiments("afinn")) %>%
  group_by(index = line_number %/% 20, file) %>%
  summarize(afinn = mean(score, na.rm = TRUE),
            bing = sum(sentiment == "positive", na.rm = TRUE) - sum(sentiment == "negative", na.rm = TRUE)) %>%
  gather(lexicon, lexicon_score, afinn, bing) %>%
  ggplot(aes(x = index, y = lexicon_score, colour = file)) +
  geom_smooth(stat = "identity") +
  facet_wrap(~ lexicon, scales = "free_y") +
  scale_x_continuous("Location in judgment", breaks = NULL) +
  scale_y_continuous("Lexicon Score")

Sentiment curves using the AFINN and Bing sentiment dictionaries

FINDINGS

The Bing analysis (pictured right) appears to provide a slightly more stable view. Movements above the zero line indicate positive emotion; movements below it indicate negative emotion.

For example, if we take the judgment in R (Miller), we can see that the first half of the judgment is broadly positive, but then dips suddenly around the middle. The line indicates that the second half of the judgment text is slightly more negative than the first, though it rises to its peak of positivity just before the end of the text.

The text of the judgment in Donoghue is considerably more negative. The curve sharply dips as the judgment opens, turns more positive towards the middle of the document, takes a negative turn once more and resolves to a more positive state towards the conclusion.

Algorithmically Topic Modelling Judgments

Like many others that work in the information/publishing sector, I have developed a keen interest in learning how to make use of machine learning and text mining technology to enhance the information I work on (in my working context, the information is case law). 

Over the Christmas break I started to experiment with a statistical programming language called R, which has a decent suite of text mining functionality available right out of the box.  

I wanted to put R to use to tackle a simple and practical question: is it possible to accurately classify a judgment algorithmically with relative ease?

In order to test R against this particular use-case, I constructed a simple experiment.

Build sample corpus of data

To run the experiment I needed a small batch of sample judgments. I selected nine recent judgments from BAILII: three from the Criminal Division of the Court of Appeal; three from the Family Court; and three from the Commercial Court. 

Success Factors

For R to be successful in the experiment, it would need to algorithmically classify the nine cases to the correct topic, i.e. the three criminal cases should be grouped together, as should the commercial cases, etc. 

Method

I’ll write up more detailed notes on the method and code used in this experiment, but essentially the following steps would be taken (a rough sketch of the pipeline follows the list):

  1. Load the nine judgments into R as text files
  2. Pre-process the text files to remove unwanted material (like punctuation, numbers and standard stop words).
  3. Analyse the most frequently occurring terms in the corpus of judgments and remove additional stop words that would be common across the entire dataset.
  4. Apply a topic modelling algorithm (the Latent Dirichlet allocation (LDA) model) to the corpus of judgments to algorithmically allocate each of the judgments to one of three topics (criminal, family or commercial).
  5. Match up the judgments to a topic
  6. Produce a matrix of the key terms governing allocation into a topic
  7. Produce a matrix detailing the respective probability for each case allocation to a topic.
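
To give a flavour of what that pipeline might look like in practice, here's a rough sketch using the tm and topicmodels packages. The file paths, pre-processing choices and random seed are purely illustrative, so treat this as a starting point rather than the exact code behind the results below:

library(tm)
library(topicmodels)

# 1. Load the judgments (one .txt file per judgment) into a corpus
corpus <- VCorpus(DirSource("~/Documents/R/Judgments", pattern = "\\.txt$"))

# 2. Pre-process: lower case, strip punctuation, numbers and stop words, then stem
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)

# 3. Build a document-term matrix (findFreqTerms(dtm) helps spot further stop words to remove)
dtm <- DocumentTermMatrix(corpus)

# 4. Fit an LDA model with three topics
lda_model <- LDA(dtm, k = 3, control = list(seed = 1234))

# 5. & 6. The topic assigned to each judgment and the key terms per topic
topics(lda_model)
terms(lda_model, 6)

# 7. Per-document topic probabilities
round(posterior(lda_model)$topics, 2)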

Results

The LDA algorithm did a decent job of allocating the judgments to the three topics. First off, we can take a look at the key terms the model used to allocate each case to a topic. The first column in the table below, for example, sets out the terms viewed as most relevant to allocating a judgment to the family topic.

  Family     Commercial  Criminal
1 children   claus       court
2 order      claim       evid
3 court      polici      appel
4 evid       vote        appeal
5 child      period      case
6 made       manag       convict

Now let's look at the actual allocation of judgments to topics:

                  V1
a.txt            Fam
adamantine.txt  Comm
arc.txt         Comm
canary.txt      Comm
f.txt            Fam
n.txt            Fam
r_v_amjad.txt   Crim
r_v_burke.txt   Crim
r_v_garland.txt Crim

Each of the nine judgments has been allocated to the correct topic. We can also have a deeper look at the probability for each allocation.

                Family Commercial Criminal
a.txt             0.66       0.13     0.21
adamantine.txt    0.05       0.92     0.04
arc.txt           0.06       0.90     0.04
canary.txt        0.05       0.91     0.04
f.txt             0.82       0.03     0.15
n.txt             0.86       0.03     0.11
r_v_amjad.txt     0.11       0.08     0.82
r_v_burke.txt     0.14       0.10     0.76
r_v_garland.txt   0.12       0.02     0.85

Conclusion

This experiment shows that it is possible to use a language like R to accurately fit a topic model onto the text of judgments. However, there are two obvious limitations associated with this particular implementation of text classification.

First, the number of topics needs to be defined a priori. That's absolutely fine where you have an idea of the number of topics in advance of the modelling (as in this experiment), but if you don't know how many topics there are in the corpus, you'll probably have to run the model more than once and experiment with the number of topics.
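If you do find yourself guessing, one rough-and-ready option (again, just a sketch, reusing the dtm and the topicmodels setup from the earlier example) is to fit the model at several values of k and compare perplexity scores, with lower values generally indicating a better fit:

# Fit LDA at a range of k values and compare perplexity
candidate_k <- 2:6
perplexities <- sapply(candidate_k, function(k) {
  model <- LDA(dtm, k = k, control = list(seed = 1234))
  perplexity(model, dtm)
})
data.frame(k = candidate_k, perplexity = perplexities)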

The second issue is one of scalability. This experiment used a small corpus (only nine judgments). The larger and more varied the corpus, the more heavy lifting is required to stage the data well. 

However, regardless of these limitations, the fantastic thing about this modelling approach is the ease with which it can be deployed.