# Algorithmically Topic Modelling Judgments

Like many others that work in the information/publishing sector, I have developed a keen interest in learning how to make use of machine learning and text mining technology to enhance the information I work on (in my working context, the information is case law).

Over the Christmas break I started to experiment with a statistical programming language called R, which has a decent suite of text mining functionality available right out of the box.

I wanted to put R to use to tackle a simple and practical question: is it possible to accurately classify a judgment algorithmically with relative ease?

In order to test R against this particular use-case, I constructed a simple experiment.

### Build sample corpus of data

To run the experiment I needed a small batch of sample judgments. I selected nine recent judgments from BAILII: three from the Criminal Division of the Court of Appeal; three from the Family Court; and three from the Commercial Court.

### Success Factors

For R to be successful in the experiment, it would need to algorithmically classify the nine cases to the correct topic, i.e. the three criminal cases should be grouped together, as should the commercial cases, etc.

### Method

I’ll write up more detailed notes on the method and code used in this experiment, but essentially the following steps would be taken:

1. Load the nine judgments into R as text files
2. Pre-process the text files to remove unwanted material (like punctuation, numbers and standard stop words).
3. Analyse the most frequently occurring terms in the corpus of judgments and remove additional stop words that would be common across the entire dataset.
4. Apply a topic modelling algorithm (the Latent Dirichlet allocation (LDA) model) to the corpus of judgments to algorithmically allocate each of the judgments to one of three topics (criminal, family or commercial).
5. Match up the judgments to a topic
6. Produce a matrix of the key terms governing allocation into a topic
7. Produce a matrix detailing the respective probability for each case allocation to a topic.

### Results

The LDA algorithm did a decent job of allocating judgments to one of the three topics. First off, we can take a look at the key terms the model has used to allocate each case to a topic. The first column in the chart below, for example, sets out the terms viewed as most relevant to allocate a judgment to the family topic.

Family Commercial Criminal
1 children claus court
2 order claim evid
3 court polici appel
4 evid vote appeal
5 child period case

Now let's look at the actual allocation of judgment to topics:

V1
a.txt Fam
arc.txt Comm
canary.txt Comm
f.txt Fam
n.txt Fam
r_v_burke.txt Crim
r_v_garland.txt Crim

Each of the nine judgments has been allocated to the correct topic. We can also have a deeper look at the probability for each allocation.

Family Commercial Criminal
a.txt 0.66 0.13 0.21
arc.txt 0.06 0.90 0.04
canary.txt 0.05 0.91 0.04
f.txt 0.82 0.03 0.15
n.txt 0.86 0.03 0.11