Like many others who work in the information/publishing sector, I have developed a keen interest in learning how to use machine learning and text mining technology to enhance the information I work on (in my working context, the information is case law).
Over the Christmas break I started to experiment with a statistical programming language called R, which has a decent suite of text mining functionality available right out of the box.
I wanted to put R to use to tackle a simple and practical question: is it possible to accurately classify a judgment algorithmically with relative ease?
In order to test R against this particular use-case, I constructed a simple experiment.
Build sample corpus of data
To run the experiment I needed a small batch of sample judgments. I selected nine recent judgments from BAILII: three from the Criminal Division of the Court of Appeal; three from the Family Court; and three from the Commercial Court.
For R to be successful in the experiment, it would need to algorithmically classify the nine cases into the correct topics: the three criminal cases should be grouped together, as should the family cases and the commercial cases.
I’ll write up more detailed notes on the method and code used in this experiment, but essentially the following steps would be taken:
- Load the nine judgments into R as text files
- Pre-process the text files to remove unwanted material (punctuation, numbers and standard stop words)
- Analyse the most frequently occurring terms in the corpus of judgments and remove additional stop words that are common across the entire dataset
- Apply a topic modelling algorithm, Latent Dirichlet allocation (LDA), to the corpus to algorithmically allocate each judgment to one of three topics (criminal, family or commercial)
- Match each judgment to a topic
- Produce a matrix of the key terms governing allocation to a topic
- Produce a matrix detailing the probability of each case's allocation to a topic
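Sketched in R, the steps above might look something like the following. This is a minimal, illustrative version using the tm and topicmodels packages; the three toy documents are stand-ins for the nine judgments (in the real experiment these would be loaded from text files, e.g. with `DirSource`), and the seed value is arbitrary.

```r
library(tm)          # corpus handling and pre-processing
library(topicmodels) # LDA implementation

# Toy stand-ins for the nine judgments; a real corpus would be loaded
# from disk, e.g. VCorpus(DirSource("judgments/"))
docs <- c("defendant convicted sentence appeal crime trial",
          "child custody parent family welfare residence",
          "contract breach damages commercial claim dispute")
corpus <- VCorpus(VectorSource(docs))

# Pre-process: lower-case, strip punctuation and numbers, drop stop words
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)

# Build a document-term matrix and fit an LDA model with three topics
dtm <- DocumentTermMatrix(corpus)
lda <- LDA(dtm, k = 3, control = list(seed = 1234))
```

Any additional corpus-specific stop words identified from the frequent-terms analysis could be dropped with a further `removeWords` call before building the document-term matrix.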
The LDA algorithm did a decent job of allocating the judgments to the three topics. First off, we can take a look at the key terms the model used to allocate each case to a topic. The first column in the chart below, for example, sets out the terms viewed as most relevant when allocating a judgment to the family topic.
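In the topicmodels package, that terms-by-topics view comes from the `terms()` function. A small self-contained illustration (again with toy documents rather than the real judgments):

```r
library(tm)
library(topicmodels)

# Toy corpus standing in for the nine judgments
docs <- c("defendant convicted sentence appeal crime trial",
          "child custody parent family welfare residence",
          "contract breach damages commercial claim dispute")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
lda <- LDA(dtm, k = 3, control = list(seed = 1234))

# Top five terms per topic, returned as a terms-by-topics character matrix
terms(lda, 5)
```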
Now let's look at the actual allocation of judgments to topics:
Each of the nine judgments has been allocated to the correct topic. We can also take a deeper look at the probabilities behind each allocation.
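With topicmodels, the allocations and the probabilities behind them come from `topics()` and `posterior()` respectively; a toy corpus stands in for the real judgments here:

```r
library(tm)
library(topicmodels)

# Toy corpus standing in for the nine judgments
docs <- c("defendant convicted sentence appeal crime trial",
          "child custody parent family welfare residence",
          "contract breach damages commercial claim dispute")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
lda <- LDA(dtm, k = 3, control = list(seed = 1234))

topics(lda)                     # most probable topic for each document
probs <- posterior(lda)$topics  # document-by-topic probability matrix
round(probs, 3)
```

Each row of the posterior matrix sums to one, so you can read off how confidently (or otherwise) the model assigned each judgment.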
This experiment shows that it is possible to use a language like R to accurately fit a topic model onto the text of judgments. However, there are two obvious limitations associated with this particular implementation of text classification.
First, the number of topics needs to be defined a priori. That's absolutely fine where you have an idea of the number of topics in advance of the modelling (as in the case of this experiment), but if you don't know how many topics there are in the corpus, you'll probably have to run the model more than once and experiment with the number of topics.
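One rough way to experiment with the number of topics is to refit the model for a range of candidate values of k and compare the perplexity of each fit (lower is generally better). A sketch with toy data follows; on a real corpus you would want to score held-out documents rather than the training set:

```r
library(tm)
library(topicmodels)

# Toy corpus standing in for the nine judgments
docs <- c("defendant convicted sentence appeal crime trial",
          "child custody parent family welfare residence",
          "contract breach damages commercial claim dispute")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Refit for each candidate number of topics and record perplexity
ks <- 2:4
perp <- sapply(ks, function(k) {
  perplexity(LDA(dtm, k = k, control = list(seed = 1234)))
})
best_k <- ks[which.min(perp)]
```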
The second issue is one of scalability. This experiment used a small corpus (only nine judgments). The larger and more varied the corpus, the more heavy lifting is required to stage the data well.
Regardless of these limitations, the fantastic thing about this modelling approach is the ease with which it can be deployed.