Last week I wrote a post about a machine learning classification problem I’ve been working on and how I’m using a visual method, tSNE projection, to help improve the classifier’s accuracy.
The classification problem, in general terms, runs along these lines. It is normally useful, when working with large bodies of text, to be able to assign a topic to each document in the corpus. Large bodies of judgments are no exception — we want to be able to say, “this case is about negligence” or “this case is about contract” or “this is a defamation case” and so on.
Traditionally, this classification problem has been performed by people who are well trained in the legal domain. A new judgment comes in, someone reads it and then assigns it to some branch in an established taxonomy.
One of the problems we encounter today is the sheer volume of material coming through. To perform the classification job manually would require a fair bit of skilled human resource. And, in many cases, the effort of classifying any given judgment in this way is potentially a waste of that resource, because most judgments don’t disclose much of major interest to the development of the law (that isn’t to say they shouldn't be made available, though).
So, against that background, we may decide that we want to devise an automated process of assigning new judgments to a branch of the taxonomy. We’ll build a model trained on the existing taxonomy and the judgments indexed in that taxonomy, and new judgments can then be analysed against that model to determine the branch they belong to.
Some classification models do binary classification: is this thing “this” or “that”? The classic example of a binary classification model is the model in an email client that marks incoming emails as “Spam” or “Ham” (ham being a genuine email). In contrast, the classification model we’d need to train to assign a legal topic to a judgment is a multi-class classification model — we have multiple topics (the classes) to deal with.
The classifier I’m working on uses a proprietary dataset derived from cases reported by the ICLR. The initial training set was built by:
Amassing lots and lots of case reports in XML format
Extracting the text of the judgment (bits like the headnote were deliberately excluded because they wouldn’t be present in the judgments to be classified)
Moving each extracted judgment to a folder named according to the highest taxonomy term in the catchwords of the original report (so, a case with catchwords like “CRIME — Evidence — Hearsay” would live in a folder called “Crime”).
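The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the XML element names (`catchwords`, `judgment`) are placeholders, since the real ICLR report schema will differ.

```python
from pathlib import Path
from xml.etree import ElementTree as ET

def file_judgment(xml_path: str, out_root: str) -> Path:
    """Extract the judgment text from a report XML file and save it
    under a folder named after the highest catchword term.

    The element names used here are illustrative only.
    """
    root = ET.parse(xml_path).getroot()

    # First catchword term, e.g. "CRIME — Evidence — Hearsay" -> "Crime"
    catchwords = root.findtext("catchwords", default="Unclassified")
    top_term = catchwords.split("—")[0].strip().title()

    judgment_text = root.findtext("judgment", default="")

    # One folder per class, one text file per judgment
    target_dir = Path(out_root) / top_term
    target_dir.mkdir(parents=True, exist_ok=True)
    target_path = target_dir / (Path(xml_path).stem + ".txt")
    target_path.write_text(judgment_text, encoding="utf-8")
    return target_path
```

Run over the whole collection of reports, this produces the folder-per-class layout described above, ready to be read in as a labelled training set.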
All in all, I ended up with approximately 240 classes. The problem is that the taxonomy from which these classes were derived wasn’t designed for use in a multi-class classifier (MCC). It works great in the setting it was designed for, where a human is interpreting and applying the taxonomy, but for a machine learning problem, it presents a number of challenges that need to be tackled before it’s used for training the classifier.
For an MCC to be effective at making accurate predictions, the target classes need to be mutually exclusive. In other words, the instances in which one class overlaps with another need to be as close to zero as possible.
There is a lot of overlap in the training set I’m working with. Some of it is easy to deal with at a glance: the class “European Union” can be merged with “European Community”, “Ecclesiastical” can be merged with “Ecclesiastical Law”. That stuff is pretty painless.
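The easy merges can be handled with a simple alias table mapping each near-duplicate class name onto a canonical label. The aliases below are the two examples from this post; a real run would build the full mapping after eyeballing the whole class list.

```python
# Map near-duplicate class names onto a single canonical label.
CLASS_ALIASES = {
    "European Community": "European Union",
    "Ecclesiastical Law": "Ecclesiastical",
}

def canonical_class(label: str) -> str:
    """Return the canonical class name, or the label unchanged."""
    return CLASS_ALIASES.get(label, label)

labels = ["European Community", "Crime", "Ecclesiastical Law", "European Union"]
merged = [canonical_class(label) for label in labels]
```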
Things get tricky when even after we’ve merged and pruned the classes, large areas of overlap remain. What we really need is a way to get sight of how the documents in the training set, rather than the classes, overlap.
We load our corpus of text data (in my case, the judgments)
We convert the text in the corpus into numerical form (machine learning is basically about maths, and we need numbers, not text, to do that maths). This process is called vectorisation.
The vectorisation process produces a matrix where there is a column for each unique word in the corpus (in machine learning lingo, this is called a “feature”) and each row represents a document.
The resulting matrix is high-dimensional, which means it needs to be decomposed into a two- or three-dimensional space, making it suitable for plotting on a chart whilst preserving as much of the structure in the data as possible.
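The decomposition step can be sketched with scikit-learn’s `TSNE`. The parameter values here are illustrative (in particular, `perplexity` must be smaller than the number of samples, so it is set very low for this toy corpus; real corpora typically use values between 5 and 50).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

corpus = [
    "negligence duty of care breach damages",
    "contract offer acceptance consideration breach",
    "hearsay evidence admissibility criminal trial",
    "sale of goods implied terms merchantable quality",
    "occupiers liability visitor premises danger",
    "defamation libel publication serious harm",
]

# Vectorise, then densify for tSNE.
X = TfidfVectorizer().fit_transform(corpus).toarray()

# Collapse the high-dimensional document vectors into 2-D for plotting.
tsne = TSNE(n_components=2, perplexity=2, init="random", random_state=42)
coords = tsne.fit_transform(X)
print(coords.shape)
```

Each row of `coords` is now an (x, y) point for one document, ready to be joined back to its class label and plotted.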
A static plot of the projection is really handy for a quick look at the data, but the lack of interactivity makes it hard to extract much more than a high-level view of how the documents are clustering.
So, I decided to get my hands dirty and build the same projection into a more interactive plot using Altair. Here it is (hover, pan and zoom).
The projections in this blog post are based on a fragment of the overall training set. But they help demonstrate why tSNE is useful. The goal is to identify areas where different classes occupy similar vector spaces and cluster together; to get to a point where each class forms a discrete cluster.
If you hover around the chart, you’ll see that “Sale of Goods” has formed a tight cluster within the larger “Contract” cluster. This makes sense, because cases involving sale of goods issues tend to involve concepts that arise in contract law. Similarly, “Occupiers’ Liability” has formed a cluster just outside the much larger “Negligence” cluster, which again makes sense.
The projection also makes clear that the classes aren’t well balanced: the “Crime” class contains many more samples (documents) than the other classes do. That needs to be fixed before training.
Sometimes, we just need to get a good look at the data.