Artboard 2

As someone involved in the ongoing development of an online legal research system (the ICLR's ICLR.3 platform), I spend quite a bit of time thinking about the ways in which unstructured or partially structured legal texts can be enriched and brought to order, either to prepare the text for later processing in a content delivery pipeline or for some other form of data analysis. 

More often than not, rendering a text amenable for content delivery or data analysis involves a fair amount of wrangling with the text itself to markup entities of interest and to apply an overall schematic structure to document.

Legal publishers, such as ICLR, Justis, LexisNexis and Thomson Reuters use industrial-strength proprietary tools and teams of people to wrangle unstructured legal material into a form that can be used in their products and services. However, the pool of individuals and companies interested in leveraging legal texts has exploded well beyond a handful of well-established legal publishers. 

In my opinion, the more people playing with legal information and sharing their work the better. So, I've started development on my very first open source project to produce a suite of tools, written in Python, that can be used to perform a wide range of legal text enrichment operations. I call the project Blackstone.


The idea behind Blackstone is relatively simple: it should be easier to perform a standard set of extraction and enrichment tasks without first having to write custom code to get the job done. The objective of the library is to provide a free set of tools that can be used to:

  • Automatically segment the input text into sentences and mark them up

  • Identify and markup references to primary and secondary legislation

  • Identify and markup references to case law

  • Identify and markup axioms (e.g. where the author of the text postulates that such and such is an "established principle of law" etc)

  • Identify other types of entities peculiar to legal writing, such as courts, indictment numbers

  • Produce document level metrics, providing an overview of the document's structure, characteristics and content

  • Generation of visualisations

  • Other stuff I haven't thought of yet

Crucially, Blackstone is not intended to be a standalone service. Rather, the intention is to provide a suite of ready-baked Python tools that can be used out of the box in other development or data science pipelines. 

As an open source library, Blackstone stands on the shoulders of world-class, open Python technologies: spaCy, scikit-learn, BeautifulSoup, pandas, requests and, of course, Python's own standard library. Blackstone couples intuitive high-level abstractions of these underlying technologies with custom built constructs designed specifically to deal with legal content.

Progress and horizon

The plan is to get an initial Beta release out on GitHub and PyPi by the end of September 2018. To date, the following progress has been made:

  • Function to provide high-level abstraction over spaCy sentence segmentation (testing)

  • Function to assemble comprehensive list of UK statutes (complete)

  • Function to detect and markup primary legislation by reference to short title (complete)

  • Function to detect and markup primary legislation by reference to abbreviation (e.g. DPA or DPA 1998) (testing)

  • Function to resolve oblique references to primary legislation (e.g. the 1998 Act) (developing).

Once I've got a baseline level of functionality completed, I'll release the code on GitHub. More updates to follow.

If you'd like to get involved, share an idea or give me some help, drop me a line on Twitter.

Using scikit-Learn on your own text data

Scikit-learn’s Working with Text Data provides a superb starting point for learning how to harness the power and ease of the sklearn framework for the construction of really powerful and accurate predictive models over text data. The only problem is that scikit-learn’s extensive documentation (and, be in no doubt, the documentation is phenomenal) doesn’t help much if you want to apply a cross-validated model on your own text data

At some point, you’re going to want to move away from experimenting with one of the built-in datasets (e.g. twentynewsgroups) and start doing data science on textual material you understand and care about. 

The purpose of this tutorial is to demonstrate the basic scaffold you need to build to apply the power of scikit-learn to your own text data. I’d recommend methodically working your way through the Working with Text Data tutorial before diving in here, but if you really want to get cracking, read on.

If you can't be bothered reading on and just want to see the code, it's in a repo on GitHub, here.


Before we start, let’s be clear about what we’re trying to do. We have a great big collection of text documents (ideally as plain text from the offing). Our documents are, to use the twentynewsgroups example, all news articles. The news articles have been grouped together, in directories, by their subject matter. We might have one subdirectory consisting of technology articles, called Technology. We might have another subdirectory consisting of articles about tennis, called Tennis

Our project directory might look like this (assume each subdirectory has 100 text documents inside):

news_articles \

The aim of the game is to use this data to train a classifier that is capable analysing a new, unlabelled article and determining which bucket to put it in (this is an article about food, this is an article about business, etc). 

What our code is going to do

We’re going to write some code, using scikit-learn, that does the following:

  • Loads our dataset of news articles and categorises those articles according to the name of the folder they live in (e.g. art, food, tennis)
  • Splits the dataset into two chunks: a chunk we’re going to use to train our classifier and another chunk that we’re going to use to test how good the classifier is
  • Converts the training data into a form the classifier can work with
  • Converts the test data into a form the classifier can work with
  • Builds a classifier 
  • Applies that classifier to our training data
  • Fires the test data into our trained classifier
  • Tells us how well the classifier did at predicting the right label (art, food, tennis etc) of the each document in the test dataset

1. Get the environment ready

The first job is to bring in everything we need from scikit-learn:

import sklearn
import numpy as np
from glob import glob
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline 

That’s the stage set in terms of bringing in our dependencies.

2. Set our categories

The next job is to state the names of the categories (our folders of grouped news articles) in a list. These need to exactly match the names of the subdirectories acting as the categorical buckets in your project directory.

categories = [‘art’, ‘business’, ‘culture’, ‘design’, ‘food’, ‘technology’, ‘tennis’, ‘war’]

This approach of manually setting the folder names works well if you only have a few categories or you’re just using a small sample of a larger set of categories. However, if you’ve got lots of category folders, manually entering them as list items is going to be a bore and will make your code very, very ugly (I'll write a separate blog post on a better way of dealing with this, or look at the repo on GitHub, which incorporates the solution to this problem).

3. Load the data

We’re now ready to load our data in:

docs_to_train = sklearn.datasets.load_files(“/path/to/the/project/folder/“, 
    description=None, categories=categories, 
    load_content=True, encoding='utf-8', shuffle=True, random_state=42)

All we’re doing here is saying that our dataset, docs_to_train, consist of the files contained within all of the subdirectories to the path specified inside the .load_files function and that the categories are the categories set out in our categories list (see above). Forget about the other stuff in there for now.

4. Split the dataset we’ve just loaded into a training set and a test set

This is where the real work begins. We’re going to use the entire dataset, docs_to_train, to both train and test our classifier. For this reason, we’ve got to split the dataset into two chunks: one chunk for training and another chunk (that the classifier won’t get to look at in training) for testing. We’re going to “hold out” 40% of the the dataset for testing:

X_train, X_test, y_train, y_test = train_test_split(,, test_size=0.4)

It’s really important to understand what this line of code is doing. 

First, we’re creating four new objects, X_train, X_test, y_train and y_test. The X objects are going to hold our data, the content of the text files. We’ve got one X object, X_train, and that will hold the text file data we’ll use to train the classifier. We have another X object, X_test, and that will hold the text file data we’ll use to test the classifier. The Xs are the data.

The we have the Ys. The Y objects hold the category names (art, culture, war etc). y_train will hold the category names that correspond to the text data in X_train. y_test will hold the category names category names that correspond to the text data in X_test. The y value are the targets. 

Finally, we’re using test_size=0.4 to say that out of all the data in docs_to_train we want 40% to be held out for the test data in X_test and y_test.

5. Transform the training data into a form the classifier can work with

Our classifier uses mathematics to determine whether Document X belongs in bucket A, B, or C. The classifier therefore expects numeric data rather than text data. This means we’ve got to take our text training data, stored in X_train, and transform it into a form our classifier can work with. 

count_vect = CountVectorizer(stop_words='english')

X_train_counts = count_vect.fit_transform(raw_documents=X_train)

These two lines are doing are a lot of heavy lifting and I would strongly urge you to go back to the Working with Text Data tutorial to fully understand what’s going on here.

The first thing we’re doing is setting up a vectoriser, CountVectorizer(). This is a function that will count the number of times each word in the dataset occurs and project that count into a vector. 

Then, we take that vector and apply it to the training data stored in X_train. We store those occurrence vectors in X_train_counts.

Once that’s done we move on to the clever transformation bit. We’re going to take the occurrence counts, stored in X_train_counts, and transform them into a term frequency inverse document frequency value. 

tfidf_transformer = TfidfTransformer(use_idf=True)

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

Why are we doing this? Well, if you think about it, the documents in your training set will naturally vary in word length; some are going to be long, others are going to be short. Longer documents have more words in them and that’s going to result in a higher word count for each word. That’s going to skew your results. What we really want to do is get get a sense of the count of each word proportionate to the number of words in the document. Tf-idf (term frequency inverse document frequency achieves this). 

6. Transform the test data into a form the classifier can work with

Since we’ve gone to the trouble of splitting the dataset into a training set and a test set, we also need to transform our test data in exactly the same way as we just did with the training set. All we’re doing here is mirroring the transformation process we just applied to X_train onto X_test. 

count_vect = CountVectorizer(stop_words='english')
X_test_counts = count_vect.fit_transform(raw_documents=X_test)

tfidf_transformer = TfidfTransformer(use_idf=True)
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)

7. Scikit-learn gives us a far better way to deal with these transformations: pipelines!

It was worth reading about the transformation process, because if you’re working with text data and trying to do science with it you really do need to at least see why and how that text is transformed into a numerical form a predictive classifier can deal with. 

However, scikit-learn actually gives us a far more efficient way (in terms of lines of code) to deal with the transformations — it’s called a pipeline. The pipeline is this example has three phases. The first creates the vectoriser — the machine used to turn our text into numbers — a count of occurrences. The second phase deals with transforming the crude vectorisation handled in the first into a frequency-based representation of the data — the term frequency inverse document frequency. Finally, and most excitingly, the third phase of the pipeline sets up the classifier — the machine that’s going to train the model. 

Here’s the pipeline code:

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, 

For now, don’t worry about the parameters set out in the classifier, just appreciate the structure and content of the pipeline. 

8. Deploy the pipeline and train the model

Now it’s time to train our model by applying the pipeline we’ve just built to our training data. All we’re doing here is taking our training data (X_train) and the corresponding training labels (y_train) and passing them into the fit function that comes built into the pipeline., y_train)

Depending on how big your dataset is, this could take a few minutes or a bit longer. 

As a sidenote, on this point, you might have noticed that I set the verbose parameter in the classifier as 1. This is purely so I can see that the classifier is running and that the script isn't hanging because I’m chewing through memory. 

9. Test the model we’ve just trained using the test data

We’ve now trained our model on the training data. It’s time to see how well trained the model really is by letting it loose on our test data. What we’re going to do is take the test data, X_test, let the model evaluate it based on what it learned from being fed the training data (X_train) and the training labels (y_train) and see what categories the model predicts the test data belongs to. 

predicted = text_clf.predict(X_test)

We can measure the model’s accuracy by taking the mean of the classifier’s predictive accuracy like so:

print (np.mean(predicted == y_test))

Better yet, we can use scikit-learn’s built in metrics library to give us some detailed performance statistics by category (i.e. how well did the classifier at predicting that an article about “design” is an article about “design”.

print(metrics.classification_report(y_test, predicted, 

The metrics will provide you with a precision score for each category and an overall average of the performance of the model between 0 and 1. 

The closer that average score gets to 1, the better the model will perform. Mind you, if your model is averaging a score of 1 on the nose, something has gone wrong!   

Rapid Keyword Extraction of Donoghue v Stevenson

Sometimes it would be really handy to be able to quickly and accurately extract keywords from a large corpus of documents. It is quite easy to foresee such a use-case arising in legal publishing, for example. 

RAKE (Rapid Keyword Extraction), is a Python natural language processing module that goes a long way in dealing with this use-case. 

I was interested in putting RAKE to the test and thought I'd pit the algorithm against what is perhaps to most well known piece of case law in the common law world: Donoghue v Stevenson (of snail and ginger beer fame). 

What follows is the basic "working out" of the code and the results of the first pass. For anyone interested in replicating this experiment or doing some keyword extraction of their own, see this excellent tutorial - you'll see that my own code follows it closely.


import rake 
import operator


rake_object = rake.Rake("smartstoplist.txt", 5, 5, 7)

This line of code does the following:

  • Creates a RAKE object that extracts keywords where (i) each word has at least 5 characters; (ii) each phrase must have at least 5 words; and (iii) each keyword must appear in the text at least 7 times
  • Hits the text file with a list of stop words to remove textual noise


Now we open the text file (in this test, I've saved the judgment in Donoghue as a text file) and save it in a variable:

judgment = open("dono.txt","r") 
text =


Now we're ready to run RAKE over the text to get the keywords:

keywords = 
print (keywords)


The following keywords (along with their scores) were returned:

[('give rise', 4.300000000000001), ('common law', 4.184313725490196), ('duty owed', 4.154061624649859), ('ordinary care', 4.115278543849972), ('reasonable care', 4.093482554312047), ('skivington lr 5', 4.050000000000001), ('lake & elliot', 4.0), ('pender 11 qb', 3.966666666666667), ('present case', 3.7993197278911564), ('defective', 1.7619047619047619), ('present', 1.7380952380952381), ('principles', 1.7333333333333334), ('dangerous', 1.6491228070175439), ('exercise', 1.588235294117647), ('cases', 1.5875), ('bottles', 1.5833333333333333), ('liability', 1.5789473684210527), ('relationship', 1.5555555555555556), ('court', 1.5365853658536586), ('supplying', 1.5), ('appears', 1.4761904761904763), ('principle', 1.4736842105263157), ('allowed', 1.4545454545454546), ('party', 1.4375), ('nature', 1.4210526315789473), ('warranty', 1.4166666666666667), ('goods', 1.4090909090909092), ('thing', 1.4090909090909092), ('articles', 1.4), ('condition', 1.4), ('appellant', 1.3953488372093024), ('injured', 1.3863636363636365), ('alleged', 1.375), ('bought', 1.3636363636363635), ('stated', 1.3636363636363635), ('examination', 1.3636363636363635), ('opportunity', 1.3636363636363635), ('appeal', 1.3333333333333333), ('support', 1.3333333333333333), ('defect', 1.3333333333333333), ('decided', 1.3333333333333333), ('relation', 1.3333333333333333), ('bottle', 1.3225806451612903), ('matter', 1.3125), ('authorities', 1.3125), ('injury', 1.3076923076923077), ('carelessness', 1.3076923076923077), ('judgment', 1.3055555555555556), ('proposition', 1.3043478260869565), ('recover', 1.3), ('referred', 1.3), ('circumstances', 1.2972972972972974), ('supplied', 1.2857142857142858), ('found', 1.2857142857142858), ('based', 1.2777777777777777), ('defendant', 1.2666666666666666), ('liable', 1.263157894736842), ('article', 1.26), ('manufactured', 1.25), ('lordships', 1.25), ('danger', 1.25), ('means', 1.25), ('poison', 1.25), ('inspection', 1.2307692307692308), ('purchaser', 1.2272727272727273), ('george', 1.2272727272727273), ('person', 1.2222222222222223), ('courts', 1.2222222222222223), ('house', 1.2105263157894737), ('plaintiff', 1.2096774193548387), ('chattel', 1.2), ('decision', 1.1935483870967742), ('entitled', 1.1818181818181819), ('authority', 1.1666666666666667), ('vendor', 1.1666666666666667), ('dicta', 1.1666666666666667), ('premises', 1.1538461538461537), ('repair', 1.1538461538461537), ('question', 1.1515151515151516), ('pursuer', 1.1428571428571428), ('manufacturer', 1.1384615384615384), ('facts', 1.1333333333333333), ('persons', 1.1333333333333333), ('subject', 1.125), ('class', 1.125), ('scotland', 1.125), ('evidence', 1.125), ('manufacturers', 1.125), ('defender', 1.125), ('contents', 1.1176470588235294), ('words', 1.1), ('longmeid', 1.1), ('holliday 6', 1.1), ('exist', 1.1), ('consequence', 1.1), ('negligence', 1.0985915492957747), ('contract', 1.0918367346938775), ('difficult', 1.0833333333333333), ('proved', 1.0833333333333333), ('respect', 1.0833333333333333), ('respondent', 1.08), ('consumer', 1.0789473684210527), ('proof', 1.0714285714285714), ('regard', 1.0714285714285714), ('manufacture', 1.0666666666666667), ('knowledge', 1.0666666666666667), ('england', 1.0588235294117647), ('langridge', 1.0555555555555556), ('action', 1.0476190476190477), ('opinion', 1.0357142857142858), ('lords', 1.0), ('ginger', 1.0), ('retailer', 1.0), ('result', 1.0), ('neglect', 1.0), ('division', 1.0), ('ground', 1.0), ('fraud', 1.0), ('judgments', 1.0), ('parke', 1.0), ('levy 2', 1.0), ('winterbottom', 1.0), ('wright 10', 1.0), ('stranger', 1.0), ('coach', 1.0), ('reason', 1.0), ('blacker', 1.0), ('breach', 1.0), ('skill', 1.0), ('parties', 1.0), ('brett', 1.0), ('heaven', 1.0), ('point', 1.0), ('treated', 1.0), ('property', 1.0), ('purpose', 1.0), ('thought', 1.0), ('existence', 1.0), ('pointed', 1.0), ('argument', 1.0), ('defendants', 1.0), ('hamilton', 1.0), ('contention', 1.0), ('mullen', 1.0), ('barr &', 1.0), ('defenders', 1.0), ('members', 1.0), ('remote', 1.0), ('bridge', 1.0)]

I was fairly chuffed with these results given it was the first attempt. The key seems to be getting the right balance of parameters when setting the object up. But, it's good to see terms like duty owed and reasonable care appearing at the top of the results. 

It definitely needs some fine tuning and probably an expansion of the stop list, but it's a good start.