Visually Clustering Case Law (Part 2)

Last week I wrote a post about a machine learning classification problem I’ve been working on and how I’m using a visual method, tSNE projection, to help improve the classifier’s accuracy.

The classification problem, in general terms, runs along these lines. It is normally useful, when working with large bodies of text, to be able to assign a topic to each document in the corpus. Large bodies of judgments are no exception — we want to be able to say, “this case is about negligence” or “this case is about contract” or “this is a defamation case” and so on.

Traditionally, this classification problem has been performed by people who are well trained in the legal domain. A new judgment comes in, someone reads it and then assigns it to some branch in an established taxonomy.

One of the problems we encounter today is the sheer volume of material coming through. To perform the classification job manually would require a fair bit of skilled human resource. And, in many cases, the effort of classifying any given judgment in this way is potentially a waste of that resource, because most judgments don’t disclose much of major interest to the development of the law (that isn’t to say they shouldn't be made available, though).

So, against that background, we may decide that we want to devise an automated process of assigning new judgments to a branch of the taxonomy. We’ll build a model trained on the existing taxonomy and the judgments indexed in that taxonomy and new judgments can be analysed against that model to determine the branch it belongs to.

Some classification models do binary classification: is this thing “this” or “that”? The classic example of a binary classification model is the model in an email client that marks incoming emails as “Spam” or “Ham” (ham being a genuine email). In contrast, the classification model we’d need to train to assign a legal topic to a judgment is a multi-class classification model — we have multiple topics (the classes) to deal with.

[Diagram: a judgment is passed to the model, which assigns it a topic such as Contract, Crime or Defamation]

The classifier I’m working on uses a proprietary dataset derived from cases reported by the ICLR. The initial training set was built by:

  • Amassing lots and lots of case reports in XML format

  • Extracting the text of the judgment (bits like the headnote were deliberately excluded because they wouldn’t be present in the judgments to be classified)

  • Moving each extracted judgment to a folder named according to the highest taxonomy term in the catchwords of the original report (so, a case with catchwords like “CRIME — Evidence — Hearsay” would live in a folder called “Crime”).

All in all, I ended up with approximately 240 classes. The problem is that the taxonomy from which these classes were derived wasn’t designed for use in a multi-class classifier (MCC). It works great in the setting it was designed for, where a human is interpreting and applying the taxonomy, but for a machine learning problem it presents a number of challenges that need to be tackled before it’s used for training the classifier.

For an MCC to be effective at making accurate predictions, the target classes need to be mutually exclusive. In other words, the instances in which one class overlaps with another need to be as close to zero as possible.

There is a lot of overlap in the training set I’m working with. Some of it is easy to deal with at a glance: the class “European Union” can be merged with “European Community”, “Ecclesiastical” can be merged with “Ecclesiastical Law”. That stuff is pretty painless.

Things get tricky when even after we’ve merged and pruned the classes, large areas of overlap remain. What we really need is a way to get sight of how the documents in the training set, rather than the classes, overlap.

The previous post I did on this subject was about using a technique called tSNE (t-distributed stochastic neighbour embedding) projection. In ridiculously oversimplified terms, tSNE works like this:

  1. We load our corpus of text data (in my case, the judgments)

  2. We convert the text in the corpus into numerical form (machine learning is basically maths, and we need numbers, not text, to do that maths). This process is called vectorisation.

  3. The vectorisation process produces a matrix where there is a column for each unique word in the corpus (in machine learning lingo, this is called a “feature”) and each row represents a document.

  4. The resulting matrix is high-dimensional, so it needs to be decomposed into a two- or three-dimensional space, making it suitable for plotting on a chart whilst preserving as much of the structure in the data as possible.
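For anyone who wants to see roughly what those steps look like in code, here’s a minimal sketch using scikit-learn. The tiny corpus and labels are placeholders; the real training set obviously looks nothing like this.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Toy stand-in for the real corpus: a list of judgment texts and a parallel list of labels
corpus = [
    "the claimant alleged breach of contract and sought damages",
    "the defendant was convicted of theft following a trial",
    "the appellant challenged the sentence imposed for burglary",
    "the parties disputed the terms of the sale agreement",
    "the publication was alleged to be defamatory of the claimant",
]
labels = ["Contract", "Crime", "Crime", "Contract", "Defamation"]

# Steps 2 and 3: vectorise the text into a document-term matrix (one column per feature)
vectors = TfidfVectorizer(use_idf=True).fit_transform(corpus)

# Step 4: decompose the high-dimensional matrix into two dimensions for plotting
projection = TSNE(n_components=2, perplexity=2, random_state=42).fit_transform(vectors.toarray())
print(projection.shape)  # one (x, y) pair per document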

A Python package called Yellowbrick has an excellent tSNE feature that spins up a projection using matplotlib.

[Image: tSNE projection generated with Yellowbrick]

This is really handy for a quick look at the data, but the lack of interactivity makes it hard to extract much more than a high-level view of how the documents are clustering.

So, I decided to get my hands dirty and build the same projection into a more interactive plot using Altair. Here it is (hover, pan and zoom).
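For the curious, the Altair side of things amounts to little more than a scatter plot over the two-dimensional projection. A rough sketch follows; the data frame and column names are illustrative rather than the exact ones I used.

import altair as alt
import numpy as np
import pandas as pd

# Illustrative data frame: one row per document, with its 2D tSNE coordinates and topic label
df = pd.DataFrame({
    "x": np.random.randn(100),
    "y": np.random.randn(100),
    "topic": np.random.choice(["Contract", "Crime", "Negligence"], 100),
})

chart = (
    alt.Chart(df)
    .mark_circle(size=60)
    .encode(x="x", y="y", color="topic", tooltip=["topic"])
    .interactive()  # gives the pan and zoom behaviour; tooltip handles the hover
)
chart.save("tsne_projection.html")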

The projections in this blog post are based on a fragment of the overall training set. But they help demonstrate why tSNE is useful. The goal is to identify areas where different classes occupy similar vector spaces and cluster together; to get to a point where each class forms a discrete cluster.

If you hover around the chart, you’ll see that “Sale of Goods” has formed a tight cluster within the larger “Contract” cluster. This makes sense, because cases involving sale of goods issues tend to involve concepts that arise in contract law. Similarly, “Occupiers’ Liability” has formed a cluster just outside the much larger “Negligence” cluster, which again makes sense.

The projection also makes clear that the classes aren’t well balanced: the “Crime” class contains many more samples (documents) than the other classes. That needs to be fixed before training.
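One crude way of trimming that imbalance before training, assuming the documents and their labels sit in two parallel lists, would be to randomly downsample the over-represented classes. This is a sketch rather than the final balancing strategy:

import random

def downsample(data, labels, cap):
    """Randomly drop samples from any class that exceeds `cap` documents."""
    by_class = {}
    for text, label in zip(data, labels):
        by_class.setdefault(label, []).append(text)

    balanced_data, balanced_labels = [], []
    for label, texts in by_class.items():
        keep = random.sample(texts, cap) if len(texts) > cap else texts
        balanced_data.extend(keep)
        balanced_labels.extend([label] * len(keep))
    return balanced_data, balanced_labels

# e.g. cap every class at 500 documents before training
# data, labels = downsample(data, labels, cap=500)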

Sometimes, we just need to get a good look at the data.

Visually clustering case law

I’ve been experimenting with a Python package called Yellowbrick, which provides a suite of visualisers built for gaining an insight into a dataset when working on machine learning problems.

One of my side projects at the moment is looking at ways in which an unseen case law judgment can be processed to determine its high level subject matter (e.g. Crime, Tort, Human Rights, Occupiers’ Liability etc) with text classification.

I did a quick experiment with Yellowbrick to visualise a small portion of the dataset I’m going to use to train a classifier. And here it is:

tSNE projection of 14 case law topic clusters using Yellowbrick

The chart above is called a tSNE (t-distributed stochastic neighbour embedding) projection. Each blob on the chart represents a judgment.

I was pretty pleased with this and several quick insights surfaced:

  • Crime may be a bit overrepresented in my training data, so I’ll need to cut it back a bit

  • Some of the labels in the data may need a bit of merging, for example “false imprisonment” can probably be handled by the “crime” data

  • There are a couple of interesting sub-clusters within the crime data (I’m guessing one of the clusters will be evidence and the other sentencing)

  • Human Rights as a topic sits right in the middle of the field between the crime cluster and the clusters of non-criminal topics

The code

Dependencies

from yellowbrick.text import TSNEVisualizer
from yellowbrick.style import set_palette
from sklearn.feature_extraction.text import TfidfVectorizer
import os

from sklearn.datasets.base import Bunch
from tqdm import tqdm
import matplotlib.pyplot as plt

set_palette('paired')

Load the corpus

def load_corpus(path):
    """
    Loads and wrangles the passed in text corpus by path.
    """

    # Check if the data exists, otherwise download or raise
    if not os.path.exists(path):
        raise ValueError((
            "'{}' dataset has not been downloaded, "
            "use the yellowbrick.download module to fetch datasets"
        ).format(path))

    # Read the directories in the directory as the categories.
    categories = [
        cat for cat in os.listdir(path)
        if os.path.isdir(os.path.join(path, cat))
    ]

    files  = [] # holds the file names relative to the root
    data   = [] # holds the text read from the file
    target = [] # holds the string of the category

    # Load the data from the files in the corpus
    for cat in categories:
        for name in os.listdir(os.path.join(path, cat)):
            files.append(os.path.join(path, cat, name))
            target.append(cat)

            with open(os.path.join(path, cat, name), 'r') as f:
                data.append(f.read())


    # Return the data bunch for use similar to the newsgroups example
    return Bunch(
        categories=categories,
        files=files,
        data=data,
        target=target,
    )

corpus = load_corpus('/Users/danielhoadley/Desktop/common_law_subset')

Vectorise and transform the data

tfidf  = TfidfVectorizer(use_idf=True)
docs   = tfidf.fit_transform(corpus.data)
labels = corpus.target

Generate the visualisation

tsne = TSNEVisualizer(size=(1080, 720), title="Case law clusters")
tsne.fit(docs, labels)
tsne.poof()




Case law network graph

I recently posted a few images of a network graph I built with Neo4j depicting the connections between English cases. This article serves as a quick write-up on how the graph database and the visualisations were produced.


Data

The data driving the network graph was derived from a subset of XML versions of cases reported by the Incorporated Council of Law Reporting for England and Wales. I used a simple Python script to iterate over the files and capture (a) the citation (e.g. [2010] 1 WLR 1) associated with the file (the source); and (b) all of the citations to other cases within that file (each outward citation from the source being a target). This was pulled into CSV format, like so:

Source,Target
[2015] 1 WLR 3238,[2015] AC 129
[2015] 1 WLR 3238,[2013] 1 WLR 366
[2015] 1 WLR 3238,[2011] 1 WLR 980

In the snippet of data above, [2015] 1 WLR 3238 can be seen to have CITED three cases, [2015] AC 129, [2013] 1 WLR 366 and [2011] 1 WLR 980. Moreover, [2015] AC 129 can be seen to have been CITED_BY [2015] 1 WLR 3238.
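The extraction script itself was nothing fancy. A rough sketch of the approach follows; the citation pattern is deliberately simplified and the directory path is a placeholder.

import csv
import glob
import re

# Deliberately simplified pattern for citations like "[2015] 1 WLR 3238" or "[2015] AC 129"
CITATION = re.compile(r"\[\d{4}\]\s(?:\d\s)?[A-Z][A-Za-z]*(?:\s[A-Za-z]+)*\s\d+")

with open("citings.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Source", "Target"])
    for path in glob.glob("reports/*.xml"):  # placeholder directory of XML case reports
        text = open(path, encoding="utf-8").read()
        citations = CITATION.findall(text)
        if not citations:
            continue
        source, targets = citations[0], citations[1:]  # treat the first citation as the source
        for target in targets:
            writer.writerow([source, target])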

Importing the data into Neo4J

The data was imported into Neo4j with the following Cypher query:

USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM "file:///citings.csv" AS row
MERGE (c:Case {Name:toString(row.Source)})
MERGE (d:Case {Name:toString(row.Target)})
MERGE (c) -[:CITED]-> (d)
MERGE (d) -[:CITED_BY]-> (c)

The query above is a standard import query that created a node (:Case) for each unique citation in the source data and then constructed two relationships, :CITED and :CITED_BY, between the nodes where those relationships existed.

View of a small portion of the graph from the Neo4j browser

Calculating the transitive importance of the cases in the graph

With the graph pretty much built, I wanted to get a sense of the most important cases in the graph, and the PageRank algorithm was used to achieve this:

CALL algo.pageRank('Case', 'CITED_BY',{write: true, writeProperty:'pagerank'})

This stored each case's PageRank as a property, pagerank, on the case node.

It was then possible to identify the ten most important cases in the network by running:

MATCH (c:Case) 
RETURN c.Name, c.pagerank 
ORDER BY c.pagerank DESC LIMIT 10

Which returned:

c.Name,c.pagerank
[2014] 3 WLR 535,15.561027
[2016] Bus LR 1337,13.3335
[2009] 3 WLR 369,11.5683645
[2000] 1 WLR 2068,11.149255000000002
[2009] 3 WLR 351,10.952590499999998
[1996] 1 WLR 1460,10.657869999999999
[2002] 2 WLR 578,9.848398000000001
[2000] 3 WLR 1855,9.2526755
[2005] 1 WLR 2668,8.36525
[2005] 3 WLR 1320,7.990162000000001

Visualising the graph

To render the graph in the browser, I used neovis.js. The code for the browser render:

<html>
    <head>
        <title>DataViz</title>
        <style type="text/css">
            body {font-family: 'Gotham' !important}
            #viz {
                width: 900px;
                height: 700px;
            }
        </style>
        <script src="https://rawgit.com/neo4j-contrib/neovis.js/master/dist/neovis.js"></script>
    </head>   
    <script>
        function draw() {
            var config = {
                container_id: "viz",
                server_url: "bolt://localhost:7687",
                server_user: "beans",
                server_password: "sausages",
                labels: {
                    "Case": {
                        caption: "Name",
                        size: "pagerank",
                    }
                },
                relationships: {
                    "CITED_BY": {
                        caption: false,                           
                 }
                },
                initial_cypher: "MATCH p=(:Case)-[:CITED]->(:Case) RETURN p LIMIT 5000"
            }
            var viz = new NeoVis.default(config);
            viz.render();
        }
    </script>
    <body onload="draw()">
        <div id="viz"></div>
    </body>
</html>

Visualisation with neovis.js

To add colour to the various groups of cases in the graph, I used a hacky implementation of the label propagation community detection algorithm (I say hacky, because I didn't set any seed labels).

CALL algo.labelPropagation('Case', 'CITED_BY','OUTGOING',
  {iterations:10,partitionProperty:'partition', write:true})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, write, partitionProperty;

The neovis.js config could then be updated with a "community" attribute to generate different colours for each community of cases:

<html>
    <head>
        <title>DataViz</title>
        <style type="text/css">
            body {font-family: 'Gotham' !important}
            #viz {
                width: 900px;
                height: 700px;
            }
        </style>
        <script src="https://rawgit.com/neo4j-contrib/neovis.js/master/dist/neovis.js"></script>
    </head>   
    <script>
        function draw() {
            var config = {
                container_id: "viz",
                server_url: "bolt://localhost:7687",
                server_user: "sausages",
                server_password: "beans",
                labels: {
                    "Case": {
                        caption: "Name",
                        size: "pagerank",
                        community: "partition"
                    }
                },
                relationships: {
                    "CITED_BY": {
                        caption: false,    
                    }
                },
                initial_cypher: "MATCH p=(:Case)-[:CITED]->(:Case) RETURN p LIMIT 5000"
            }
            var viz = new NeoVis.default(config);
            viz.render();
        }
    </script>
    <body onload="draw()">
        <div id="viz"></div>
    </body>
</html>

Blackstone


As someone involved in the ongoing development of an online legal research system (the ICLR's ICLR.3 platform), I spend quite a bit of time thinking about the ways in which unstructured or partially structured legal texts can be enriched and brought to order, either to prepare the text for later processing in a content delivery pipeline or for some other form of data analysis. 

More often than not, rendering a text amenable to content delivery or data analysis involves a fair amount of wrangling with the text itself to mark up entities of interest and to apply an overall schematic structure to the document.

Legal publishers such as ICLR, Justis, LexisNexis and Thomson Reuters use industrial-strength proprietary tools and teams of people to wrangle unstructured legal material into a form that can be used in their products and services. However, the pool of individuals and companies interested in leveraging legal texts has exploded well beyond a handful of well-established legal publishers. 

In my opinion, the more people playing with legal information and sharing their work the better. So, I've started development on my very first open source project to produce a suite of tools, written in Python, that can be used to perform a wide range of legal text enrichment operations. I call the project Blackstone.

Blackstone

The idea behind Blackstone is relatively simple: it should be easier to perform a standard set of extraction and enrichment tasks without first having to write custom code to get the job done. The objective of the library is to provide a free set of tools that can be used to:

  • Automatically segment the input text into sentences and mark them up

  • Identify and mark up references to primary and secondary legislation

  • Identify and mark up references to case law

  • Identify and mark up axioms (e.g. where the author of the text postulates that such and such is an "established principle of law" etc)

  • Identify other types of entities peculiar to legal writing, such as courts and indictment numbers

  • Produce document level metrics, providing an overview of the document's structure, characteristics and content

  • Generate visualisations

  • Other stuff I haven't thought of yet

Crucially, Blackstone is not intended to be a standalone service. Rather, the intention is to provide a suite of ready-baked Python tools that can be used out of the box in other development or data science pipelines. 

As an open source library, Blackstone stands on the shoulders of world-class, open Python technologies: spaCy, scikit-learn, BeautifulSoup, pandas, requests and, of course, Python's own standard library. Blackstone couples intuitive high-level abstractions of these underlying technologies with custom built constructs designed specifically to deal with legal content.
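To give a flavour of the kind of construct involved (this is an illustrative sketch rather than Blackstone's actual API), spaCy's Matcher can be used to pick out statute short titles such as "Data Protection Act 1998":

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# A short title is (roughly) a run of capitalised words ending in "Act" followed by a year
matcher.add("STATUTE", [[
    {"IS_TITLE": True, "OP": "+"},
    {"TEXT": "Act"},
    {"SHAPE": "dddd"},
]])

doc = nlp("The claim was brought under the Data Protection Act 1998 and the Human Rights Act 1998.")
spans = [doc[start:end] for _, start, end in matcher(doc)]
for span in filter_spans(spans):  # keep the longest non-overlapping candidates
    print(span.text)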

Progress and horizon

The plan is to get an initial Beta release out on GitHub and PyPi by the end of September 2018. To date, the following progress has been made:

  • Function to provide high-level abstraction over spaCy sentence segmentation (testing)

  • Function to assemble comprehensive list of UK statutes (complete)

  • Function to detect and markup primary legislation by reference to short title (complete)

  • Function to detect and markup primary legislation by reference to abbreviation (e.g. DPA or DPA 1998) (testing)

  • Function to resolve oblique references to primary legislation (e.g. the 1998 Act) (developing).

Once I've got a baseline level of functionality completed, I'll release the code on GitHub. More updates to follow.

If you'd like to get involved, share an idea or give me some help, drop me a line on Twitter.

Part 3: Open Access To English Case Law (The Raw Data)

I started writing in the spring of this year about the state of open access to case law in the UK, with a particular focus on judgments given in the courts of England and Wales. 

The gist of my assessment of the state of open access to judgments via the British open law apparatus is set out here, but boils down to:

  • Innovation in the open case law space in the UK is stuck in the mud
  • BAILII is lagging behind comparable projects taking place elsewhere in the common law world: CanLII and CaseText are excellent examples of what's possible.
  • Insufficient focus, if any, is being directed to improving open access to English case law.

In a subsequent article, I explored the value in providing open and free online access to the decisions of judges. I identified four bases upon which open access can be shown to be a worthwhile endeavour: (i) the promotion of the rule of law; (ii) equality of arms, particularly for self-represented litigants; (iii) legal dispute reduction; and (iv) transparency.

In the same article, I developed a rough and ready definition of what "open access to case law" means:

"Open access to case law" isn't a "thing", it's a goal. The goal, at least to my mind, boils down to providing access that is free at the point of delivery to the text of every judgment given in every case by every court of record (i.e. every court with the power to give judgments that have the potential to be binding on lower and co-ordinate courts) in the jurisdiction.

My overriding concern is that a significant number of judgments do not make their way to BAILII and are only accessible to paying subscribers of subscription databases, effectively creating a "haves and have-nots" scenario where comprehensive access to the decisions of judges depends on the ability to pay for it. The gaps in BAILII's coverage were discussed in this article.

In this article I go deeper into exploring how big the gaps are in BAILII's coverage when compared to the coverage of judgments provided by three subscription-based research platforms: JustisOne, LexisLibrary and WestlawUK. 

Aim

The aim of the study was to gather data on the coverage provided by BAILII, JustisOne, LexisLibrary and WestlawUK of judgments given in the following courts between 2007 and 2017:

  • Administrative and Divisional Court
  • Chancery Division
  • Court of Appeal (Civil Division)
  • Court of Appeal (Criminal Division)
  • Commercial Court
  • Court of Protection
  • Family Court
  • Family Division
  • Patents Court
  • Queen's Bench Division
  • Technology and Construction Court

Methodology

The way in which year-on-year counts of judgments given in a particular court can be obtained varies from platform to platform. Accordingly, the following method was devised to extract the data from each platform:

BAILII

BAILII provides an interface to browse its various databases. Within each database, it is possible to isolate a court and a year. The page for a given year of a given court sets out a list of the judgments for that year.

Each judgment appears in the underlying HTML as a list element (<li> ... </li>). For example,

<li><a href="/ew/cases/EWCA/Crim/2017/17.html">Abi-Khalil &amp; Anor, R v </a><a title="Link to BAILII version" href="/ew/cases/EWCA/Crim/2017/17.html">[2017] EWCA Crim 17</a> (13 January 2017)</li>

A count of the <li> ... </li> elements on each page yields the total number of judgments for that court and year.
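Something along these lines is enough to produce the counts (a sketch; the URL shown is illustrative of BAILII's structure rather than a guaranteed endpoint):

import requests
from bs4 import BeautifulSoup

def count_judgments(url):
    """Count the list elements on a BAILII year index page."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("li"))

# Illustrative example: Court of Appeal (Criminal Division) judgments for 2017
print(count_judgments("http://www.bailii.org/ew/cases/EWCA/Crim/2017/"))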

JustisOne, LexisLibrary & WestlawUK

The three subscriber platforms were approached differently. A list of search strategies based on the neutral citation for each court was constructed.

For example, to query judgments given in the Criminal Division of the Court of Appeal in 2017, the following query was constructed:

2017 ewca crim

A query for each court and each year was constructed and then submitted via the platform's "citation" search field. The total number of judgments yielded by the query was extracted by capturing the count of results from the platform's underlying HTML.

The Data

The data captured is available here in raw form. The code used to generate the visualisation in this article is available here as a Jupyter Notebook.

Annual coverage by publisher

The following graph provides an overview of the annual coverage, by publisher, for all of the courts studied. These points leap out of the graph:

  • BAILII's coverage of judgments is far lower than that provided by the three subscription-based platforms, averaging roughly 2,500 to 3,000 judgments per year.
  • Save for a drop in LexisLibrary's favour in 2011, JustisOne consistently provides the most comprehensive coverage of judgments.
  • From 2012, LexisLibrary has closely tracked JustisOne's coverage.
  • There is a sharp and sudden proportional drop in coverage from 2014 across all four platforms.

The key takeaway from this graph is that a significant number of judgments never make it onto BAILII every year.

[Chart: annual count of judgments by publisher (BAILII, Justis, LexisNexis, WestlawUK), 2007 to 2017]

The following graph provides an alternative view of the same data. 

[Chart: an alternative view of the same annual counts by publisher, 2007 to 2017]

Total coverage of court by publisher

This graph provides an overview of how each publisher fares in terms of coverage of the courts included in the study. By and large, there is a healthy degree of parity in coverage of the following courts across all four publishers:

  • Chancery Division
  • Commercial Court
  • Court of Protection
  • Family Court
  • Family Division
  • Technology and Construction Court

However, BAILII is struggling to keep up with the levels of comprehensiveness provided by the commercial publishers in the Administrative Court, both divisions of the Court of Appeal and the Queen's Bench Division. 

The dearth of coverage of judgments from the Criminal Division on BAILII is especially startling, particularly given rising numbers of criminal defendants lacking representation at the sentencing stage. Intuitively (though I have not confirmed this), the deficit in BAILII's coverage of the Criminal Division will almost certainly be judgments following an appeal against sentence. 

[Chart: total judgment counts per court (Admin, Ch, Civ, Comm, Crim, EWCOP, EWFC, Fam, Patents, QB, TCC) by publisher]

(Interim) Conclusion

The data shows that BAILII is providing partial access to the overall corpus of judgments handed down in the courts studied. This, as I have previously been at pains to stress, is not down to any failing on BAILII's part. Rather, it is a symptom of how hopeless existing systems (such as they are) are at servicing BAILII with a comprehensive flow of cases to publish, particularly judgments given extempore. 

It also bears saying that the commercial publishers do not in any way obstruct BAILII from acquiring the material. A fuller discussion of the mechanics driving the problem will appear here soon.

Part 2: Open Access To English Case Law (The gaps)

This is the second substantive article in a series of pieces I am preparing in the run up to a talk I'll be giving in June at the annual conference of the British and Irish Association of Law Librarians (BIALL).

To recap, I first issued a primer, in which I essentially say that the state of open access to case law in the UK isn't where it ought to be in 2018 and that our open case law offering is out of step with similar projects elsewhere in the common law world (e.g. Canada and the United States). 

The primer was followed by the first substantive piece, which attempted to (i) define what "open access to case law" actually means and (ii) set out four justifications for providing open access to the decisions of judges. 

Comprehensive coverage of case law

The crucial point I sought to make in the first article was all about what "open access to case law" actually means. I define (perhaps a little crudely) open access to case law in the following terms:

Open access to case law" isn't a "thing", it's a goal. The goal, at least to my mind, boils down to providing access that is free at the point of delivery to the text of every judgment given in every case by every court of record (i.e. every court with the power to give judgments that have the potential to be binding on lower and co-ordinate courts) in the jurisdiction.

My definition places emphasis on comprehensiveness of coverage: the text of every judgment given in every case by every court of record should be freely available. I deliberately avoid folding additional requirements into the medley. I do not, for example, consider the inclusion of summaries and headnotes that explain the judgments to be part of the core mix (though, summaries are very much nice-to-haves). Nor do I say anything about technology (though, it goes without saying that delivery of the scale of comprehensiveness my definition requires could only be achieved with an online platform). 

Currently, the UK's primary open law outlet, BAILII, for reasons I'll go on to develop in the next article, is providing access to only a fraction of the judgments given in the senior courts. That this is the case, it should be noted, is through no fault on the part of BAILII.  

Gaps in BAILII's coverage

The following graph illustrates the problem. The graph is based on a count of the number of judgments given in the Court of Appeal (Criminal Division) with a [2017] EWCA Crim neutral citation. Justis, via their JustisOne platform, provide access to 1,216 judgments from the Criminal Division with a 2017 citation. WestlawUK doesn't fare quite as well, with 967 available Crim Div judgments. Now look at BAILII. Only 230 Criminal Division judgments are available for 2017.

Count of judgments with [2017] EWCA Crim citation on BAILII, JustisOne and WestlawUK

Assuming that Justis' total of 1,216 represents the total number of judgments given in the Criminal Division of the Court of Appeal bearing a [2017] neutral citation and that the 230 judgments on BAILII form part of that overall total, we can project a view of the proportion of open-access to closed-access judgments (i.e. access is restricted to an area behind a subscriber paywall).

For anyone out there under the impression that there is any semblance of symmetry between the quantity of judgments available in the open and those accessible behind a paywall, the numbers point emphatically the other way. Taking the JustisOne total of 1,216 as the definitive quantity of 2017 Criminal Division judgments, only 19 percent (less than a fifth!) are freely available. 

The situation so far as availability of judgments flowing from the Civil Division of the Court of Appeal is concerned, is not quite as bad, though it still isn't good.

Court of Appeal (Civil Division)

Count of judgments with [2017] EWCA Civ citation on BAILII, JustisOne and WestlawUK

Again, taking the JustisOne count of 755 Civil Division judgments for 2017 as the definitive total and assuming that the 527 judgments available on BAILII are included in that total, the proportion of open to closed access is 70 percent, which is a good deal better than the criminal content but it's still falling well short of where it should be. 

The reasons underlying the lack of symmetry between open access coverage and the coverage offered by the commercial providers boil down to the hopelessly knackered pipeline that takes the judgments (whether handed down or given extempore) further downstream (more on this in the next article). 

Finally, it also bears saying that the lack of symmetry in coverage is not the fault of the commercial suppliers. They are not operating in a way that prevents BAILII from obtaining the data itself. It's just that the commercial suppliers have precisely what BAILII lacks: the resources to navigate a system of judgment supply that is entirely unfit for purpose and has been left to rot for far too long.

Part 1: Open Access To English Case Law (Why Bother?)

In June 2018, I'll be giving a plenary talk at the annual meeting of the British and Irish Association of Law Librarians. The topic I've chosen for the talk is open access to English case law. 

In the run up to the talk itself (primarily for the purposes of arranging my thinking on the content of the talk), I'll be releasing a series of articles on various aspects of the current state of open access to English case law. 

I published a "primer" a couple of weeks ago, in which I essentially say that the UK is running a fair bit behind the likes of Canada and the USA in the open access to case law stakes. My view is that notwithstanding the extraordinary contribution BAILII makes to the open law space, there remains considerable room for improvement. 

This is the first substantive article in the series (at least three or four more will follow). 

This article seeks to provide an outline of my thinking on two fundamental questions:

  1. What does "open access to case law" actually mean?
  2. Why bother providing open access to case law, what's the point?

What does "open access to case law" actually mean?

"Open access to case law" isn't a "thing", it's a goal. The goal, at least to my mind, boils down to providing access that is free at the point of delivery to the text of every judgment given in every case by every court of record (i.e. every court with the power to give judgments that have the potential to be binding on lower and co-ordinate courts) in the jurisdiction.

The goal sets a high bar. But it is a goal, after all. And, the attainment of that goal doesn't necessarily require any other bells and whistles. Things like summaries that explain the judgments, beautiful web interfaces, nice APIs and AI are nice to have bonuses, but they're not essential. The goal is first and foremost about providing access to the words used by the court when giving judgment in every case.

Why bother providing open access to case law, what's the point?

If I'm right about the goals of open access to case law, this question can be reformulated as: why bother providing free access to the text of all judgments given in every court? 

Well, at least four answers spring to mind.

The classic "rule of law" answer

In common law systems like ours, judges, in a broad range of circumstances, are able to make new laws or modify the scope of existing laws. There are any number of ways of casting that statement into tighter, more legalist language, but the essential point is that the words used by judges can, and often do, change the list of rules that govern what we can and can't do and the penalties we are liable to incur if we break those rules.

Because of this, in an ideal world, We, the People, would have some way of finding out what those rules are so that we're able to regulate our conduct to ensure we don't break them and to know what our rights are if we suffer as a result of someone else's breach of the rules.

The closer we move towards the "open access to case law" goal, the closer we get to being able to identify the rules we're expected to play by. 

The "equality of arms" answer

Accurately working out what the law says on issue X, Y or Z is not easy. We will often need an expert to help us determine what the law says on issue X, Y or Z and to help us understand our position in relation to it. These experts are called lawyers and lawyers cost money (generally, lots of money). 

The party to a dispute with access to a lawyer should (if their lawyer is any good) have at least two advantages over the party that does not have access to a lawyer:

  1. They will have the advantage of an advisor with expertise in the substantive law applicable to the dispute, which gives them an obvious head start.
  2. They will have the advantage of an advisor equipped with multiple, industrial-strength tools to help them determine what the applicable law actually says.

The party to the dispute who lacks the means to access these two considerable advantages is therefore obviously at a correlating disadvantage. They're outgunned and probably outnumbered. There is an inequality of arms. Cuts in legal aid and, in many cases, the absence of legal aid altogether, increase the number of disputes in which one side is bringing a knife to a gunfight.

The closer we move towards the "open access to case law" goal, the more that state of inequality is reduced. Even if true equality cannot be achieved, some degree of access to the material governing the determination of who is likely to win and who is likely to lose begins to level the playing field. That's a good thing.

The "dispute reduction" answer

The ability to form a reasonably accurate view of what the rules say on issue X, Y or Z increases our ability to intelligently pick our battles and to nip disputes in the bud before they go anywhere near a court or some other costly method of dispute resolution. 

It may be hard to swallow, but if a litigant-to-be at least has the means of establishing that they probably don't have a leg to stand on (or has the other side bang to rights), more disagreements can be dealt with before lawyers get involved and things start to grow arms and legs. 

The closer we move towards the "open access to case law" goal, the greater our ability to resolve disputes before they morph into nasty, expensive, protracted and minified echoes of Jarndyce v Jarndyce.

The "public information" answer

Courts are public institutions financed by public funds. Judgments are their unit of activity. Those units of activity should be open to public scrutiny and study. Judgments are public information (unless there's a good reason to keep their content secret).

It may well be that nobody ever bothers to look at them. But that's not the point. The point is that if judgments are only meaningfully accessible on systems we have to pay to access, they're not meaningfully accessible to the public. 

Conclusion

In this short article, I've proposed a rough and ready definition of what "open access to case law" is and four justifications as to why it is a worthwhile pursuit. In the next article, I'm going to lock in on the nitty gritty of the state of open access to case law in the United Kingdom.

Open access to English case law (a Primer)

TL;DR

  • Innovation in the open case law space in the UK is stuck in the mud
  • BAILII is lagging behind comparable projects taking place elsewhere in the common law world: CanLII and CaseText are excellent examples of what's possible.
  • Insufficient focus, if any, is being directed to improving open access to English case law

There is a tsunami of innovation happening in the legal space right now. The problem is, so far as I can tell, none of it is being directed towards improving the way the decisions of judges in the English courts are made accessible to the wider public. 

Innovation in the pursuit of achieving broader, more intuitive and freer access to English case law has stagnated for at least five years. It is true that the United Kingdom has BAILII and nothing that follows in this series of blog posts is intended to take anything away from how important BAILII is or how successful it has been in opening access to the decisions of judges. However, BAILII (through no fault of its own) has been unable to keep pace with the levels of really positive innovation I've observed in similar projects taking place outside the UK (notably BAILII's Canadian equivalent, CanLII, and the US freemium/premium case law platform, CaseText). 

Open access to case law in the United Kingdom suffers from the following weaknesses (this list is by no means exhaustive):

  1. Gaps in coverage: there are too many gaps in the legacy case law archive and there are too many gaps in ongoing coverage of new judgments, especially those that are given extempore. There is still a vast amount of retrospective and prospective material that can only be accessed via paid subscription services.
  2. User-friendliness: BAILII is simple enough to use if you're used to researching the law online, but there is a considerable amount that could be done to improve the service for the benefit of lay users. 
  3. Sustainability: plenty of people use BAILII, but very few of them make donations to help BAILII raise enough financial resource to pursue product development projects.
  4. No platform for experimentation or third-party development: unlike CanLII, BAILII doesn't have a public API. Third-party innovation has stalled because it is incredibly difficult to acquire access to the text of the cases.

The weaknesses I've set out above are a function of the following broader problems (again, this list isn't exhaustive):

  1. The supply chain that takes a judgment (whether handed down or given extempore) to the wider public is messy and poorly understood by the Ministry of Justice (which is worrying, because they control that supply chain).
  2. Intellectual property rights over the judgments themselves are needlessly uncertain.
  3. There is no solid model for translating the way the common law works to the sort of open case law system we need.
  4. BAILII, in several key ways, itself acts like a publisher of proprietary content.

This post is a "primer" for a series of blogs posts I'm writing on the subject in the run-up to a talk I'll be giving at the British and Irish Association of Law Librarian's in June 2018. 

Using Scikit-Learn to classify your own text data (the short version)

Last month I posted a lengthy article on how to use Scikit-Learn to build a cross-validated classification model on your own text data. The purpose of that article was to provide an entry point for new Scikit-Learn users who wanted to move away from using the built-in datasets (like twentynewsgroups) and focus on their own corpora.

I thought it might be useful to post a condensed version of the longer read for people who wanted to skip over the explanatory material and get started with the code.

As before, the objective of the code is as follows. We have a dataset consisting of multiple directories, each containing n text files. Each directory name acts as a descriptive category label for the files contained within (e.g. technology, finance, food). We're going to use this data to build a classifier capable of receiving new, unlabeled text data and assigning it to the best fitting category.

The code

import sklearn
import numpy as np
from glob import glob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.externals import joblib

Get paths to labelled data

rawFolderPaths = glob("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/*/")

print ('\nGathering labelled categories...\n')

categories = []

Extract the folder paths, reduce down to the label and append to the categories list

for i in rawFolderPaths:
    string1 = i.replace('/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/','')
    category = string1.strip('/')
    #print (category)
    categories.append(category)

Load the data

print ('\nLoading the dataset...\n')
docs_to_train = sklearn.datasets.load_files("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML",description=None, categories=categories, load_content=True, encoding='utf-8', shuffle=True, random_state=42)

Split the dataset into training and testing sets

print ('\nBuilding out hold-out test sample...\n')
X_train, X_test, y_train, y_test = train_test_split(docs_to_train.data, docs_to_train.target, test_size=0.4)

Transform the training data into tfidf vectors

print ('\nTransforming the training data...\n')
count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(raw_documents=X_train)

tfidf_transformer = TfidfTransformer(use_idf=False)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print (X_train_tfidf.shape)

Transform the test data into tfidf vectors

print ('\nTransforming the test data...\n')
count_vect = CountVectorizer(stop_words='english')
X_test_counts = count_vect.fit_transform(raw_documents=X_test)

tfidf_transformer = TfidfTransformer(use_idf=False)
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
print (X_test_tfidf.shape)

print (X_test_tfidf)
print (y_train.shape)

docs_test = X_test

Construct the classifier pipeline using a SGDClassifier algorithm

print ('\nApplying the classifier...\n')
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer(use_idf=True)),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                      alpha=1e-3, random_state=42, verbose=1)),
])

Fit the model to the training data

text_clf.fit(X_train, y_train)

Run the test data into the model

predicted = text_clf.predict(docs_test)

Calculate mean accuracy of predictions

print (np.mean(predicted == y_test))

Generate labelled performance metrics

print(metrics.classification_report(y_test, predicted,
    target_names=docs_to_train.target_names))
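One final note: joblib is imported at the top of the listing but isn't used in this condensed version. An obvious use for it is persisting the fitted pipeline so it can be reloaded later without retraining, roughly like this:

joblib.dump(text_clf, 'text_clf.joblib')    # save the fitted pipeline to disk

text_clf = joblib.load('text_clf.joblib')   # reload it later without retraining
print(text_clf.predict(['Some new, unlabelled text to classify']))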

Scraping news websites and looking for specific words and phrases

This afternoon, my colleague and Transparency Project member, Paul Magrath, told me he was interested in finding out whether there's a way of systematically watching out for a set of pre-defined "trigger words" of interest to the Transparency Project in online articles published by a selection of news organisations with a nasty habit of misreporting family court proceedings. 

I thought "that's a perfect job for Python" and sat down to write a basic proof of concept for Paul to take a look at. 

The code, which is here, iterates through an RSS feed on the Daily Mail's online site, reads each article by requesting the article link for each item in the feed and checks it for a list of pre-defined triggers (currently devised around an article about Myleene Klass, of all people). The output is generated back to a CSV file for review. 
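For a flavour of the approach, here is a stripped-down sketch of that idea. The feed URL and trigger words below are placeholders rather than the ones in the actual script.

import csv
import feedparser
import requests
from bs4 import BeautifulSoup

FEED_URL = "https://www.dailymail.co.uk/articles.rss"                 # placeholder feed URL
TRIGGERS = ["family court", "custody battle", "court of protection"]  # placeholder trigger words

feed = feedparser.parse(FEED_URL)

with open("matches.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Title", "Link", "Triggers found"])
    for entry in feed.entries:
        article = BeautifulSoup(requests.get(entry.link).text, "html.parser")
        text = article.get_text().lower()
        hits = [t for t in TRIGGERS if t in text]
        if hits:
            writer.writerow([entry.title, entry.link, "; ".join(hits)])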

Here's the GitHub repo.

Using scikit-Learn on your own text data

Scikit-learn’s Working with Text Data provides a superb starting point for learning how to harness the power and ease of the sklearn framework for the construction of really powerful and accurate predictive models over text data. The only problem is that scikit-learn’s extensive documentation (and, be in no doubt, the documentation is phenomenal) doesn’t help much if you want to apply a cross-validated model to your own text data.

At some point, you’re going to want to move away from experimenting with one of the built-in datasets (e.g. twentynewsgroups) and start doing data science on textual material you understand and care about. 

The purpose of this tutorial is to demonstrate the basic scaffold you need to build to apply the power of scikit-learn to your own text data. I’d recommend methodically working your way through the Working with Text Data tutorial before diving in here, but if you really want to get cracking, read on.

If you can't be bothered reading on and just want to see the code, it's in a repo on GitHub, here.

Objectives

Before we start, let’s be clear about what we’re trying to do. We have a great big collection of text documents (ideally as plain text from the off). Our documents are, to use the twentynewsgroups example, all news articles. The news articles have been grouped together, in directories, by their subject matter. We might have one subdirectory consisting of technology articles, called Technology. We might have another subdirectory consisting of articles about tennis, called Tennis.

Our project directory might look like this (assume each subdirectory has 100 text documents inside):

news_articles \
    art
    business
    culture
    design
    food
    technology
    tennis
    war

The aim of the game is to use this data to train a classifier that is capable of analysing a new, unlabelled article and determining which bucket to put it in (this is an article about food, this is an article about business, etc). 

What our code is going to do

We’re going to write some code, using scikit-learn, that does the following:

  • Loads our dataset of news articles and categorises those articles according to the name of the folder they live in (e.g. art, food, tennis)
  • Splits the dataset into two chunks: a chunk we’re going to use to train our classifier and another chunk that we’re going to use to test how good the classifier is
  • Converts the training data into a form the classifier can work with
  • Converts the test data into a form the classifier can work with
  • Builds a classifier 
  • Applies that classifier to our training data
  • Fires the test data into our trained classifier
  • Tells us how well the classifier did at predicting the right label (art, food, tennis etc) of each document in the test dataset

1. Get the environment ready

The first job is to bring in everything we need from scikit-learn:

import sklearn
import numpy as np
from glob import glob
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline 

That’s the stage set in terms of bringing in our dependencies.

2. Set our categories

The next job is to state the names of the categories (our folders of grouped news articles) in a list. These need to exactly match the names of the subdirectories acting as the categorical buckets in your project directory.

categories = ['art', 'business', 'culture', 'design', 'food', 'technology', 'tennis', 'war']

This approach of manually setting the folder names works well if you only have a few categories or you’re just using a small sample of a larger set of categories. However, if you’ve got lots of category folders, manually entering them as list items is going to be a bore and will make your code very, very ugly (I'll write a separate blog post on a better way of dealing with this, or look at the repo on GitHub, which incorporates the solution to this problem).
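As a taste of that better way, the categories list can be derived from the folder names themselves rather than typed out by hand, mirroring the approach used in the condensed version of this tutorial. A sketch (the path is a placeholder):

from glob import glob
import os

# Derive the categories list from the subdirectory names instead of typing them out
categories = [
    os.path.basename(path.rstrip('/'))
    for path in glob('/path/to/the/project/folder/*/')
]
print(categories)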

3. Load the data

We’re now ready to load our data in:

docs_to_train = sklearn.datasets.load_files("/path/to/the/project/folder/", 
    description=None, categories=categories, 
    load_content=True, encoding='utf-8', shuffle=True, random_state=42)

All we’re doing here is saying that our dataset, docs_to_train, consists of the files contained within all of the subdirectories of the path specified inside the .load_files function and that the categories are the categories set out in our categories list (see above). Forget about the other stuff in there for now.

4. Split the dataset we’ve just loaded into a training set and a test set

This is where the real work begins. We’re going to use the entire dataset, docs_to_train, to both train and test our classifier. For this reason, we’ve got to split the dataset into two chunks: one chunk for training and another chunk (that the classifier won’t get to look at in training) for testing. We’re going to “hold out” 40% of the dataset for testing:

X_train, X_test, y_train, y_test = train_test_split(docs_to_train.data,
    docs_to_train.target, test_size=0.4)

It’s really important to understand what this line of code is doing. 

First, we’re creating four new objects, X_train, X_test, y_train and y_test. The X objects are going to hold our data, the content of the text files. We’ve got one X object, X_train, and that will hold the text file data we’ll use to train the classifier. We have another X object, X_test, and that will hold the text file data we’ll use to test the classifier. The Xs are the data.

Then we have the Ys. The Y objects hold the category names (art, culture, war etc). y_train will hold the category names that correspond to the text data in X_train. y_test will hold the category names that correspond to the text data in X_test. The y values are the targets. 

Finally, we’re using test_size=0.4 to say that out of all the data in docs_to_train we want 40% to be held out for the test data in X_test and y_test.

5. Transform the training data into a form the classifier can work with

Our classifier uses mathematics to determine whether Document X belongs in bucket A, B, or C. The classifier therefore expects numeric data rather than text data. This means we’ve got to take our text training data, stored in X_train, and transform it into a form our classifier can work with. 

count_vect = CountVectorizer(stop_words='english')

X_train_counts = count_vect.fit_transform(raw_documents=X_train)

These two lines do a lot of heavy lifting and I would strongly urge you to go back to the Working with Text Data tutorial to fully understand what’s going on here.

The first thing we’re doing is setting up a vectoriser, CountVectorizer(). This is a function that will count the number of times each word in the dataset occurs and project that count into a vector. 

Then, we take that vector and apply it to the training data stored in X_train. We store those occurrence vectors in X_train_counts.

Once that’s done we move on to the clever transformation bit. We’re going to take the occurrence counts, stored in X_train_counts, and transform them into a term frequency inverse document frequency value. 

tfidf_transformer = TfidfTransformer(use_idf=True)

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

Why are we doing this? Well, if you think about it, the documents in your training set will naturally vary in word length; some are going to be long, others are going to be short. Longer documents have more words in them and that’s going to result in a higher count for each word. That’s going to skew your results. What we really want to do is get a sense of the count of each word proportionate to the number of words in the document. Tf-idf (term frequency-inverse document frequency) achieves this. 

6. Transform the test data into a form the classifier can work with

Since we’ve gone to the trouble of splitting the dataset into a training set and a test set, we also need to transform our test data in exactly the same way as we just did with the training set. All we’re doing here is mirroring the transformation process we just applied to X_train onto X_test. 

count_vect = CountVectorizer(stop_words='english')
X_test_counts = count_vect.fit_transform(raw_documents=X_test)

tfidf_transformer = TfidfTransformer(use_idf=True)
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)

7. Scikit-learn gives us a far better way to deal with these transformations: pipelines!

It was worth reading about the transformation process, because if you’re working with text data and trying to do science with it you really do need to at least see why and how that text is transformed into a numerical form a predictive classifier can deal with. 

However, scikit-learn actually gives us a far more efficient way (in terms of lines of code) to deal with the transformations: it’s called a pipeline. The pipeline in this example has three phases. The first creates the vectoriser, the machine used to turn our text into numbers (a count of occurrences). The second phase deals with transforming the crude vectorisation handled in the first into a frequency-based representation of the data, the term frequency-inverse document frequency. Finally, and most excitingly, the third phase of the pipeline sets up the classifier, the machine that’s going to train the model. 

Here’s the pipeline code:

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, 
    verbose=1)),])

For now, don’t worry about the parameters set out in the classifier, just appreciate the structure and content of the pipeline. 

8. Deploy the pipeline and train the model

Now it’s time to train our model by applying the pipeline we’ve just built to our training data. All we’re doing here is taking our training data (X_train) and the corresponding training labels (y_train) and passing them into the fit function that comes built into the pipeline. 

text_clf.fit(X_train, y_train)

Depending on how big your dataset is, this could take a few minutes or a bit longer. 

As a sidenote, on this point, you might have noticed that I set the verbose parameter in the classifier as 1. This is purely so I can see that the classifier is running and that the script isn't hanging because I’m chewing through memory. 

9. Test the model we’ve just trained using the test data

We’ve now trained our model on the training data. It’s time to see how well trained the model really is by letting it loose on our test data. What we’re going to do is take the test data, X_test, let the model evaluate it based on what it learned from being fed the training data (X_train) and the training labels (y_train) and see what categories the model predicts the test data belongs to. 

predicted = text_clf.predict(X_test)

We can measure the model’s accuracy by taking the mean of the classifier’s predictive accuracy like so:

print (np.mean(predicted == y_test))

Better yet, we can use scikit-learn's built-in metrics library to get detailed performance statistics by category (i.e. how well did the classifier do at predicting that an article about "design" is, in fact, an article about "design"?):

print(metrics.classification_report(y_test, predicted, 
    target_names=docs_to_train.target_names))

The report gives you precision, recall and F1 scores for each category, along with overall averages for the model, all between 0 and 1. 

The closer that average score gets to 1, the better the model will perform. Mind you, if your model is averaging a score of 1 on the nose, something has gone wrong!   
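
If you want to dig a little deeper than the per-category scores, a confusion matrix is a quick way of seeing which categories are being mistaken for which. A minimal sketch, reusing the y_test and predicted variables from above:

# rows correspond to the true categories, columns to the predicted categories
print(metrics.confusion_matrix(y_test, predicted))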

Learning Python: a handful of tips

I've been learning Python for the past year or so and have just reached the stage where I've moved from being a beginner to an intermediate level user of the language. Out of all of the programming languages I've dabbled in (C++, Delphi, Java, R and Javascript), Python is my firm favourite and the only language I've settled down with. 

Python, in my view at least, is an extraordinarily approachable language because:

  1. The syntax is extremely readable, descriptive and easy to follow
  2. It's interpreted, meaning you don't have to go through the bother of compiling your code every time you want to run it. 
  3. The community and documentation are phenomenal.

It didn't take too long for me to start writing code that just about did what I wanted it to do. In the main, I'd have an idea (say, to write some code to extract information from a group of text documents), sketch out roughly how I thought the code should be structured (e.g. do this, then do that, then do the other), and hunt through StackOverflow in search of examples of how to accomplish each step. 

I'm now at the point where I can write about 80% of the code for any particular project without having to resort to StackOverflow, and I'm more confident starting to write my code unassisted.  

Looking back, several habits I've started to practise have helped me move my knowledge of Python forward at a faster rate, and I think that had I started practising these habits a bit sooner, I'd have reached the level I'm at a little quicker.

So, what follows are some general tips that may help you if you're a beginner who wants to make progress with Python a bit faster. 

TIP #1: USE PYCHARM CE

For the first six months, I did the vast bulk of my coding in the interpreter at the command line, a la:

$ python

>>> print "hello, world"
hello, world

This worked really well early on when I was trying trivial examples, because I was immediately seeing the result of the code I was writing. However, as soon as I wanted to produce something with more than 20 lines of code, I got into a horrible mess. 

I soon realised that I needed to start using a proper Python IDE and after a bit of digging around, I settled on PyCharm CE. It's free and it's excellent, because:

  • You can run your code in the IDE whenever you like, so the instant feedback of the command line interpreter is available at the click of a button, but your code is still sitting there in the editor, helping you keep track of what's going on where.
  • PyCharm automatically deals with whitespace indents, so you won't run into errors because you've forgotten to indent the first statement in a for loop.
  • You can easily flip between the 2.7 and 3.6 interpreter
  • The editor provides feedback line by line on errors and style
  • The editor provides inline information that helps you see, for example, what arguments a particular function takes
  • The editor auto-completes your code

TIP #2: MAKE LIBERAL USE OF THE PRINT FUNCTION

Whether I'm writing my own program from scratch or playing with a complex example, I initially pepper my code with print. It really helps to be able to see what's going on in the code at various stages. So, my advice is as follows: if it moves, print it to the console and take a look. 
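
By way of a trivial (and entirely hypothetical) sketch, peppering a file-processing loop with print tells you straight away whether the code is seeing what you think it should be seeing:

# hypothetical example: peek at each file as it's processed
for filename in ['case_1.txt', 'case_2.txt', 'case_3.txt']:
    text = open(filename).read()
    print(filename, len(text))  # which file are we on, and how much text did we actually read?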

TIP #3: LOOK AFTER YOUR CODE

You're going to find that as you begin to grow more competent you'll start referring back to your own code to remind yourself how to perform certain types of operation. Your own code, in a sense, will become a valuable source of reference.

Keep your code organised in a way that makes sense to you. I recommend these simple rules to help you keep your code accessible:

  • Store each distinct project in its own folder and give the folder a sensible name
  • Give your individual Python files sensible names, too
  • At the top of your scripts, add a short comment block explaining what the code sets out to achieve (this comes in handy when you're looking back over your older code)

TIP #4: FOCUS ON GETTING THE CODE TO WORK

Personally, I'm of the view that as beginners our first focus should be getting the code to work rather than ensuring it looks beautiful or runs blisteringly fast. Writing optimised, beautiful code comes later - the priority is first to get your code to do what you want it to do.

TIP #5: KEEP NOTES

Note-taking forms an important part of our general process for learning anything (whether it's courses on a law degree, learning quadratics in secondary school or learning Python), so take notes as you progress through the fundamentals in Python.

Notes serve at least two really useful functions:

  • They reinforce your learning through the process of writing the notes themselves
  • They act as an invaluable reference source as you start to accumulate a larger base of knowledge. 

Your notes can be as basic or elaborate as you like, but it's a good idea to get a note of the fundamentals, such as the structure of different loops, iterating over files, reading and writing files etc.

TIP #6: DON'T BE AFRAID TO BREAK YOUR CODE

As a beginner, there's nothing like slogging over a coding project and finally seeing it run and work as expected. Once this point has been reached, there may be a strong reluctance to touch the code again for fear of breaking it. This is understandable, but it will prevent you from reviewing your code and making improvements to it.

If you're really worried that you're going to break the code, copy and paste it into a new file and work on it away from the original. Asking how you can make the intention of your code more obvious and finding ways to reflect that in your code is a really important part of the Python learning curve.

TIP #7: IMMERSE YOURSELF IN ERRORS

Errors that cause your code to halt in its tracks are inevitable. I run into at least half a dozen whenever I'm attempting something new or complex. To a beginner, error messages can be a bit bamboozling, but it's important that you consciously engage with them and pursue a way to fix them rather than being tempted to abandon the project.

To begin with, errors will look pretty alien. However, one of the things I've come to love about Python is how descriptive its error handling is once you've got used to it. My advice is to make an effort to read the entire error trace, which will tell you exactly where in your code the interpreter is running into problems. If the error doesn't speak for itself, copy and paste it into Google: chances are there will be a StackOverflow question and answer about the error you're hitting.

Persevere with errors and work hard to fix them.

TIP #8: READ DOCUMENTATION

As with any new technical subject, the surrounding literature will initially be difficult to penetrate. However, the more you make a habit of reading documentation, the more sense it will start to make over time. 

Python benefits from an excellent community of developers who put an incredible amount of thought into the documentation they produce for their code; take advantage of it.

TIP #9: THINK ABOUT CODING WHEN YOU'RE NOT CODING

A significant amount of the cerebral heavy lifting for a coding project occurs when you're nowhere near your machine. More often than not, an idea forms in my mind when I'm walking over Waterloo Bridge or sitting on a train. When those moments arrive, I start planning the structure of the code in my mind and identifying obstacles and dependencies (am I going to need to restructure or wrangle a dataset? what packages will I need to do that? what processing do I need to do before I get to the heart of the code? what do I think I'm going to need to learn before I can do what I want to do?). 

Allowing ideas the chance to marinate before you even start writing the code will get you off to a far better start than blindly jumping straight in.

TIP #10: LISTEN TO TALK PYTHON TO ME

The TalkPythonToMe podcast is fabulous. The podcast's host, Michael Kennedy, has curated what is in all probability the definitive Python podcast and he knows his onions. 

There's definitely something about listening to experts talk about a subject you're trying to get a grip on, even if for some or most of the time you have no idea what they're going on about. 

There's your starter for ten. Good luck and have fun!

 

A (Brief) Excursion into Topic Modelling with Mallet

For the past three months or so, I've been experimenting with a range of topic models across a range of technologies, including R, Python, C++ and Java. 

I've recently been spending time with MALLET (a Java-based suite of NLP tools) and I'm really impressed with how easy this implementation is to get working. 

There is a truly excellent walkthrough courtesy of the Programming Historian right here, which makes everything perfectly clear if you're coming to this without much experience. 

As is usual, I tested MALLET with a reasonably large corpus of judgments from the Criminal Division of the Court of Appeal, which I had organised as .txt files in a directory on my machine.

The following steps provide a basic outline of how I got everything going:

Download MALLET

Speaks for itself. You can download MALLET here. Unzip the .tar file to a directory of your choosing.

Import the data

The first thing we need to do is import the data into MALLET. To do this, open a command line in the directory where you unpacked the MALLET .tar file and run the following command:

bin/mallet import-dir --input path/to/your/data --output topic-input.mallet \
  --keep-sequence --remove-stopwords

This runs MALLET, points it at the directory holding your data, creates the input file you'll use in the next step (topic-input.mallet) and removes uninteresting stopwords (like a, of, the, for, etc.).

Build the topic model

The steps above shouldn't have taken you much more than 5-10 minutes. This bit is the fun part - building the topic model. 

At the command line, run:

bin/mallet train-topics --input topic-input.mallet --num-topics 50 --output-state topic-state.gz --output-doc-topics doc-topics.txt --output-topic-keys topic_keys.txt

This passes in the input file generated in the step above, sets the number of topics to generate at 50 and then specifies a range of outputs. 

The most interesting outputs generated are:

  • topic_keys.txt, which sets out the topics and the key terms within each topic (a short Python sketch for reading this file follows below)
  • doc-topics.txt, which sets out the main topic allocations for each document in the dataset.
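
If you'd rather pull the topics into Python than eyeball the raw text file, a minimal sketch along these lines should do the job (it assumes the default layout of topic_keys.txt: a topic number, a weight and the key terms, separated by tabs):

# read MALLET's topic_keys.txt (assumed layout: topic_id <tab> weight <tab> space-separated terms)
with open("topic_keys.txt") as f:
    for line in f:
        topic_id, weight, terms = line.strip().split("\t")
        print("Topic {}: {}".format(topic_id, terms))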

A first run with Hierarchical Dirichlet Process

I've been experimenting with Latent Dirichlet Allocation for a while in R, but was looking for a topic model algorithm that did not require the number of topics (k) to be defined a priori, before the algorithm is applied to the text data I wanted to work with. 

As a relative newcomer to topic modelling, I hadn't even heard of David Blei or Chong Wang, both of whom I now know to be pioneers in modern topic modelling. Quite by chance, I stumbled on Wang and Blei's implementation of the Hierarchical Dirichlet Process in C++, a topic model where the data determine the number of topics. 

Getting HDP to work required quite a bit of wrangling and I couldn't find any walkthroughs suitable for novices like me, so I thought it would be worth noting up how I managed to get it to work with a small sample set of text data. 

What follows is far from a perfect (or even good) representation of how to apply HDP to text data, but the steps that follow did work for me. Here goes...


Sample Data

It goes without saying that the very first step is to assemble a corpus of text data against which we'll apply the HDP algorithm. In my case, as is usual, I used ten English judgments (all of which are recent decisions from the Criminal Division of the Court of Appeal) in .txt format. Save these into a folder.


Getting the Sample Data ready for HDP

Before we even go near Wang & Blei's algorithm, we need to prepare the sample data in a particular way. 

The algorithm requires the data to be in LDA-C format, which looks like this:

[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

where [M] is the number of unique terms in the document and the [count] associated with each term is the number of times that term appears in the document.
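
So, for example, a document with four tokens in which the word at vocabulary index 0 appears twice and the words at indices 4 and 7 appear once each would be encoded on a single line as 3 0:2 4:1 7:1 (three unique terms, each followed by its count).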

This presents the first problem, because our data appear as words in .txt format files. Fortunately, there's an excellent Python program called text2ldac that comes to our rescue. Text2ldac takes the data in .txt format and outputs the files we need, in the form in which we need them.

Clone text2ldac from the git repo here

Once you've pulled down text2ldac, you're ready to take your text files and process them. To do this, go to the command line and run the following command (make adjustments to the example that follows to suit your own directories and filenames):

$ python text2ldac.py --stopwords stopwords.txt /Users/danielhoadley/Documents/Topic_Model/text2ldac/input

All that's happening here is we're running text2ldac.py, using the --stopwords flag to pass in our stopwords (which, in my case, are in a file named stopwords.txt) and then passing in the directory that contains our .txt files.

This will output three files: a .dat file (e.g. input.dat), which is the all-important LDA-C formatted input for the HDP algorithm; a .vocab file, which contains all of the words in the corpus (one word per line); and a .dmap file, which lists the input .txt documents. 


Time to run HDP

Now that we have our data in the format required by the HDP algorithm, we're ready to apply the algorithm. 

For convenience, I recommend copying the three files generated by text2ldac into the folder you're going to run HDP from, but you can leave them wherever you like. 

Go to the folder containing the HDP program files and run the following command (again, adjust to your own folders and filenames):

$ ./hdp --algorithm train --data /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat --directory train_dir

Let's unpack this a bit: 

1. ./hdp invokes the HDP program

2. The --algorithm flag sets the algorithm to be applied, namely train

3. The path that follows the algorithm flag points to the .dat file produced by text2ldac

4. --directory train_dir is telling HDP to place the output files in a directory called train_dir

You'll know you've successfully executed the program if the prompt begins printing something that looks like this:

Program starts with following parameters:
algorithm:= train
data_path:= /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat
directory:= trainer
max_iter= 1000
save_lag= 100
init_topics = 0
random_seed = 1488746763
gamma_a = 1.00
gamma_b = 1.00
alpha_a = 1.00
alpha_b = 1.00
eta = 0.50
#restricted_scans = 5
split-merge = no
sampling hyperparam = no

reading data from /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat

number of docs: 9
number of terms : 5795
number of total words : 35865

starting with 7 topics 

iter = 00000, #topics = 0008, #tables = 0076, gamma = 1.00000, alpha = 1.00000, likelihood = -305223.54210
iter = 00001, #topics = 0008, #tables = 0079, gamma = 1.00000, alpha = 1.00000, likelihood = -301582.68017
iter = 00002, #topics = 0008, #tables = 0079, gamma = 1.00000, alpha = 1.00000, likelihood = -300273.98808

I'm not going to go into all of these parameters here, but the main one to note is max_iter, which sets the number of times the algorithm iterates over the input data. 

Note also that the algorithm has decided for itself how many topics it's going to start with (in the above example, seven).

The algorithm will produce a bunch of .dat files. The one we're really interested in has a name like mode-topics.dat, rather than something like 00300-topics.dat (which is a snapshot saved at the 300th iteration of the algorithm's walk). 


Printing the topics determined by HDP

Wang & Blei very helpfully provided an R script, print.topics.r, which you can use to turn the results of the algorithm into a human-readable form. This is helpful because the output generated by the algorithm will look like this:

00001 00001 00007 00007 00007
00011 00000 00000 00000 00000

At this stage, you need two files as input for the R script: the mode-topics.dat file (or similarly named file) generated by HDP and the .vocab file generated by text2ldac. 

Go back to the command line and navigate to the folder that contains print.topics.r. First, you'll need to make the R script executable, so run:

$ sudo chmod +x print.topics.r

Then run

$ ./print.topics.r /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/trainer/mode-topics.dat /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.vocab topics.dat 4

1. ./print.topics.r runs the R script

2. The first argument is the path to the mode-topics.dat file produced by the HDP algorithm

3. The second argument is the path to the .vocab file produced by text2ldac

4. The third argument is the name of the file you want to output the human-readable result to, e.g. topics.dat

5. Finally, the fourth argument, which isn't mandatory, is the number of terms per topic you wish to output - the default is 5.


The output

If everything has worked as it ought to have done, you'll see the human-readable topics and their top terms when you open the topics.dat file in RStudio.

The output is by no means perfect the first time around. Better results will probably depend on hitting the source data with a raft of stop words and tweaking HDP's many parameters, but it's a good start.

Sentiment in Case Law


For the past few months, I've been exploring various methods of unlocking interesting data from case law. This post discusses the ways in which data science techniques can be used to analyse the sentiment of the text of judgments.

The focus in this post is mainly technical and describes the steps I've taken using a statistical programming language, called R, to extract an "emotional" profile of three cases.

I have yet to arrive at any firm hypothesis of how this sort of technique could be used to draw conclusions that would be of use in court or during case preparation, but I hope some of the initial results canvassed below are of interest to a few. 


TidyText is an incredibly effective and approachable package in R for text mining that I stumbled across when flicking through some of the rstudio::conf 2017 materials a few days ago. 

There's loads of information available about the TidyText package, along with its underlying philosophy, but this post focuses on an implementation of one small aspect of the package's power: the ability to analyse and plot the sentiment of words in documents.

MY TEST DATA

I'm using a small dataset for this walkthrough that consists of three court judgments: two judgments of the UK Supreme Court and one from the Appellate Committee of the House of Lords, the Supreme Court's predecessor.

The subject matter of the dataset isn't really that important. My purpose here is to use the tools TidyText makes available to chart the emotional attributes of the words used in these judgments over the course of each document. 

GET THE DATA READY FOR ANALYSIS

First off, we need to get our environment ready and set our working directory:

# Environment
library(tidyverse)
library(tidytext)
library(stringr)

# Data - setwd to the folder that contains the data you want to work with
setwd("~/Documents/R/TidyText")

Next, we're going to get the text of the files (in my case, the three judgments) into a data frame:

case_words <- data_frame(file = paste0(c("evans.txt", "miller.txt", "donoghue.txt"))) %>%
  mutate(text = map(file, read_lines))

This gives us a tibble with one row per file: a file column holding the filename and a text list-column holding the lines of each judgment. We now need to unnest that tibble so that we have one row per line of text, with the document name and a line number alongside.

case_words <- case_words %>%
  unnest() %>%
  mutate(line_number = 1:n(),
         file = str_sub(basename(file), 1))
case_words$file <- forcats::fct_relevel(case_words$file, c("evans.txt", "miller.txt", "donoghue.txt"))

After unnesting and tokenising the text (the unnest_tokens step below), we end up with a tibble that looks like this:

# A tibble: 96,318 × 3
        file line_number      word
      <fctr>       <int>     <chr>
1  evans.txt           1      lord
2  evans.txt           1 neuberger
3  evans.txt           1      with
4  evans.txt           1      whom
5  evans.txt           1      lord
6  evans.txt           1      kerr
7  evans.txt           1       and
8  evans.txt           1      lord
9  evans.txt           1      reed
10 evans.txt           1     agree
# ... with 96,308 more rows

You can check the state of your table at this point by running:

head(case_words)

Tokenising the text so that each row holds a single word is the last thing we need to do before we're ready to begin computing the sentiment of the data:

case_words <- case_words %>%
  unnest_tokens(word, text)

SENTIMENT ANALYSIS OF THE JUDGMENTS

We are ready to start analysing the sentiment of the data. TidyText comes armed with three different sentiment lexicons: AFINN, NRC and Bing. The first thing we're going to do is get a bird's-eye view of the different sentiment profiles of each judgment using the NRC lexicon and plot the results using ggplot:

case_words %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(index = line_number %/% 20, file, sentiment) %>%
  summarize(n = n()) %>%
  ggplot(aes(x = index, y = n, fill = file)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  facet_wrap(~ sentiment, ncol = 3)

A bird's-eye view of the ten emotional profiles of each judgment

The x-axis of each graph represents the position within each document from beginning to end; the y-axis counts the number of words in each 20-line chunk that carry the sentiment under analysis. 

We can get a closer look at the emotional fluctuation by plotting an analysis using the AFINN and Bing lexicons:

case_words %>%
  left_join(get_sentiments("bing")) %>%
  left_join(get_sentiments("afinn")) %>%
  group_by(index = line_number %/% 20, file) %>%
  summarize(afinn = mean(score, na.rm = TRUE),
            bing = sum(sentiment == "positive", na.rm = TRUE) - sum(sentiment == "negative", na.rm = TRUE)) %>%
  gather(lexicon, lexicon_score, afinn, bing) %>%
  ggplot(aes(x = index, y = lexicon_score, colour = file)) +
  geom_smooth(stat = "identity") +
  facet_wrap(~ lexicon, scale = "free_y") +
  scale_x_continuous("Location in judgment", breaks = NULL) +
  scale_y_continuous("Lexicon Score")

Sentiment curves using afinn and bing sentiment dictionaries

FINDINGS

The Bing analysis (pictured right) appears to provide a slightly more stable view: movement above the zero line indicates positive emotion; movement below it indicates negative emotion.

For example, if we take the judgment in R (Miller), we can see that the first half of the judgment is broadly positive, but then dips suddenly around the middle of the judgment. The line indicates that the second half of the judgment text is slightly more negative than the first half, but rises to its peak of positivity just before the end of the text.

The text of the judgment in Donoghue is considerably more negative. The curve sharply dips as the judgment opens, turns more positive towards the middle of the document, takes a negative turn once more and resolves to a more positive state towards the conclusion.

Legislation.gov.uk Statute Scraper

The following Python script can be used to scrape the full text of Public General Acts from legislation.gov.uk.

Prerequisites

The script takes a list of URLs to individual pieces of legislation from a text file. The script processes each URL one by one. The text file needs to look something like this, with each target URL on a new line:

urls.txt

http://www.legislation.gov.uk/ukpga/2016/4 
http://www.legislation.gov.uk/ukpga/2016/3 
http://www.legislation.gov.uk/ukpga/2016/2 
http://www.legislation.gov.uk/ukpga/2016/1 
http://www.legislation.gov.uk/ukpga/2017/1 
http://www.legislation.gov.uk/ukpga/2017/2

The Python script is simple enough. 

  • First, it opens urls.txt and reads the target URLs line by line
  • For each target URL, the title of the legislation is captured (this is used to name the output files)
  • Each URL is processed sequentially and the contents of the relevant part of the HTML markup are scraped
  • The scraped material is written to a text file and a prettified HTML file.

Scraper.py

# Environment

import requests
import time
import io
from bs4 import BeautifulSoup
from urllib import urlopen

# Get the text file with the URLs to be scraped and scrape the target section in each page

print "\n\nScraping URLs in urls.txt...\n\n"

with open('urls.txt') as inf:
    
    # Get each url on each line in urls.txt
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urlopen(url)
        soup = BeautifulSoup(site, "lxml")
        
        # Scrape the name of the legislation in each target url for use when saving the output to a file
        
        for legName in soup.find_all("h1", {"class": "pageTitle"}):
            actTitle = legName.text
            print 'Scraping ' + actTitle + ' ...\n'

        # Scrape stuff in <div id="viewLegContents"></div>

        for item in soup.find_all("div", {"id": "viewLegContents"}):

            # Write what we've scraped, with UTF-8 encoding, as text to a new text file - one file per url
            with io.open(actTitle + '.txt', 'w', encoding='utf-8') as g:
                g.write(item.text)

            # Write what we've scraped to an html file - one file per url
            with open(actTitle + '.html', 'w') as g:
                g.write(item.prettify('utf-8'))

print "\n\nDone! Files created.\n"

Respect the source of the data you're scraping

The people behind legislation.gov.uk have done everyone a big favour in making their information so accessible. The least we can do is to be respectful of their servers when performing scraping tasks like this. If you plan on running this, I'd strongly urge you to break your urls.txt input into small chunks, pause between requests (see the sketch below) and, if you go on to reuse the data, remember to acknowledge the source of that data.
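
The easiest way to pause between requests is with time.sleep, which is presumably why time is imported at the top of the script. A minimal sketch (the two-second delay is an arbitrary choice):

        # at the bottom of the "for url in urls:" loop in Scraper.py
        time.sleep(2)  # wait a couple of seconds before moving on to the next URL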

Rapid Keyword Extraction of Donoghue v Stevenson

Sometimes it would be really handy to be able to quickly and accurately extract keywords from a large corpus of documents. It is quite easy to foresee such a use-case arising in legal publishing, for example. 

RAKE (Rapid Automatic Keyword Extraction) is a Python natural language processing module that goes a long way in dealing with this use-case. 

I was interested in putting RAKE to the test and thought I'd pit the algorithm against what is perhaps the most well-known piece of case law in the common law world: Donoghue v Stevenson (of snail and ginger beer fame). 

What follows is the basic "working out" of the code and the results of the first pass. For anyone interested in replicating this experiment or doing some keyword extraction of their own, see this excellent tutorial - you'll see that my own code follows it closely.

IMPORT THE RELEVANT LIBRARIES

import rake 
import operator

INITIALISE RAKE

rake_object = rake.Rake("smartstoplist.txt", 5, 5, 7)

This line of code does the following:

  • Creates a RAKE object that extracts keywords where (i) each word has at least 5 characters; (ii) each phrase has at most 5 words; and (iii) each keyword appears in the text at least 7 times
  • Hits the text file with a list of stop words to remove textual noise

GET THE TEXT

Now we open the text file (in this test, I've saved the judgment in Donoghue as a text file) and save it in a variable:

judgment = open("dono.txt","r") 
text = judgment.read()

RUN RAKE AND PRINT THE KEYWORDS

Now we're ready to run RAKE over the text to get the keywords:

keywords = rake_object.run(text) 
print (keywords)
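
The operator import comes in handy if you only want to see the highest-scoring keywords rather than the full list. A minimal sketch (run() appears to return the (keyword, score) pairs already ranked, so the sort here is just belt and braces):

# sort the (keyword, score) pairs by score, highest first, and keep the top ten
top_keywords = sorted(keywords, key=operator.itemgetter(1), reverse=True)[:10]
for phrase, score in top_keywords:
    print("%s: %.2f" % (phrase, score))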

THE OUTPUT

The following keywords (along with their scores) were returned:

[('give rise', 4.300000000000001), ('common law', 4.184313725490196), ('duty owed', 4.154061624649859), ('ordinary care', 4.115278543849972), ('reasonable care', 4.093482554312047), ('skivington lr 5', 4.050000000000001), ('lake & elliot', 4.0), ('pender 11 qb', 3.966666666666667), ('present case', 3.7993197278911564), ('defective', 1.7619047619047619), ('present', 1.7380952380952381), ('principles', 1.7333333333333334), ('dangerous', 1.6491228070175439), ('exercise', 1.588235294117647), ('cases', 1.5875), ('bottles', 1.5833333333333333), ('liability', 1.5789473684210527), ('relationship', 1.5555555555555556), ('court', 1.5365853658536586), ('supplying', 1.5), ('appears', 1.4761904761904763), ('principle', 1.4736842105263157), ('allowed', 1.4545454545454546), ('party', 1.4375), ('nature', 1.4210526315789473), ('warranty', 1.4166666666666667), ('goods', 1.4090909090909092), ('thing', 1.4090909090909092), ('articles', 1.4), ('condition', 1.4), ('appellant', 1.3953488372093024), ('injured', 1.3863636363636365), ('alleged', 1.375), ('bought', 1.3636363636363635), ('stated', 1.3636363636363635), ('examination', 1.3636363636363635), ('opportunity', 1.3636363636363635), ('appeal', 1.3333333333333333), ('support', 1.3333333333333333), ('defect', 1.3333333333333333), ('decided', 1.3333333333333333), ('relation', 1.3333333333333333), ('bottle', 1.3225806451612903), ('matter', 1.3125), ('authorities', 1.3125), ('injury', 1.3076923076923077), ('carelessness', 1.3076923076923077), ('judgment', 1.3055555555555556), ('proposition', 1.3043478260869565), ('recover', 1.3), ('referred', 1.3), ('circumstances', 1.2972972972972974), ('supplied', 1.2857142857142858), ('found', 1.2857142857142858), ('based', 1.2777777777777777), ('defendant', 1.2666666666666666), ('liable', 1.263157894736842), ('article', 1.26), ('manufactured', 1.25), ('lordships', 1.25), ('danger', 1.25), ('means', 1.25), ('poison', 1.25), ('inspection', 1.2307692307692308), ('purchaser', 1.2272727272727273), ('george', 1.2272727272727273), ('person', 1.2222222222222223), ('courts', 1.2222222222222223), ('house', 1.2105263157894737), ('plaintiff', 1.2096774193548387), ('chattel', 1.2), ('decision', 1.1935483870967742), ('entitled', 1.1818181818181819), ('authority', 1.1666666666666667), ('vendor', 1.1666666666666667), ('dicta', 1.1666666666666667), ('premises', 1.1538461538461537), ('repair', 1.1538461538461537), ('question', 1.1515151515151516), ('pursuer', 1.1428571428571428), ('manufacturer', 1.1384615384615384), ('facts', 1.1333333333333333), ('persons', 1.1333333333333333), ('subject', 1.125), ('class', 1.125), ('scotland', 1.125), ('evidence', 1.125), ('manufacturers', 1.125), ('defender', 1.125), ('contents', 1.1176470588235294), ('words', 1.1), ('longmeid', 1.1), ('holliday 6', 1.1), ('exist', 1.1), ('consequence', 1.1), ('negligence', 1.0985915492957747), ('contract', 1.0918367346938775), ('difficult', 1.0833333333333333), ('proved', 1.0833333333333333), ('respect', 1.0833333333333333), ('respondent', 1.08), ('consumer', 1.0789473684210527), ('proof', 1.0714285714285714), ('regard', 1.0714285714285714), ('manufacture', 1.0666666666666667), ('knowledge', 1.0666666666666667), ('england', 1.0588235294117647), ('langridge', 1.0555555555555556), ('action', 1.0476190476190477), ('opinion', 1.0357142857142858), ('lords', 1.0), ('ginger', 1.0), ('retailer', 1.0), ('result', 1.0), ('neglect', 1.0), ('division', 1.0), ('ground', 1.0), ('fraud', 1.0), ('judgments', 1.0), ('parke', 1.0), ('levy 2', 1.0), ('winterbottom', 1.0), ('wright 10', 1.0), 
('stranger', 1.0), ('coach', 1.0), ('reason', 1.0), ('blacker', 1.0), ('breach', 1.0), ('skill', 1.0), ('parties', 1.0), ('brett', 1.0), ('heaven', 1.0), ('point', 1.0), ('treated', 1.0), ('property', 1.0), ('purpose', 1.0), ('thought', 1.0), ('existence', 1.0), ('pointed', 1.0), ('argument', 1.0), ('defendants', 1.0), ('hamilton', 1.0), ('contention', 1.0), ('mullen', 1.0), ('barr &', 1.0), ('defenders', 1.0), ('members', 1.0), ('remote', 1.0), ('bridge', 1.0)]

I was fairly chuffed with these results given it was the first attempt. The key seems to be getting the right balance of parameters when setting the object up. But, it's good to see terms like duty owed and reasonable care appearing at the top of the results. 

It definitely needs some fine tuning and probably an expansion of the stop list, but it's a good start.