Data Vis

Visually clustering case law

I’ve been experimenting with a Python package called Yellowbrick, which provides a suite of visualisers built for gaining an insight into a dataset when working on machine learning problems.

One of my side projects at the moment is looking at ways in which an unseen case law judgment can be processed to determine its high-level subject matter (e.g. Crime, Tort, Human Rights, Occupiers’ Liability etc.) with text classification.

I did a quick experiment with Yellowbrick to visualise a small portion of the dataset I’m going to use to train a classifier. And here it is:

tSNE projection of 14 case law topic clusters using Yellowbrick

The chart above is called a TSNE (t-distributed stochastic neighbour embedding) projection. Each blob on the chart represents a judgment.

I was pretty pleased with this and several quick insights surfaced:

  • Crime may be a bit over-represented in my training data; I’ll need to cut it back a bit

  • Some of the labels in the data may need a bit of merging; for example, “false imprisonment” can probably be handled by the “crime” data

  • There are a couple of interesting sub-clusters within the crime data (I’m guessing one of the clusters will be evidence and the other sentencing)

  • Human Rights as a topic sits right in the middle of the field between the crime cluster and the clusters of non-criminal topics

The code

Dependencies

from yellowbrick.text import TSNEVisualizer
from yellowbrick.style import set_palette
from sklearn.feature_extraction.text import TfidfVectorizer
import os

from sklearn.utils import Bunch  # sklearn.datasets.base.Bunch in older scikit-learn versions
from tqdm import tqdm
import matplotlib.pyplot as plt

set_palette('paired')

Load the corpus

def load_corpus(path):
    """
    Loads and wrangles the passed in text corpus by path.
    """

    # Check if the data exists, otherwise download or raise
    if not os.path.exists(path):
        raise ValueError((
            "'{}' dataset has not been downloaded, "
            "use the yellowbrick.download module to fetch datasets"
        ).format(path))

    # Read the directories in the directory as the categories.
    categories = [
        cat for cat in os.listdir(path)
        if os.path.isdir(os.path.join(path, cat))
    ]

    files  = [] # holds the file names relative to the root
    data   = [] # holds the text read from the file
    target = [] # holds the string of the category

    # Load the data from the files in the corpus
    for cat in categories:
        for name in os.listdir(os.path.join(path, cat)):
            files.append(os.path.join(path, cat, name))
            target.append(cat)

            with open(os.path.join(path, cat, name), 'r') as f:
                data.append(f.read())


    # Return the data bunch for use similar to the newsgroups example
    return Bunch(
        categories=categories,
        files=files,
        data=data,
        target=target,
    )

corpus = load_corpus('/Users/danielhoadley/Desktop/common_law_subset')

Vectorise and transform the data

tfidf  = TfidfVectorizer(use_idf=True)
docs   = tfidf.fit_transform(corpus.data)
labels = corpus.target

Generate the visualisation

tsne = TSNEVisualizer(size=(1080, 720), title="Case law clusters")
tsne.fit(docs, labels)
tsne.poof()
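
Since the end goal mentioned above is to train a classifier on this data, here is a minimal sketch of how the same Bunch and TF-IDF features could feed one. The choice of LinearSVC and the train/test split are illustrative assumptions on my part, not the model I have settled on:

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Illustrative only: reuse the TF-IDF matrix (docs) and topic labels built above
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=42
)

# A linear SVM is a common baseline for sparse TF-IDF text features
clf = LinearSVC()
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))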




Case law network graph

I recently posted a few images of a network graph I built with Neo4j depicting the connections between English cases. This article serves as a quick write up on how the graph database and the visualisations were produced.

Distant view of the case law citation graph

Data

The data driving the network graph was derived from a subset of XML versions of cases reported by the Incorporated Council of Law Reporting for England and Wales. I used a simple Python script to iterate over the files and capture (a) the citation (e.g. [2010] 1 WLR 1) associated with the file -- source; and (b) all of the citations to other cases within this file -- each outward citation from the source is the target. This was pulled into CSV format, like so:

Source,Target
[2015] 1 WLR 3238,[2015] AC 129
[2015] 1 WLR 3238,[2013] 1 WLR 366
[2015] 1 WLR 3238,[2011] 1 WLR 980

In the snippet of data above, [2015] 1 WLR 3238 can be seen to have CITED three cases, [2015] AC 129, [2013] 1 WLR 366 and [2011] 1 WLR 980. Moreover, [2015] AC 129 can be seen to have been CITED_BY [2015] 1 WLR 3238.
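
For anyone curious, a minimal sketch of the sort of extraction script described above, assuming the source citation and the outward citations can be pulled from each file with a regular expression (the actual script works against the reports' XML structure, so the pattern, file paths and "first citation is the source" assumption here are illustrative):

import csv
import glob
import re

# Rough pattern for citations like "[2015] 1 WLR 3238" or "[2015] AC 129"
CITATION = re.compile(r"\[\d{4}\]\s+(?:\d+\s+)?[A-Z][A-Za-z]*(?:\s+[A-Za-z]+)*\s+\d+")

with open("citings.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Source", "Target"])

    for path in glob.glob("cases/*.xml"):
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()

        citations = CITATION.findall(text)
        if not citations:
            continue

        # Assume the first citation found is the report's own citation (the source);
        # everything after it is treated as an outward citation (a target)
        source, targets = citations[0], citations[1:]
        for target in targets:
            writer.writerow([source, target])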

Importing the data into Neo4j

The data was imported into Neo4j with the following Cypher query:

USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM "file:///citings.csv" AS row
MERGE (c:Case {Name:toString(row.Source)})
MERGE (d:Case {Name:toString(row.Target)})
MERGE (c) -[:CITED]-> (d)
MERGE (d) -[:CITED_BY] -> (c)

The query above is a standard import query that created a node (:Case) for each unique citation in the source data and then constructed two relationships, :CITED and :CITED_BY, between each pair of nodes where a citation existed.

View of a small portion of the graph from the Neo4j browser

Calculating the transitive importance of the cases in the graph

With the graph pretty much built, I wanted to get a sense of which cases were most important in the graph, and the PageRank algorithm was used to achieve this:

CALL algo.pageRank('Case', 'CITED_BY',{write: true, writeProperty:'pagerank'})

This stored each case's PageRank as a property, pagerank, on the case node.

It was then possible to identify the ten most important cases in the network by running:

MATCH (c:Case) 
RETURN c.Name, c.pagerank 
ORDER BY c.pagerank DESC LIMIT 10

Which returned:

c.Name,c.pagerank
[2014] 3 WLR 535,15.561027
[2016] Bus LR 1337,13.3335
[2009] 3 WLR 369,11.5683645
[2000] 1 WLR 2068,11.149255000000002
[2009] 3 WLR 351,10.952590499999998
[1996] 1 WLR 1460,10.657869999999999
[2002] 2 WLR 578,9.848398000000001
[2000] 3 WLR 1855,9.2526755
[2005] 1 WLR 2668,8.36525
[2005] 3 WLR 1320,7.990162000000001

Visualising the graph

To render the graph in the browser, I used neovis.js. The code for the browser render:

<html>
    <head>
        <title>DataViz</title>
        <style type="text/css">
            body {font-family: 'Gotham' !important}
            #viz {
                width: 900px;
                height: 700px;
            }
        </style>
        <script src="https://rawgit.com/neo4j-contrib/neovis.js/master/dist/neovis.js"></script>
    </head>   
    <script>
        function draw() {
            var config = {
                container_id: "viz",
                server_url: "bolt://localhost:7687",
                server_user: "beans",
                server_password: "sausages",
                labels: {
                    "Case": {
                        caption: "Name",
                        size: "pagerank",
                    }
                },
                relationships: {
                    "CITED_BY": {
                        caption: false,
                    }
                },
                initial_cypher: "MATCH p=(:Case)-[:CITED]->(:Case) RETURN p LIMIT 5000"
            }
            var viz = new NeoVis.default(config);
            viz.render();
        }
    </script>
    <body onload="draw()">
        <div id="viz"></div>
    </body>
</html>
Visualisation with neovis.js

To add colour to the various groups of cases in the graph, I used a hacky implementation of the label propagation community detection algorithm (I say hacky, because I didn't set any seed labels).

CALL algo.labelPropagation('Case', 'CITED_BY','OUTGOING',
  {iterations:10,partitionProperty:'partition', write:true})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, write, partitionProperty;

The neovis.js config could then be updated with a "community" attribute to generate different colours for each community of cases:

<html>
    <head>
        <title>DataViz</title>
        <style type="text/css">
            body {font-family: 'Gotham' !important}
            #viz {
                width: 900px;
                height: 700px;
            }
        </style>
        <script src="https://rawgit.com/neo4j-contrib/neovis.js/master/dist/neovis.js"></script>
    </head>   
    <script>
        function draw() {
            var config = {
                container_id: "viz",
                server_url: "bolt://localhost:7687",
                server_user: "sausages",
                server_password: "beans",
                labels: {
                    "Case": {
                        caption: "Name",
                        size: "pagerank",
                        community: "partition"
                    }
                },
                relationships: {
                    "CITED_BY": {
                        caption: false,    
                    }
                },
                initial_cypher: "MATCH p=(:Case)-[:CITED]->(:Case) RETURN p LIMIT 5000"
            }
            var viz = new NeoVis.default(config);
            viz.render();
        }
    </script>
    <body onload="draw()">
        <div id="viz"></div>
    </body>
</html>

Part 3: Open Access To English Case Law (The Raw Data)

I started writing in the spring of this year about the state of open access to case law in the UK, with a particular focus on judgments given in the courts of England and Wales. 

The gist of my assessment of the state of open access to judgments via the British open law apparatus is set out here, but boils down to:

  • Innovation in the open case law space in the UK is stuck in the mud
  • BAILII is lagging behind comparable projects taking place elsewhere in the common law world: CanLII and CaseText are excellent examples of what's possible.
  • Insufficient focus, if any, is being directed to improving open access to English case law.

In a subsequent article, I explored the value in providing open and free online access to the decisions of judges. I identified four bases upon which open access can be shown to be a worthwhile endeavour: (i) the promotion of the rule of law; (ii) equality of arms, particularly for self-represented litigants; (iii) legal dispute reduction; and (iv) transparency.

In the same article, I developed a rough and ready definition of what "open access to case law" means:

"Open access to case law" isn't a "thing", it's a goal. The goal, at least to my mind, boils down to providing access that is free at the point of delivery to the text of every judgment given in every case by every court of record (i.e. every court with the power to give judgments that have the potential to be binding on lower and co-ordinate courts) in the jurisdiction.

My overriding concern is that a significant number of judgments do not make their way to BAILII and are only accessible to paying subscribers of subscription databases, effectively creating a "haves and have-nots" scenario where comprehensive access to the decisions of judges depends on the ability to pay for it. The gaps in BAILII's coverage were discussed in this article.

In this article I go deeper into exploring how big the gaps are in BAILII's coverage when compared to the coverage of judgments provided by three subscription-based research platforms: JustisOne, LexisLibrary and WestlawUK. 

Aim

The aim of the study was to gather data on the coverage provided by BAILII, JustisOne, LexisLibrary and WestlawUK of judgments given in the following courts between 2007 and 2017:

  • Administrative and Divisional Court
  • Chancery Division
  • Court of Appeal (Civil Division)
  • Court of Appeal (Criminal Division)
  • Commercial Court
  • Court of Protection
  • Family Court
  • Family Division
  • Patents Court
  • Queen's Bench Division
  • Technology and Construction Court

Methodology

The way in which year-on-year counts of judgments given in a particular court can be obtained varies from platform to platform. Accordingly, the following method was devised to extract the data from each platform:

BAILII

BAILII provides an interface to browse its various databases. Within each database, it is possible to isolate a court and a year. The page for a given year of a given court sets out a list of the judgments for that year.

Each judgment appears in the underlying HTML as a list element (<li> ... </li>). For example,

<li><a href="/ew/cases/EWCA/Crim/2017/17.html">Abi-Khalil &amp; Anor, R v </a><a title="Link to BAILII version" href="/ew/cases/EWCA/Crim/2017/17.html">[2017] EWCA Crim 17</a> (13 January 2017)</li>

A count of the <li> ... </li> elements on each page yields the total number of judgments for that court and year.
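
As a rough illustration, counting those list elements from a saved copy of a year page can be done with a few lines of Python (BeautifulSoup and the file name here are my own choices rather than part of the original scraping setup):

from bs4 import BeautifulSoup

# Parse a locally saved BAILII year page, e.g. the 2017 EWCA (Crim) listing
with open("ewca_crim_2017.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Each judgment appears as an <li> element, so the count of <li> tags
# gives the number of judgments listed for that court and year
judgment_count = len(soup.find_all("li"))
print(judgment_count)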

JustisOne, LexisLibrary & WestlawUK

The three subscriber platforms were approached differently. A list of search strategies based on the neutral citation for each court was constructed.

For example, to query judgments given in the Criminal Division of the Court of Appeal in 2017, the following query was constructed:

2017 ewca crim

A query for each court and each year was constructed and then submitted via the platform's "citation" search field. The total number of judgments yielded by the query was extracted by capturing the count of results from the platform's underlying HTML.
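
As a rough sketch, generating those queries programmatically might look something like this (the list of neutral citation abbreviations is illustrative, not the exact set used in the study):

# Illustrative neutral citation abbreviations for the courts studied
courts = ["ewca crim", "ewca civ", "ewhc admin", "ewhc ch", "ewhc qb", "ewhc fam"]
years = range(2007, 2018)

# One query per court per year, e.g. "2017 ewca crim", to be submitted
# via each platform's citation search field
queries = [f"{year} {court}" for year in years for court in courts]
print(queries[:3])  # ['2007 ewca crim', '2007 ewca civ', '2007 ewhc admin']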

The Data

The data captured is available here in raw form. The code used to generate the visualisation in this article is available here as a Jupyter Notebook.

Annual coverage by publisher

The following graph provides an overview of annual coverage across all of the courts studied, broken down by publisher. A few points leap out of the graph:

  • BAILII's coverage of judgments is far lower than that provided by the three subscription-based platforms, running at a rough average of 2,500-3,000 judgments per year.
  • Save for a swing in LexisLibrary's favour in 2011, JustisOne consistently provides the most comprehensive coverage of judgments.
  • From 2012, LexisLibrary has closely tracked JustisOne's coverage.
  • There is a sharp and sudden proportional drop in coverage from 2014 across all four platforms.

The key takeaway from this graph is that a significant number of judgments never make it onto BAILII every year.

Annual count of judgments by publisher (BAILII, Justis, LexisNexis, WestlawUK), 2007-2017

The following graph provides an alternative view of the same data. 

Alternative view of the annual count of judgments by publisher, 2007-2017

Total coverage of court by publisher

This graph provides an overview of how each publisher fares in terms of coverage of the courts included in the study. By and large, there is a healthy degree of parity in coverage of the following courts across all four publishers:

  • Chancery Division
  • Commercial Court
  • Court of Protection
  • Family Court
  • Family Division
  • Technology and Construction Court

However, BAILII is struggling to keep up with the levels of comprehensiveness provided by the commercial publishers in the Administrative Court, both divisions of the Court of Appeal and the Queen's Bench Division. 

The dearth in coverage of judgments from the Criminal Division on BAILII is especially startling, particularly given rising numbers of criminal defendants lacking representation at the sentencing stage. Intuitively (though I have not confirmed this), the deficit in BAILII's coverage of the Criminal Division will almost certainly consist of judgments following an appeal against sentence.

Total count of judgments by court (Admin, Ch, Civ, Comm, Crim, EWCOP, EWFC, Fam, Patents, QB, TCC) and publisher

(Interim) Conclusion

The data shows that BAILII is providing partial access to the overall corpus of judgments handed down in the courts studied. This, as I have previously been at pains to stress, is not down to any failing on BAILII's part. Rather, it is a symptom of how hopeless existing systems (such as they are) are at servicing BAILII with a comprehensive flow of cases to publish, particularly judgments given extempore. 

It also bears saying that the commercial publishers do not in any way obstruct BAILII from acquiring the material. A fuller discussion of the mechanics driving the problem will appear here soon.

Sentiment in Case Law

For the past few months, I've been exploring various methods of unlocking interesting data from case law. This post discusses the ways in which data science techniques can be used to analyse the sentiment of the text of judgments.

The focus in this post is mainly technical and describes the steps I've taken using a statistical programming language, called R, to extract an "emotional" profile of three cases.

I have yet to arrive at any firm hypothesis of how this sort of technique can be used to draw conclusions that would necessarily be of use in court or during case preparation, but I hope some of the initial results canvassed below are of interest to a few.


TidyText is an incredibly effective and approachable package in R for text mining that I stumbled across when flicking through some of the rstudio::conf 2017 materials a few days ago.

There's loads of information available about the TidyText package, along with its underlying philosophy, but this post focuses on an implementation of one small aspect of the package's power: the ability to analyse and plot the sentiment of words in documents.

MY TEST DATA

I'm using a small dataset for this walkthrough that consists of three court judgments: two judgments of the UK Supreme Court and one from its predecessor, the Judicial Committee of the House of Lords:

The subject matter of the dataset isn't really that important. My purpose here is to use the tools TidyText makes available to chart the emotional attributes of the words used in these judgments over the course of each document. 

GET THE DATA READY FOR ANALYSIS

First off, we need to get our environment ready and set our working directory:

# Environment
library(tidyverse)
library(tidytext)
library(stringr)

# Data - setwd to the folder that contains the data you want to work with
setwd("~/Documents/R/TidyText")

Next, we're going to get the text of the files (in my case, the three judgments) into a data frame:

case_words <- data_frame(file = paste0(c("evans.txt", "miller.txt", "donoghue.txt"))) %>%
  mutate(text = map(file, read_lines))

This gives us a tibble with one row per file: the file name and a list-column holding the text read from that file. We now need to unnest that tibble so that we have the document, line numbers and words as columns.

case_words <- case_words %>%
  unnest() %>%
  mutate(line_number = 1:n(),
         file = str_sub(basename(file), 1))

# Relevel the file factor (the values retain their .txt extension)
case_words$file <- forcats::fct_relevel(case_words$file, c("evans.txt", "miller.txt", "donoghue.txt"))

We get a tibble that looks like this:

# A tibble: 96,318 × 3
        file line_number      word
      <fctr>       <int>     <chr>
1  evans.txt           1      lord
2  evans.txt           1 neuberger
3  evans.txt           1      with
4  evans.txt           1      whom
5  evans.txt           1      lord
6  evans.txt           1      kerr
7  evans.txt           1       and
8  evans.txt           1      lord
9  evans.txt           1      reed
10 evans.txt           1     agree
# ... with 96,308 more rows

You can check the state of your table at this point by running,

head(case_words)

The last thing we need to do before we're ready to begin computing the sentiment of the data is to tokenise the words in our tibble:

case_words <- case_words %>%
  unnest_tokens(word, text)

SENTIMENT ANALYSIS OF THE JUDGMENTS

We are ready to start analysing the sentiment of the data. TidyText is armed with three different sentiment dictionaries: afinn, nrc and Bing. The first thing we're going to do is get a bird's-eye view of the different sentiment profiles of each judgment using the nrc dictionary and plot the results using ggplot:

case_words %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(index = line_number %/% 20, file, sentiment) %>%
  summarize(n = n()) %>%
  ggplot(aes(x = index, y = n, fill = file)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  facet_wrap(~ sentiment, ncol = 3)

A bird's-eye view of the ten emotional profiles of each judgment

The x-axis of each graph represents the position within each document from beginning to end; the y-axis quantifies the intensity of the sentiment under analysis.

We can get a closer look at the emotional fluctuation by plotting an analysis using the afinn and Bing dictionaries:

case_words %>%
  left_join(get_sentiments("bing")) %>%
  left_join(get_sentiments("afinn")) %>%
  group_by(index = line_number %/% 20, file) %>%
  summarize(afinn = mean(score, na.rm = TRUE),
            bing = sum(sentiment == "positive", na.rm = TRUE) - sum(sentiment == "negative", na.rm = TRUE)) %>%
  gather(lexicon, lexicon_score, afinn, bing) %>%
  ggplot(aes(x = index, y = lexicon_score, colour = file)) +
  geom_smooth(stat = "identity") +
  facet_wrap(~ lexicon, scale = "free_y") +
  scale_x_continuous("Location in judgment", breaks = NULL) +
  scale_y_continuous("Lexicon Score")

Sentiment curves using afinn and bing sentiment dictionaries

FINDINGS

The Bing analysis (pictured right) appears to provide a slightly more stable view. Instances moving above the zero line indicate positive emotion; instances moving below the zero line indicate negative emotion.

For example, if we take the judgment in R (Miller), we can see that the first half of the judgment is broadly positive, but then dips suddenly around the middle of the judgment. The line indicates that the second half of the judgment text is slightly more negative than the first half, but rises to its peak of positivity just before the end of the text.

The text of the judgment in Donoghue is considerably more negative. The curve sharply dips as the judgment opens, turns more positive towards the middle of the document, takes a negative turn once more and resolves to a more positive state towards the conclusion.