Visually clustering case law

I’ve been experimenting with a Python package called Yellowbrick, which provides a suite of visualisers built for gaining an insight into a dataset when working on machine learning problems.

One of my side projects at the moment is looking at ways in which an unseen case law judgment can be processed to determine its high level subject matter (e.g. Crime, Tort, Human Rights, Occupiers’ Liability etc) with text classification.

I did a quick experiment with Yellowbrick to visualise a small portion of the dataset I’m going to use to train a classifier. And here it is:

tSNE projection of 14 case law topic clusters using Yellowbrick

The chart above is called a TSNE (t-distributed stochastic neighbour embedding) projection. Each blob on the chart represents a judgment.

I was pretty pleased with this and several quick insights surfaced:

  • Crime may be a bit over-represented in my training data, so I’ll need to cut it back a bit

  • Some of the labels in the data may need a bit of merging; for example, “false imprisonment” can probably be handled by the “crime” data

  • There are a couple of interesting sub-clusters within the crime data (I’m guessing one of the clusters will be evidence and the other sentencing)

  • Human Rights as a topic sits right in the middle of the field between the crime cluster and the clusters of non-criminal topics
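A quick way to put a number on the over-representation point is to count how often each label occurs. The sketch below is illustrative only: `label_shares` and the example labels are made up for the purpose, and in practice the input would be the list of category labels loaded from the corpus.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the dataset, largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical labels, not the real training data
example_labels = ["crime"] * 6 + ["tort"] * 2 + ["human rights"] * 2
shares = label_shares(example_labels)

for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

A label whose share dwarfs the others is a candidate for down-sampling before training.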

The code

Dependencies

from yellowbrick.text import TSNEVisualizer
from yellowbrick.style import set_palette
from sklearn.feature_extraction.text import TfidfVectorizer
import os

# Bunch lives in sklearn.utils; sklearn.datasets.base is gone from recent scikit-learn releases
from sklearn.utils import Bunch
from tqdm import tqdm
import matplotlib.pyplot as plt

set_palette('paired')

Load the corpus

def load_corpus(path):
    """
    Loads and wrangles the passed in text corpus by path.
    """

    # Check if the data exists, otherwise download or raise
    if not os.path.exists(path):
        raise ValueError((
            "'{}' dataset has not been downloaded, "
            "use the yellowbrick.download module to fetch datasets"
        ).format(path))

    # Read the directories in the directory as the categories.
    categories = [
        cat for cat in os.listdir(path)
        if os.path.isdir(os.path.join(path, cat))
    ]

    files  = [] # holds the file names relative to the root
    data   = [] # holds the text read from the file
    target = [] # holds the string of the category

    # Load the data from the files in the corpus
    for cat in categories:
        for name in os.listdir(os.path.join(path, cat)):
            files.append(os.path.join(path, cat, name))
            target.append(cat)

            with open(os.path.join(path, cat, name), 'r') as f:
                data.append(f.read())


    # Return the data bunch for use similar to the newsgroups example
    return Bunch(
        categories=categories,
        files=files,
        data=data,
        target=target,
    )

corpus = load_corpus('/Users/danielhoadley/Desktop/common_law_subset')

Vectorise and transform the data

tfidf  = TfidfVectorizer(use_idf=True)
docs   = tfidf.fit_transform(corpus.data)
labels = corpus.target
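To give a feel for what the vectoriser is doing, here is a rough, dependency-free sketch of TF-IDF weighting. It is not scikit-learn's code: the function name `tfidf`, the whitespace tokenisation and the two toy documents are my own assumptions. The weighting uses a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation, which is how I understand the library's defaults to behave.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document (toy TF-IDF)."""
    # Simplified tokenisation: lower-case and split on whitespace
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalise so long and short documents are comparable
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Hypothetical documents, purely for illustration
toy_docs = ["the defendant was sentenced", "the claimant sued the occupier"]
vectors = tfidf(toy_docs)
```

Terms that appear in every document (here, "the") end up with a lower weight than terms that are distinctive to one document, which is what lets the projection separate topics.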

Generate the visualisation

tsne = TSNEVisualizer(size=(1080, 720), title="Case law clusters")
tsne.fit(docs, labels)
tsne.show()  # named poof() in Yellowbrick releases before 1.0




Case law network graph

I recently posted a few images of a network graph I built with Neo4j depicting the connections between English cases. This article serves as a quick write-up on how the graph database and the visualisations were produced.

Data

The data driving the network graph was derived from a subset of XML versions of cases reported by the Incorporated Council of Law Reporting for England and Wales. I used a simple Python script to iterate over the files and capture (a) the citation (e.g. [2010] 1 WLR 1) associated with the file -- source; and (b) all of the citations to other cases within this file -- each outward citation from the source is the target. This was pulled into CSV format, like so:

Source,Target
[2015] 1 WLR 3238,[2015] AC 129
[2015] 1 WLR 3238,[2013] 1 WLR 366
[2015] 1 WLR 3238,[2011] 1 WLR 980

In the snippet of data above, [2015] 1 WLR 3238 can be seen to have CITED three cases, [2015] AC 129, [2013] 1 WLR 366 and [2011] 1 WLR 980. Moreover, [2015] AC 129 can be seen to have been CITED_BY [2015] 1 WLR 3238.
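The extraction step can be sketched roughly as follows. This is not the original script: the regular expression, the function name `citation_edges` and the sample sentence are assumptions of mine, and the pattern only aims to catch citations shaped like the ones shown above (a bracketed year, an optional volume number, a report abbreviation and a page number).

```python
import csv
import io
import re

# Matches e.g. "[2015] 1 WLR 3238", "[2015] AC 129", "[2016] Bus LR 1337"
CITATION = re.compile(
    r"\[\d{4}\]\s(?:\d+\s)?[A-Z][A-Za-z]*(?:\s[A-Z][A-Za-z]*)*\s\d+"
)


def citation_edges(source, text):
    """Return (source, target) pairs for each distinct citation in text."""
    targets = []
    for match in CITATION.findall(text):
        # Skip self-references and repeats
        if match != source and match not in targets:
            targets.append(match)
    return [(source, target) for target in targets]


# Hypothetical judgment text, for illustration only
sample_text = (
    "Applying [2015] AC 129 and [2013] 1 WLR 366, and distinguishing "
    "[2011] 1 WLR 980, the appeal is dismissed."
)
edges = citation_edges("[2015] 1 WLR 3238", sample_text)

# Write the edges out in the same Source,Target layout
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Source", "Target"])
writer.writerows(edges)
csv_text = buffer.getvalue()
print(csv_text)
```

Real judgments will contain citation formats this pattern misses, so it should be treated as a starting point rather than a complete extractor.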

Importing the data into Neo4j

The data was imported into Neo4j with the following Cypher query:

USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM "file:///citings.csv" AS row
MERGE (c:Case {Name:toString(row.Source)})
MERGE (d:Case {Name:toString(row.Target)})
MERGE (c) -[:CITED]-> (d)
MERGE (d) -[:CITED_BY] -> (c)

The query above is a standard import query that created a node (:Case) for each unique citation in the source data and then constructed two relationships, :CITED and :CITED_BY between each node where these relationships existed.

View of a small portion of the graph from the Neo4j browser

Calculating the transitive importance of the cases in the graph

With the graph pretty much built, I wanted to get a sense of the most important case in the graph and the PageRank algorithm was used to achieve this:

CALL algo.pageRank('Case', 'CITED_BY',{write: true, writeProperty:'pagerank'})

This stored each case's PageRank as a property, pagerank, on the case node.
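For anyone unfamiliar with PageRank, the underlying idea can be sketched in a few lines of plain Python. This is not Neo4j's implementation: the function name `pagerank`, the damping factor of 0.85 and the fixed iteration count are assumptions, and the scores here are normalised to sum to 1, so they are not on the same scale as the figures Neo4j produces. Rank flows along the direction of each edge, so which relationship you run it over is a modelling choice; the toy example uses the three citing-to-cited pairs from the CSV sample.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is shared out evenly
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Each node passes its rank, split equally, along its outgoing edges
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


cited_edges = [
    ("[2015] 1 WLR 3238", "[2015] AC 129"),
    ("[2015] 1 WLR 3238", "[2013] 1 WLR 366"),
    ("[2015] 1 WLR 3238", "[2011] 1 WLR 980"),
]
ranks = pagerank(cited_edges)
for case, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(case, round(score, 4))
```

In this toy graph the three cited cases each end up with a higher score than the case that cites them.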

It was then possible to identify the ten most important cases in the network by running:

MATCH (c:Case) 
RETURN c.Name, c.pagerank 
ORDER BY c.pagerank DESC LIMIT 10

Which returned:

c.Name,c.pagerank
[2014] 3 WLR 535,15.561027
[2016] Bus LR 1337,13.3335
[2009] 3 WLR 369,11.5683645
[2000] 1 WLR 2068,11.149255000000002
[2009] 3 WLR 351,10.952590499999998
[1996] 1 WLR 1460,10.657869999999999
[2002] 2 WLR 578,9.848398000000001
[2000] 3 WLR 1855,9.2526755
[2005] 1 WLR 2668,8.36525
[2005] 3 WLR 1320,7.990162000000001

Visualising the graph

To render the graph in the browser, I used neovis.js. The code for the browser render:

<html>
    <head>
        <title>DataViz</title>
        <style type="text/css">
            body {font-family: 'Gotham' !important}
            #viz {
                width: 900px;
                height: 700px;
            }
        </style>
        <script src="https://rawgit.com/neo4j-contrib/neovis.js/master/dist/neovis.js"></script>
    </head>   
    <script>
        function draw() {
            var config = {
                container_id: "viz",
                server_url: "bolt://localhost:7687",
                server_user: "beans",
                server_password: "sausages",
                labels: {
                    "Case": {
                        caption: "Name",
                        size: "pagerank",
                    }
                },
                relationships: {
                    "CITED_BY": {
                        caption: false,
                    }
                },
                initial_cypher: "MATCH p=(:Case)-[:CITED]->(:Case) RETURN p LIMIT 5000"
            }
            var viz = new NeoVis.default(config);
            viz.render();
        }
    </script>
    <body onload="draw()">
        <div id="viz"></div>
    </body>
</html>

Visualisation with neovis.js

To add colour to the various groups of cases in the graph, I used a hacky implementation of the label propagation community detection algorithm (I say hacky, because I didn't set any seed labels).

CALL algo.labelPropagation('Case', 'CITED_BY','OUTGOING',
  {iterations:10,partitionProperty:'partition', write:true})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, write, partitionProperty;
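As a rough illustration of what label propagation does, here is a small Python sketch. It is not the Neo4j procedure: the function name `label_propagation`, the decision to treat edges as undirected, and the tie-breaking rule (pick the smallest label) are all assumptions made to keep the example deterministic. Every node starts with its own label and repeatedly adopts the label most common among its neighbours.

```python
from collections import Counter


def label_propagation(edges, iterations=10):
    """Assign community labels by repeated neighbour voting."""
    neighbours = {}
    for a, b in edges:
        # Treat every edge as undirected for voting purposes
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # No seed labels: each node begins in its own community
    labels = {node: node for node in neighbours}
    for _ in range(iterations):
        changed = False
        for node in sorted(neighbours):
            votes = Counter(labels[n] for n in neighbours[node])
            top = max(votes.values())
            # Break ties by taking the alphabetically smallest label
            best = min(label for label, count in votes.items() if count == top)
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Hypothetical graph: two separate triangles of cases
toy_edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("X", "Y"), ("Y", "Z"), ("X", "Z"),
]
communities = label_propagation(toy_edges)
print(communities)
```

The two triangles settle into two communities, which is the kind of partition value the visualisation can then map to colours.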

The neovis.js configuration could then be updated with a "community" attribute to generate different colours for each community of cases:

<html>
    <head>
        <title>DataViz</title>
        <style type="text/css">
            body {font-family: 'Gotham' !important}
            #viz {
                width: 900px;
                height: 700px;
            }
        </style>
        <script src="https://rawgit.com/neo4j-contrib/neovis.js/master/dist/neovis.js"></script>
    </head>   
    <script>
        function draw() {
            var config = {
                container_id: "viz",
                server_url: "bolt://localhost:7687",
                server_user: "sausages",
                server_password: "beans",
                labels: {
                    "Case": {
                        caption: "Name",
                        size: "pagerank",
                        community: "partition"
                    }
                },
                relationships: {
                    "CITED_BY": {
                        caption: false,    
                    }
                },
                initial_cypher: "MATCH p=(:Case)-[:CITED]->(:Case) RETURN p LIMIT 5000"
            }
            var viz = new NeoVis.default(config);
            viz.render();
        }
    </script>
    <body onload="draw()">
        <div id="viz"></div>
    </body>
</html>

Part 3: Open Access To English Case Law (The Raw Data)

I started writing in the spring of this year about the state of open access to case law in the UK, with a particular focus on judgments given in the courts of England and Wales. 

The gist of my assessment of the state of open access to judgments via the British open law apparatus is set out here, but boils down to:

  • Innovation in the open case law space in the UK is stuck in the mud
  • BAILII is lagging behind comparable projects taking place elsewhere in the common law world: CanLII and CaseText are excellent examples of what's possible.
  • Insufficient focus, if any, is being directed to improving open access to English case law.

In a subsequent article, I explored the value in providing open and free online access to the decisions of judges. I identified four bases upon which open access can be shown to be a worthwhile endeavour: (i) the promotion of the rule of law; (ii) equality of arms, particularly for self-represented litigants; (iii) legal dispute reduction; and (iv) transparency.

In the same article, I developed a rough and ready definition of what "open access to case law" means:

"Open access to case law" isn't a "thing", it's a goal. The goal, at least to my mind, boils down to providing access that is free at the point of delivery to the text of every judgment given in every case by every court of record (i.e. every court with the power to give judgments that have the potential to be binding on lower and co-ordinate courts) in the jurisdiction.

My overriding concern is that a significant number of judgments do not make their way to BAILII and are only accessible to paying subscribers of subscription databases, effectively creating a "haves and have-nots" scenario where comprehensive access to the decisions of judges depends on the ability to pay for it. The gaps in BAILII's coverage were discussed in this article.

In this article I go deeper into exploring how big the gaps are in BAILII's coverage when compared to the coverage of judgments provided by three subscription-based research platforms: JustisOne, LexisLibrary and WestlawUK. 

Aim

The aim of the study was to gather data on the coverage provided by BAILII, JustisOne, LexisLibrary and WestlawUK of judgments given in the following courts between 2007 and 2017:

  • Administrative and Divisional Court
  • Chancery Division
  • Court of Appeal (Civil Division)
  • Court of Appeal (Criminal Division)
  • Commercial Court
  • Court of Protection
  • Family Court
  • Family Division
  • Patents Court
  • Queen's Bench Division
  • Technology and Construction Court

Methodology

The way in which year-on-year counts of judgments given in a given court are handled by each of the four platforms varies from platform to platform. Accordingly, the following method was devised to extract the data from each platform:

BAILII

BAILII provides an interface to browse its various databases. Within each database, it is possible to isolate a court and a year. The page for a given year of a given court sets out a list of the judgments for that year.

Each judgment appears in the underlying HTML as a list element (<li> ... </li>). For example,

<li><a href="/ew/cases/EWCA/Crim/2017/17.html">Abi-Khalil &amp; Anor, R v </a><a title="Link to BAILII version" href="/ew/cases/EWCA/Crim/2017/17.html">[2017] EWCA Crim 17</a> (13 January 2017)</li>

Counting the <li> ... </li> elements on each page yields the total number of judgments for that court and year.
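The counting step can be sketched with Python's standard-library HTML parser. This is an illustration rather than the script used for the study: the class and function names are made up, and the extra check that a list item contains a link with "/cases/" in its href is my own assumption, added so that navigation lists are not counted as judgments.

```python
from html.parser import HTMLParser


class JudgmentCounter(HTMLParser):
    """Counts <li> elements that contain a link to a case page."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.item_has_case_link = False
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True
            self.item_has_case_link = False
        elif tag == "a" and self.in_item:
            href = dict(attrs).get("href") or ""
            if "/cases/" in href:
                self.item_has_case_link = True

    def handle_endtag(self, tag):
        if tag == "li" and self.in_item:
            # Count the item once, however many case links it holds
            if self.item_has_case_link:
                self.count += 1
            self.in_item = False


def count_judgments(html):
    parser = JudgmentCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# One judgment entry (from the example above) plus a hypothetical menu item
sample_html = (
    '<ul><li><a href="/databases.html">Databases</a></li></ul>'
    '<ul><li><a href="/ew/cases/EWCA/Crim/2017/17.html">Abi-Khalil &amp; Anor, R v </a>'
    '<a title="Link to BAILII version" href="/ew/cases/EWCA/Crim/2017/17.html">'
    "[2017] EWCA Crim 17</a> (13 January 2017)</li></ul>"
)
print(count_judgments(sample_html))
```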

JustisOne, LexisLibrary & WestlawUK

The three subscriber platforms were approached differently. A list of search strategies based on the neutral citation for each court was constructed.

For example, to query judgments given in the Criminal Division of the Court of Appeal in 2017, the following query was constructed:

2017 ewca crim

A query for each court and each year was constructed and then submitted via the platform's "citation" search field. The total number of judgments yielded by the query was extracted by capturing the count of results from the platform's underlying HTML.
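Building the list of queries is simple enough to sketch. Only the "2017 ewca crim" form comes from the example above; the other query stems in `COURT_QUERY_STEMS`, and the function name `build_queries`, are assumptions based on the usual neutral citation abbreviations, and the dictionary is deliberately incomplete.

```python
# Query stems per court. Only "ewca crim" is taken from the example above;
# the others are assumed and would need checking against each platform.
COURT_QUERY_STEMS = {
    "Court of Appeal (Criminal Division)": "ewca crim",
    "Court of Appeal (Civil Division)": "ewca civ",  # assumed
    "Administrative and Divisional Court": "ewhc admin",  # assumed
}


def build_queries(stems, first_year=2007, last_year=2017):
    """Return one 'year stem' query per court per year."""
    return [
        f"{year} {stem}"
        for stem in stems.values()
        for year in range(first_year, last_year + 1)
    ]


queries = build_queries(COURT_QUERY_STEMS)
print(len(queries), queries[:3])
```

Each query string would then be submitted to the platform's citation search and the result count recorded against the court and year.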

The Data

The data captured is available here in raw form. The code used to generate the visualisation in this article is available here as a Jupyter Notebook.

Annual coverage by publisher

The following graph provides an overview of the annual coverage for all of the courts studied by publisher. The following points leap out of the graph:

  • BAILII's coverage of judgments is far lower than that provided by the three subscription-based platforms, running at a rough average of between 2,500 and 3,000 judgments per year.
  • Save for a drop in LexisLibrary's favour in 2011, JustisOne consistently provides the most comprehensive coverage of judgments.
  • From 2012, LexisLibrary has closely tracked JustisOne's coverage.
  • There is a sharp and sudden proportional drop in coverage from 2014 across all four platforms.

The key takeaway from this graph is that a significant number of judgments never make it onto BAILII every year.

[Chart: annual count of judgments by publisher (BAILII, Justis, LexisNexis, WestlawUK), 2007 to 2017; x-axis: Year, y-axis: Count]

The following graph provides an alternative view of the same data. 

[Chart: alternative view of annual judgment counts by publisher (BAILII, Justis, LexisNexis, WestlawUK), 2007 to 2017; axes: Count and Year]

Total coverage of court by publisher

This graph provides an overview of how each publisher fares in terms of coverage of the courts included in the study. By and large, there is a healthy degree of parity in coverage of the following courts across all four publishers:

  • Chancery Division
  • Commercial Court
  • Court of Protection
  • Family Court
  • Family Division
  • Technology and Construction Court

However, BAILII is struggling to keep up with the levels of comprehensiveness provided by the commercial publishers in the Administrative Court, both divisions of the Court of Appeal and the Queen's Bench Division. 

The dearth of Criminal Division judgments on BAILII is especially startling, particularly given the rising numbers of criminal defendants lacking representation at the sentencing stage. Intuitively (though I have not confirmed this), the deficit in BAILII's coverage of the Criminal Division will almost certainly consist of judgments following an appeal against sentence.

[Chart: total count of judgments by court (Admin, Ch, Civ, Comm, Crim, EWCOP, EWFC, Fam, Patents, QB, TCC) and publisher (BAILII, Justis, LexisNexis, WestlawUK); axes: Count and Court]

(Interim) Conclusion

The data shows that BAILII is providing partial access to the overall corpus of judgments handed down in the courts studied. This, as I have previously been at pains to stress, is not down to any failing on BAILII's part. Rather, it is a symptom of how hopeless existing systems (such as they are) are at servicing BAILII with a comprehensive flow of cases to publish, particularly judgments given extempore. 

It also bears saying that the commercial publishers do not in any way obstruct BAILII from acquiring the material. A fuller discussion of the mechanics driving the problem will appear here soon.

Part 2: Open Access To English Case Law (The gaps)

This is the second substantive article in a series of pieces I am preparing in the run-up to a talk I'll be giving in June at the annual conference of the British and Irish Association of Law Librarians (BIALL).

To recap, I first issued a primer, in which I essentially say that the state of open access to case law in the UK isn't where it ought to be in 2018 and that our open case law offering is out of step with similar projects elsewhere in the common law world (e.g. Canada and the United States). 

The primer was followed by the first substantive piece, which attempted to (i) define what "open access to case law" actually means and (ii) set out four justifications for providing open access to the decisions of judges. 

Comprehensive coverage of case law

The crucial point I sought to make in the first article was all about what "open access to case law" actually means. I define (perhaps a little crudely) open access to case law in the following terms:

"Open access to case law" isn't a "thing", it's a goal. The goal, at least to my mind, boils down to providing access that is free at the point of delivery to the text of every judgment given in every case by every court of record (i.e. every court with the power to give judgments that have the potential to be binding on lower and co-ordinate courts) in the jurisdiction.

My definition places emphasis on comprehensiveness of coverage: the text of every judgment given in every case by every court of record should be freely available. I deliberately avoid folding additional requirements into the medley. I do not, for example, consider the inclusion of summaries and headnotes that explain the judgments to be part of the core mix (though, summaries are very much nice-to-haves). Nor do I say anything about technology (though, it goes without saying that delivery of the scale of comprehensiveness my definition requires could only be achieved with an online platform). 

Currently, the UK's primary open law outlet, BAILII, for reasons I'll go on to develop in the next article, is providing access to only a fraction of the judgments given in the senior courts. That this is the case, it should be noted, is through no fault on the part of BAILII.  

Gaps in BAILII's coverage

The following graph illustrates the problem. The graph is based on a count of the number of judgments given in the Court of Appeal (Criminal Division) with a [2017] EWCA Crim neutral citation. Justis, via their JustisOne platform, provide access to 1,216 judgments from the Criminal Division with a 2017 citation. WestlawUK doesn't fare quite as well, with 967 available Crim Div judgments. Now look at BAILII. Only 230 Criminal Division judgments are available for 2017.

Count of judgments with [2017] EWCA Crim citation on BAILII, JustisOne and WestlawUK

Assuming that Justis' total of 1,216 represents the total number of judgments given in the Criminal Division of the Court of Appeal bearing a [2017] neutral citation and that the 230 judgments on BAILII form part of that overall total, we can project a view of the proportion of open-access to closed-access judgments (i.e. access is restricted to an area behind a subscriber paywall).

For anyone out there under the impression that there is any semblance of symmetry between the quantity of judgments available in the open and those accessible behind a paywall, the numbers point emphatically the other way. Taking the JustisOne total of 1,216 as the definitive quantity of 2017 Criminal Division judgments, only 19 percent (less than a fifth!) are freely available. 

So far as judgments flowing from the Civil Division of the Court of Appeal are concerned, the situation is not quite as bad, though it still isn't good.

Court of Appeal (Civil Division)

Count of judgments with [2017] EWCA Civ citation on BAILII, JustisOne and WestlawUK

Again, taking the JustisOne count of 755 Civil Division judgments for 2017 as the definitive total and assuming that the 527 judgments available on BAILII are included in that total, 70 percent of Civil Division judgments are openly available. That is a good deal better than the criminal content, but it still falls well short of where it should be.
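The arithmetic behind the two percentages is easy to check. The sketch below uses the counts quoted in this article and rests on the same assumption made in the text, namely that the JustisOne figure is the full total and that BAILII's judgments are a subset of it; the function name `open_share` is just for illustration.

```python
def open_share(open_count, total_count):
    """Fraction of judgments that are freely available."""
    return open_count / total_count


# Counts quoted above for judgments with a 2017 neutral citation
crim_share = open_share(230, 1216)  # BAILII vs JustisOne, EWCA Crim
civ_share = open_share(527, 755)    # BAILII vs JustisOne, EWCA Civ

print(f"Criminal Division: {crim_share:.1%} open")
print(f"Civil Division: {civ_share:.1%} open")
```

Run as written, this gives 18.9% for the Criminal Division and 69.8% for the Civil Division, which round to the 19 and 70 percent figures used above.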

The reasons underlying the lack of symmetry between open access coverage and the coverage offered by the commercial providers boil down to the hopelessly knackered pipeline that takes the judgments (whether handed down or given extempore) further downstream (more on this in the next article). 

Finally, it also bears saying that the lack of symmetry in coverage is not the fault of the commercial suppliers. They are not operating in a way that prevents BAILII from obtaining the data itself. It's just that the commercial suppliers have precisely what BAILII lacks: the resources to navigate a system of judgment supply that is entirely unfit for purpose and left to rot for far too long.

Part 1: Open Access To English Case Law (Why Bother?)

In June 2018, I'll be giving a plenary talk at the annual meeting of the British and Irish Association of Law Librarians. The topic I've chosen for the talk is open access to English case law. 

In the run-up to the talk itself (primarily for the purposes of arranging my thinking on the content of the talk), I'll be releasing a series of articles on various aspects of the current state of open access to English case law.

I published a "primer" a couple of weeks ago, in which I essentially say that the UK is running a fair bit behind the likes of Canada and the USA in the open access to case law stakes. My view is that notwithstanding the extraordinary contribution BAILII makes to the open law space, there remains considerable room for improvement. 

This is the first substantive article in the series (at least three or four more will follow). 

This article seeks to provide an outline of my thinking on two fundamental questions:

  1. What does "open access to case law" actually mean?
  2. Why bother providing open access to case law, what's the point?

What does "open access to case law" actually mean?

"Open access to case law" isn't a "thing", it's a goal. The goal, at least to my mind, boils down to providing access that is free at the point of delivery to the text of every judgment given in every case by every court of record (i.e. every court with the power to give judgments that have the potential to be binding on lower and co-ordinate courts) in the jurisdiction.

The goal sets a high bar. But it is a goal, after all. And, the attainment of that goal doesn't necessarily require any other bells and whistles. Things like summaries that explain the judgments, beautiful web interfaces, nice APIs and AI are nice to have bonuses, but they're not essential. The goal is first and foremost about providing access to the words used by the court when giving judgment in every case.

Why bother providing open access to case law, what's the point?

If I'm right about the goals of open access to case law, this question can be reformulated as: why bother providing free access to the text of all judgments given in every court? 

Well, at least four answers spring to mind.

The classic "rule of law" answer

In common law systems like ours, judges, in a broad range of circumstances, are able to make new laws or modify the scope of existing laws. There are any number of ways of casting that statement into tighter, more legalist language, but the essential point is that the words used by judges can, and often do, change the list of rules that govern what we can and can't do and the penalties we are liable to incur if we break those rules.

Because of this, in an ideal world, We, the People, would have some way of finding out what those rules are so that we're able to regulate our conduct to ensure we don't break them and to know what our rights are if we suffer as a result of someone else's breach of the rules.

The closer we move towards the "open access to case law" goal, the closer we get to being able to identify the rules we're expected to play by. 

The "equality of arms" answer

Accurately working out what the law says on issue X, Y or Z is not easy. We will often need an expert to help us determine what the law says on issue X, Y or Z and to help us understand our position in relation to it. These experts are called lawyers and lawyers cost money (generally, lots of money). 

The party to a dispute with access to a lawyer should (if their lawyer is any good) have at least two advantages over the party that does not have access to a lawyer:

  1. They will have the advantage of an advisor with expertise in the substantive law applicable to the dispute, which gives them an obvious head start.
  2. They will have the advantage of an advisor who has the advantage of multiple, industrial-strength tools to help them determine what the applicable law actually says.

The party to the dispute who lacks the means to access these two considerable advantages is therefore obviously at a corresponding disadvantage. They're outgunned and probably outnumbered. There is an inequality of arms. Cuts in legal aid and, in many cases, the absence of legal aid altogether, increase the number of disputes in which one side is bringing a knife to a gunfight.

The closer we move towards the "open access to case law" goal, the more that state of inequality is reduced. Even if true equality cannot be achieved, some degree of access to the material governing the determination of who is likely to win and who is likely to lose begins to level the playing field. That's a good thing.

The "dispute reduction" answer

The ability to form a reasonably accurate view of what the rules say on issue X, Y or Z increases our ability to intelligently pick our battles and to nip disputes in the bud before they go anywhere near a court or some other costly method of dispute resolution. 

It may be hard to swallow, but if a litigant-to-be at least has the means of establishing that they probably don't have a leg to stand on (or has the other side bang to rights), more disagreements can be dealt with before lawyers get involved and things start to grow arms and legs. 

The closer we move towards the "open access to case law" goal, the greater our ability to resolve disputes before they morph into nasty, expensive, protracted and minified echoes of Jarndyce v Jarndyce.

The "public information" answer

Courts are public institutions financed by public funds. Judgments are their unit of activity. Those units of activity should be open to public scrutiny and study. Judgments are public information (unless there's a good reason to keep their content secret).

It may well be that nobody ever bothers to look at them. But that's not the point. The point is that if judgments are only meaningfully accessible on systems we have to pay to access, they're not meaningfully accessible to the public. 

Conclusion

In this short article, I've proposed a rough and ready definition of what "open access to case law" is and four justifications as to why it is a worthwhile pursuit. In the next article, I'm going to lock in on the nitty gritty of the state of open access to case law in the United Kingdom.

Open access to English case law (a Primer)

TL;DR

  • Innovation in the open case law space in the UK is stuck in the mud
  • BAILII is lagging behind comparable projects taking place elsewhere in the common law world: CanLII and CaseText are excellent examples of what's possible.
  • Insufficient focus, if any, is being directed to improving open access to English case law

There is a tsunami of innovation happening in the legal space right now. The problem is, so far as I can tell, none of it is being directed towards improving the way the decisions of judges in the English courts are made accessible to the wider public. 

Innovation in the pursuit of achieving broader, more intuitive and freer access to English case law has lain stagnant for at least five years. It is true that the United Kingdom has BAILII and nothing that follows in this series of blog posts is intended to take anything away from how important BAILII is or how successful it has been in opening access to the decisions of judges. However, BAILII (through no fault of its own) has been unable to keep pace with the levels of really positive innovation I've observed in similar projects taking place outside the UK (notably BAILII's Canadian equivalent, CanLII, and the US freemium/premium case law platform, CaseText).

Open access to case law in the United Kingdom suffers from the following weaknesses (this list is by no means exhaustive):

  1. Gaps in coverage: there are too many gaps in the legacy case law archive and there are too many gaps in ongoing coverage of new judgments, especially those that are given extempore. There is still a vast amount of retrospective and prospective material that can only be accessed via paid subscription services.
  2. User-friendliness: BAILII is simple enough to use if you're used to researching the law online, but there is a considerable amount that could be done to improve the service for the benefit of lay users. 
  3. Sustainability: plenty of people use BAILII, but very few of them make donations to help BAILII raise enough financial resource to pursue product development projects.
  4. No platform for experimentation or third-party development: unlike CanLII, BAILII doesn't have a public API. Third-party innovation has stalled because it is incredibly difficult to acquire access to the text of the cases.

The weaknesses I've set out above are a function of the following broader problems (again, this list isn't exhaustive):

  1. The supply chain that takes a judgment (whether handed down or given extempore) to the wider public is messy and poorly understood by the Ministry of Justice (which is worrying, because they control that supply chain).
  2. The position on intellectual property rights over the judgments themselves is needlessly uncertain.
  3. There is no solid model for translating the way the common law works to the sort of open case law system we need.
  4. BAILII, in several key ways, itself acts like a publisher of proprietary content.

This post is a "primer" for a series of blog posts I'm writing on the subject in the run-up to a talk I'll be giving at the annual conference of the British and Irish Association of Law Librarians in June 2018.

Sentiment in Case Law

For the past few months, I've been exploring various methods of unlocking interesting data from case law. This post discusses the ways in which data science techniques can be used to analyse the sentiment of the text of judgments.

The focus in this post is mainly technical and describes the steps I've taken using a statistical programming language, called R, to extract an "emotional" profile of three cases.

I have yet to arrive at any firm hypothesis of how this sort of technique can be used to draw conclusions that would necessarily be of use in court or during case preparation, but I hope some of the initial results canvassed below are of interest to a few.


TidyText is an incredibly effective and approachable package in R for text mining that I stumbled across when flicking through some of the rstudio::conf 2017 materials a few days ago.

There's loads of information available about the TidyText package, along with its underlying philosophy, but this post focuses on an implementation of one small aspect of the package's power: the ability to analyse and plot the sentiment of words in documents.

MY TEST DATA

I'm using a small dataset for this walkthrough that consists of three court judgments: two judgments of the UK Supreme Court and one from its predecessor, the House of Lords.

The subject matter of the dataset isn't really that important. My purpose here is to use the tools TidyText makes available to chart the emotional attributes of the words used in these judgments over the course of each document. 

GET THE DATA READY FOR ANALYSIS

First off, we need to get our environment ready and set our working directory:

# Environment
library(tidyverse)
library(tidytext)
library(stringr)

# Data - setwd to the folder that contains the data you want to work with
setwd("~/Documents/R/TidyText")

Next, we're going to get the text of the files (in my case, the three judgments) into a data frame:

# tibble() replaces the deprecated data_frame(); read_lines() returns one element per line
case_words <- tibble(file = c("evans.txt", "miller.txt", "donoghue.txt")) %>%
  mutate(text = map(file, read_lines))

This gives us a tibble with one row per file: a file column holding the name of the file and a text list-column holding the lines of that file. We now need to unnest that tibble so that each line of text gets its own row, alongside the document name and a line number.

case_words <- case_words %>%
  unnest(text) %>%
  mutate(line_number = 1:n(),
         file = basename(file))
# Order the files as factor levels; the names must match the values in `file` exactly
case_words$file <- forcats::fct_relevel(case_words$file, c("evans.txt", "miller.txt", "donoghue.txt"))
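For readers more at home in Python, the reshaping step can be sketched with the standard library alone. This is only an illustrative equivalent, not part of the R workflow: the function name `lines_to_rows` and the in-memory `documents` dictionary (standing in for the text files) are assumptions made for the example.

```python
def lines_to_rows(documents):
    """Flatten {file name: text} into one row per line of text.

    Line numbers run continuously across all files, mirroring
    `mutate(line_number = 1:n())` in the R code above.
    """
    rows = []
    line_number = 0
    for file_name, text in documents.items():
        for line in text.splitlines():
            line_number += 1
            rows.append({"file": file_name, "line_number": line_number, "text": line})
    return rows


# Hypothetical placeholder text standing in for the contents of two files
documents = {
    "evans.txt": "first line of the first file\nsecond line of the first file",
    "miller.txt": "first line of the second file\nsecond line of the second file",
}

rows = lines_to_rows(documents)
for row in rows:
    print(row["file"], row["line_number"], row["text"])
```

Note that, as in the R version, the numbering does not restart for each file, so a row's line number reflects its position in the combined dataset rather than in its own document.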

Once the text has also been tokenised (the final preparation step, shown further down), we get a tibble that looks like this:

# A tibble: 96,318 × 3
        file line_number      word
      <fctr>       <int>     <chr>
1  evans.txt           1      lord
2  evans.txt           1 neuberger
3  evans.txt           1      with
4  evans.txt           1      whom
5  evans.txt           1      lord
6  evans.txt           1      kerr
7  evans.txt           1       and
8  evans.txt           1      lord
9  evans.txt           1      reed
10 evans.txt           1     agree
# ... with 96,308 more rows

You can check the state of your table at this point by running:

head(case_words)

The last thing we need to do before we're ready to begin computing the sentiment of the data is to tokenise the words in our tibble:

case_words <- case_words %>%
  unnest_tokens(word, text)
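To make the tokenisation step concrete, here is a minimal Python sketch of the same idea: each line of text is lower-cased and split into words, and every word becomes its own row while keeping the file name and line number. The function name `tokenise_rows` and the simple regular expression are assumptions for illustration; the real `unnest_tokens()` uses a more careful tokeniser.

```python
import re


def tokenise_rows(rows):
    """Return one row per word, keeping the file and line number of each word."""
    tokens = []
    for row in rows:
        # Lower-case the text and keep runs of letters, digits and apostrophes
        for word in re.findall(r"[a-z0-9']+", row["text"].lower()):
            tokens.append({
                "file": row["file"],
                "line_number": row["line_number"],
                "word": word,
            })
    return tokens


# A single input row, based on the opening words shown in the tibble above
rows = [{
    "file": "evans.txt",
    "line_number": 1,
    "text": "Lord Neuberger (with whom Lord Kerr and Lord Reed agree)",
}]

tokens = tokenise_rows(rows)
words = [token["word"] for token in tokens]
print(words)
```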

SENTIMENT ANALYSIS OF THE JUDGMENTS

We are ready to start analysing the sentiment of the data. TidyText is armed with three different sentiment dictionaries: afinn, nrc and Bing. The first thing we're going to do is get a bird's-eye view of the different sentiment profiles of each judgment using the nrc dictionary and plot the results using ggplot:

case_words %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(index = line_number %/% 20, file, sentiment) %>%
  summarize(n = n()) %>%
  ggplot(aes(x = index, y = n, fill = file)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  facet_wrap(~ sentiment, ncol = 3)

A bird's-eye view of the ten emotional profiles of each judgment

The x-axis of each graph represents the position within each document from beginning to end; the y-axis shows the number of words in each 20-line chunk that carry the sentiment in question.
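The counting behind these charts can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `TOY_LEXICON` is a made-up miniature dictionary (not the real nrc data), and `sentiment_counts` is a hypothetical helper. It reproduces the two key moves in the R pipeline: integer division of the line number by 20 to form chunks, and dropping words that are missing from the lexicon (the effect of the inner join).

```python
from collections import Counter

CHUNK_SIZE = 20  # lines per chunk, as in `line_number %/% 20`

# Invented miniature lexicon: word -> sentiment categories (NOT the real nrc data)
TOY_LEXICON = {
    "guilty": ["anger", "negative", "sadness"],
    "justice": ["positive", "trust"],
}


def sentiment_counts(tokens, lexicon, chunk_size=CHUNK_SIZE):
    """Count words per (chunk index, file, sentiment), ignoring unknown words."""
    counts = Counter()
    for token in tokens:
        index = token["line_number"] // chunk_size
        for sentiment in lexicon.get(token["word"], []):
            counts[(index, token["file"], sentiment)] += 1
    return counts


# Hypothetical tokens: lines 3 and 19 fall in chunk 0, lines 20 and 21 in chunk 1
tokens = [
    {"file": "evans.txt", "line_number": 3, "word": "justice"},
    {"file": "evans.txt", "line_number": 19, "word": "guilty"},
    {"file": "evans.txt", "line_number": 20, "word": "justice"},
    {"file": "evans.txt", "line_number": 21, "word": "the"},
]

counts = sentiment_counts(tokens, TOY_LEXICON)
for key in sorted(counts):
    print(key, counts[key])
```

Each key in `counts` corresponds to one bar in one facet of the chart: the chunk index gives the x position, the count gives the bar height and the sentiment picks the facet.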

We can get a closer look at the emotional fluctuation by plotting an analysis using the afinn and Bing dictionaries:

case_words %>%
  left_join(get_sentiments("bing"), by = "word") %>%
  left_join(get_sentiments("afinn"), by = "word") %>%
  group_by(index = line_number %/% 20, file) %>%
  # Note: the afinn column is called `score` in older tidytext releases and `value` in newer ones
  summarize(afinn = mean(score, na.rm = TRUE),
            bing = sum(sentiment == "positive", na.rm = TRUE) - sum(sentiment == "negative", na.rm = TRUE)) %>%
  gather(lexicon, lexicon_score, afinn, bing) %>%
  ggplot(aes(x = index, y = lexicon_score, colour = file)) +
  geom_smooth(stat = "identity") +
  facet_wrap(~ lexicon, scales = "free_y") +
  scale_x_continuous("Location in judgment", breaks = NULL) +
  scale_y_continuous("Lexicon Score")

Sentiment curves using afinn and bing sentiment dictionaries
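The two scores plotted here are computed differently, which a small Python sketch may help to show. Everything in it is illustrative: `TOY_BING` and `TOY_AFINN` are invented stand-ins for the real dictionaries, and `chunk_scores` is a hypothetical helper. For each 20-line chunk, the afinn score is the mean of the numeric scores of the matched words, while the Bing score is the number of positive words minus the number of negative words.

```python
from collections import defaultdict
from statistics import mean

# Invented stand-ins for the real dictionaries (values are made up)
TOY_BING = {"fair": "positive", "unlawful": "negative", "breach": "negative"}
TOY_AFINN = {"fair": 2, "unlawful": -2, "breach": -1}


def chunk_scores(tokens, chunk_size=20):
    """Return {(chunk index, file): {"afinn": mean score, "bing": net count}}."""
    grouped = defaultdict(list)
    for token in tokens:
        key = (token["line_number"] // chunk_size, token["file"])
        grouped[key].append(token["word"])

    scores = {}
    for key, words in grouped.items():
        afinn_values = [TOY_AFINN[w] for w in words if w in TOY_AFINN]
        positives = sum(TOY_BING.get(w) == "positive" for w in words)
        negatives = sum(TOY_BING.get(w) == "negative" for w in words)
        scores[key] = {
            # None when no word in the chunk has an afinn score
            "afinn": mean(afinn_values) if afinn_values else None,
            "bing": positives - negatives,
        }
    return scores


# Hypothetical tokens spread over two chunks of one file
tokens = [
    {"file": "miller.txt", "line_number": 1, "word": "fair"},
    {"file": "miller.txt", "line_number": 2, "word": "fair"},
    {"file": "miller.txt", "line_number": 5, "word": "breach"},
    {"file": "miller.txt", "line_number": 7, "word": "the"},
    {"file": "miller.txt", "line_number": 25, "word": "unlawful"},
    {"file": "miller.txt", "line_number": 30, "word": "breach"},
]

scores = chunk_scores(tokens)
for key in sorted(scores):
    print(key, scores[key])
```

Because the afinn figure is an average and the Bing figure is a net count, the two curves sit on different scales, which is why the R plot frees the y-axis for each facet.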

FINDINGS

The Bing analysis (pictured right) appears to provide a slightly more stable view. Values above the zero line indicate positive emotion; values below it indicate negative emotion.

For example, if we take the judgment in R (Miller), we can see that the first half of the judgment is broadly positive, but the curve then dips suddenly around the middle of the judgment. The line indicates that the second half of the judgment text is slightly more negative than the first half, but rises to its peak of positivity just before the end of the text.

The text of the judgment in Donoghue is considerably more negative. The curve sharply dips as the judgment opens, turns more positive towards the middle of the document, takes a negative turn once more and resolves to a more positive state towards the conclusion.

Legislation.gov.uk Statute Scraper

The following Python script can be used to scrape the full text of Public General Acts from legislation.gov.uk.

Prerequisites

The script takes a list of URLs to individual pieces of legislation from a text file. The script processes each URL one by one. The text file needs to look something like this, with each target URL on a new line:

url.txt

http://www.legislation.gov.uk/ukpga/2016/4 
http://www.legislation.gov.uk/ukpga/2016/3 
http://www.legislation.gov.uk/ukpga/2016/2 
http://www.legislation.gov.uk/ukpga/2016/1 
http://www.legislation.gov.uk/ukpga/2017/1 
http://www.legislation.gov.uk/ukpga/2017/2
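If you would rather generate the list than type it out, the URLs follow a regular pattern of year and chapter number, so a short Python sketch can build them. The helper name `build_urls` and the `acts` list are assumptions for illustration; you would need to supply the year and number pairs you actually want.

```python
BASE_URL = "http://www.legislation.gov.uk/ukpga"


def build_urls(acts):
    """Build one URL per (year, chapter number) pair."""
    return ["{}/{}/{}".format(BASE_URL, year, number) for year, number in acts]


# Example pairs taken from the sample list above
acts = [(2016, 4), (2016, 3), (2017, 1)]

urls = build_urls(acts)
url_file_contents = "\n".join(urls)
print(url_file_contents)
```

Writing `url_file_contents` to a text file gives an input in the one-URL-per-line format the script expects.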

The Python script is simple enough. 

  • First, it opens url.txt and reads the target URLs line by line
  • For each target URL, the title of the legislation is captured (this is used to name the output files)
  • Each URL is cycled through sequentially and the contents of the relevant part of the HTML markup are scraped
  • The scraped material is written to a text file and a prettified HTML file.

Scraper.py

# Environment (Python 3)

import io
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Get the text file with the URLs to be scraped and scrape the target section in each page

print("\n\nScraping URLs in url.txt...\n\n")

with open('url.txt') as inf:

    # Get each url on each line in url.txt, skipping any blank lines
    urls = (line.strip() for line in inf if line.strip())
    for url in urls:
        site = urlopen(url)
        soup = BeautifulSoup(site, "lxml")

        # Scrape the name of the legislation in each target url for use when saving the output to a file

        for legName in soup.find_all("h1", {"class": "pageTitle"}):

            actTitle = legName.text.strip()

            print('Scraping ' + actTitle + ' ...\n')

        # Scrape stuff in <div id="viewLegContents"></div>

        for item in soup.find_all("div", {"id": "viewLegContents"}):

            # Write what we've scraped, with UTF-8 encoding, as text to a new text file - one file per url

            with io.open(actTitle + '.txt', 'w', encoding='utf-8') as g:
                g.write(item.text)

            # Write what we've scraped to an html file - one file per url

            with io.open(actTitle + '.html', 'w', encoding='utf-8') as g:
                g.write(item.prettify())

print("\n\nDone! Files created.\n")

Respect the source of the data you're scraping

The people behind legislation.gov.uk have done everyone a big favour in making their information so accessible. The least we can do is to be respectful of their servers when performing scraping tasks like this. If you plan on running this, I'd strongly urge you to break your url.txt input into small chunks and, if you go on to reuse the data, remember to acknowledge the source of that data.
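One way to put that into practice is to work through the URLs in small batches and pause between requests. The sketch below is a suggestion rather than part of the script above: the names `chunked` and `scrape_politely`, the batch size and the pause lengths are all assumptions, and the `fetch` argument stands in for whatever function actually downloads and processes a page.

```python
import time


def chunked(items, size):
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def scrape_politely(urls, fetch, chunk_size=5, request_pause=1.0,
                    batch_pause=10.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing after every request and every batch."""
    results = {}
    for batch in chunked(list(urls), chunk_size):
        for url in batch:
            results[url] = fetch(url)
            sleep(request_pause)  # short pause after each request
        sleep(batch_pause)  # longer pause after each batch
    return results


# Demonstration with a stand-in fetch function and no real waiting
demo_urls = [
    "http://www.legislation.gov.uk/ukpga/2016/4",
    "http://www.legislation.gov.uk/ukpga/2016/3",
    "http://www.legislation.gov.uk/ukpga/2016/2",
]
pauses = []
demo_results = scrape_politely(
    demo_urls,
    fetch=lambda url: "scraped " + url,  # hypothetical placeholder for real scraping
    chunk_size=2,
    sleep=pauses.append,  # record the pauses instead of sleeping
)
print(pauses)
```

Passing the `sleep` function in as an argument keeps the demonstration instant; in real use you would leave it at its default so the script genuinely waits.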