As someone involved in the ongoing development of an online legal research system (the ICLR's ICLR.3 platform), I spend quite a bit of time thinking about the ways in which unstructured or partially structured legal texts can be enriched and brought to order, either to prepare the text for later processing in a content delivery pipeline or for some other form of data analysis.
More often than not, rendering a text amenable to content delivery or data analysis involves a fair amount of wrangling with the text itself to mark up entities of interest and to apply an overall schematic structure to the document.
Legal publishers such as ICLR, Justis, LexisNexis and Thomson Reuters use industrial-strength proprietary tools and teams of people to wrangle unstructured legal material into a form that can be used in their products and services. However, the pool of individuals and companies interested in leveraging legal texts has exploded well beyond a handful of well-established legal publishers.
In my opinion, the more people playing with legal information and sharing their work the better. So, I've started development on my very first open source project to produce a suite of tools, written in Python, that can be used to perform a wide range of legal text enrichment operations. I call the project Blackstone.
Just started writing my first open source project: a #Python library specifically for parsing and enriching legal texts to prepare them for #datascience or content delivery pipelines. Starting simple and will gradually expand outwards. #legaltech (Daniel Hoadley, @DanHLawReporter, September 12, 2018)
The idea behind Blackstone is relatively simple: it should be easier to perform a standard set of extraction and enrichment tasks without first having to write custom code to get the job done. The objective of the library is to provide a free set of tools that can be used to:
Automatically segment the input text into sentences and mark them up
Identify and mark up references to primary and secondary legislation
Identify and mark up references to case law
Identify and mark up axioms (e.g. where the author of the text postulates that such and such is an "established principle of law", etc.)
Identify other types of entities peculiar to legal writing, such as courts and indictment numbers
Produce document-level metrics, providing an overview of the document's structure, characteristics and content
Generate visualisations
Other stuff I haven't thought of yet
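To make the case law item in the list above a little more concrete, here is a minimal sketch of what "identify and mark up" could look like using only Python's standard library. It is an illustration, not Blackstone's code: the function name `mark_up_citations`, the `<citation>` tag and the short list of court codes are all assumptions chosen for the example, and the pattern only covers a handful of neutral citation formats.

```python
import re

# Illustrative pattern for a few neutral citation formats, e.g.
# "[2017] EWCA Civ 123" or "[2016] EWHC 456 (Ch)".
# The court codes listed here are a small, assumed subset.
NEUTRAL_CITATION = re.compile(
    r"\[(?P<year>\d{4})\]\s+"
    r"(?P<court>UKSC|UKPC|UKHL|EWCA\s+(?:Civ|Crim)|EWHC)\s+"
    r"(?P<number>\d+)"
    r"(?:\s+\((?P<division>[A-Za-z]+)\))?"
)


def mark_up_citations(text: str) -> str:
    """Wrap each matched neutral citation in a (hypothetical) <citation> tag."""
    return NEUTRAL_CITATION.sub(
        lambda match: f"<citation>{match.group(0)}</citation>", text
    )


sample = "As held in [2017] EWCA Civ 123 and [2016] EWHC 456 (Ch), the point is settled."
print(mark_up_citations(sample))
```

A regular expression like this is only a starting point. Law report citations and older, pre-neutral citation styles would need their own patterns or a trained model.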
Crucially, Blackstone is not intended to be a standalone service. Rather, the intention is to provide a suite of ready-baked Python tools that can be used out of the box in other development or data science pipelines.
As an open source library, Blackstone stands on the shoulders of world-class, open Python technologies: spaCy, scikit-learn, BeautifulSoup, pandas, requests and, of course, Python's own standard library. Blackstone couples intuitive high-level abstractions of these underlying technologies with custom-built constructs designed specifically to deal with legal content.
Progress and horizon
The plan is to get an initial beta release out on GitHub and PyPI by the end of September 2018. To date, the following progress has been made:
Function to provide high-level abstraction over spaCy sentence segmentation (testing)
Function to assemble comprehensive list of UK statutes (complete)
Function to detect and mark up primary legislation by reference to short title (complete)
Function to detect and mark up primary legislation by reference to abbreviation (e.g. DPA or DPA 1998) (testing)
Function to resolve oblique references to primary legislation (e.g. the 1998 Act) (developing)
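The three legislation functions listed above (short title, abbreviation and oblique reference) can be sketched together in a few lines of standard-library Python. This is a rough illustration under stated assumptions rather than the code in Blackstone: `SHORT_TITLES` is a tiny hypothetical sample standing in for the comprehensive list of statutes, the names `abbreviate` and `find_legislation` are invented for the example, and the resolution rule (an oblique reference resolves only if exactly one statute from that year has already been mentioned) is one simple heuristic among many.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical sample; a real list of UK statutes would be far longer.
SHORT_TITLES = ["Data Protection Act 1998", "Human Rights Act 1998", "Theft Act 1968"]


def abbreviate(short_title: str) -> str:
    """Build initials from the words before the year, e.g. 'DPA'."""
    words = short_title.split()[:-1]  # drop the trailing year
    return "".join(word[0] for word in words if word[0].isupper())


# Map each abbreviation to every short title that could produce it.
ABBREVIATIONS: Dict[str, List[str]] = {}
for _title in SHORT_TITLES:
    ABBREVIATIONS.setdefault(abbreviate(_title), []).append(_title)

PATTERN = re.compile(
    # 1. Full short title, longest first so longer titles win.
    r"(?P<title>"
    + "|".join(re.escape(t) for t in sorted(SHORT_TITLES, key=len, reverse=True))
    + r")"
    # 2. Abbreviation, optionally followed by a year ("DPA" or "DPA 1998").
    + r"|\b(?P<abbr>"
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")\b(?:\s+(?P<abbr_year>\d{4})\b)?"
    # 3. Oblique reference ("the 1998 Act").
    + r"|\b[Tt]he\s+(?P<oblique_year>\d{4})\s+Act\b"
)


def find_legislation(text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (matched text, resolved short title or None) in document order."""
    seen: List[str] = []  # distinct statutes resolved so far
    results: List[Tuple[str, Optional[str]]] = []
    for match in PATTERN.finditer(text):
        if match.group("title"):
            candidates = [match.group("title")]
        elif match.group("abbr"):
            candidates = ABBREVIATIONS[match.group("abbr")]
            year = match.group("abbr_year")
            if year:
                candidates = [t for t in candidates if t.endswith(year)]
        else:
            # Oblique: only statutes already mentioned, from the stated year.
            year = match.group("oblique_year")
            candidates = [t for t in seen if t.endswith(year)]
        # Resolve only when the reference is unambiguous.
        resolved = candidates[0] if len(candidates) == 1 else None
        if resolved and resolved not in seen:
            seen.append(resolved)
        results.append((match.group(0), resolved))
    return results


text = "The Data Protection Act 1998 applies. Under the DPA, s 7 of the 1998 Act gives a right of access."
for matched, resolved in find_legislation(text):
    print(f"{matched!r} -> {resolved!r}")
```

Returning `None` for an ambiguous reference, rather than guessing, keeps the uncertain cases visible so they can be reviewed or handed to a smarter resolver later.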
Once I've got a baseline level of functionality completed, I'll release the code on GitHub. More updates to follow.
If you'd like to get involved, share an idea or give me some help, drop me a line on Twitter.