Marking up references to UK legislation in unstructured text

This is a quick example of a reliable (but not a particularly extensible) implementation for marking up references to UK primary legislation in an unstructured input text.

The implementation is reliable because it uses a list of statute short titles derived from the scrape described below, passing these into the source text as straightforward regular expression search patterns. This, however, makes the implementation a bit of a bore to extend, because the code does not account for new pieces of legislation added to the statute book since the code was written!
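The end-to-end idea can be previewed with a minimal sketch. The statute list here is a hard-coded stand-in for the scraped list built later in the post:

```python
import re

# A hard-coded stand-in for the scraped list of statute short titles
statutes = ["Fraud Act 2006", "Criminal Justice Act 2003"]

# Compile the short titles into a single alternation pattern
pattern = re.compile("|".join(re.escape(s) for s in statutes))

def mark_up(text):
    # Wrap each matched short title in <statute> tags
    return pattern.sub(lambda m: "<statute>" + m.group(0) + "</statute>", text)

print(mark_up("See section 3 of the Fraud Act 2006."))
# → See section 3 of the <statute>Fraud Act 2006</statute>.
```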

Set up environment

import bs4 as BeautifulSoup
import urllib3
import pandas as pd
import csv
import re

http = urllib3.PoolManager()

Core objects and URL creation

Create objects that will be used throughout the script and build a list of target URLs.


# This is a little bit hacky. I'm essentially running an empty search and iterating over each page in the results.
TARGETS = []
STATUTES = []

scope = range(1, 289)
base_url = ""

for page in scope:
    target_url = base_url + "page=" + str(page)
    TARGETS.append(target_url)

Perform the scrape

Scrape each target, pulling in the required text content from the results table.

for target in TARGETS:
    response = http.request('GET', target)
    soup = BeautifulSoup.BeautifulSoup(response.data, "html.parser")

    td = soup.find_all('td')
    for i in td:
        children = i.findChildren("a", recursive=True)
        for child in children:
            statute_name = child.text
            STATUTES.append(statute_name)

Output >>>

['Domestic Gas and Electricity (Tariff Cap) Act 2018',
 '2018\xa0c. 21',
 'Northern Ireland Budget Act 2018',
 '2018\xa0c. 20',
 'Haulage Permits and Trailer Registration Act 2018']

Clean the captured data and store it

The scrape pulls in unwanted material in the form of chapter numbers owing to lack of precision in the source markup. Unwanted captures are dropped using a regular expression and the data is stored in a pd.DataFrame, df.

df = pd.DataFrame()
df['Statute_Name'] = STATUTES
df = df[df['Statute_Name'].str.contains(r'\d{4}\s+[A-Za-z]') == False]

0 Domestic Gas and Electricity (Tariff Cap) Act ...
2 Northern Ireland Budget Act 2018
4 Haulage Permits and Trailer Registration Act 2018
6 Automated and Electric Vehicles Act 2018
8 Supply and Appropriation (Main Estimates) Act ...
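The effect of the filter can be sketched with plain re on a couple of hypothetical captures (note that \xa0 is the non-breaking space visible in the scrape output, and \s matches it):

```python
import re

# Hypothetical captures mixing statute titles and chapter numbers
captures = ['Northern Ireland Budget Act 2018', '2018\xa0c. 20']

# Same idea as the DataFrame filter: drop entries where a four-digit year
# is followed by whitespace and a letter (the 'c. 20' chapter form)
keep = [s for s in captures if not re.search(r'\d{4}\s+[A-Za-z]', s)]
print(keep)
# → ['Northern Ireland Budget Act 2018']
```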

Sample text to apply the legislation extractor against

text_block = """
Section 101 of the Criminal Justice Act 2003 is interesting, much like section 3 of the Fraud Act 2006.
The Police and Criminal Evidence Act 1984 is also a real favourite of mine.
"""


Get matching statutes

To identify matching statutes, the list of statutes created from the scrape is iterated over. The name of each statute forms the basis of the search expression, with matches stored in a list, MATCHES.


MATCHES = []

for statute in df['Statute_Name']:
    my_regex = re.escape(statute)
    match = re.search(my_regex, text_block)

    if match is not None:
        MATCHES.append(match[0])
        print (match[0])
Fraud Act 2006
Criminal Justice Act 2003
Police and Criminal Evidence Act 1984

Markup the matched statutes in the source text

The aim here is to enclose the captured statutes in <statute> tags. To do this, we need to make multiple substitutions in a single string on a single pass.

The first step is to cast the matches into a dictionary object, d, where the key is the match and the value is the replacement string.

d = {}

for match in MATCHES:
    opener = '<statute>'
    closer = '</statute>'
    replacement = opener + match + closer
    d[match] = replacement

print (d)
{'Fraud Act 2006': '<statute>Fraud Act 2006</statute>', 'Criminal Justice Act 2003': '<statute>Criminal Justice Act 2003</statute>', 'Police and Criminal Evidence Act 1984': '<statute>Police and Criminal Evidence Act 1984</statute>'}

The single pass substitution is handled in the following function, replace(). replace() takes two arguments: the source text and the dictionary of substitutions.

def replace(string, substitutions):

    substrings = sorted(substitutions, key=len, reverse=True)
    regex = re.compile('|'.join(map(re.escape, substrings)))
    return regex.sub(lambda match: substitutions[match.group(0)], string)

output = replace(text_block, d)
'Section 101 of the <statute>Criminal Justice Act 2003</statute> is interesting, much like section 3 of the <statute>Fraud Act 2006</statute>.\nThe <statute>Police and Criminal Evidence Act 1984</statute> is also a real favourite of mine.'
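The longest-first sort in replace() matters when one key is a prefix of another. A sketch with a hypothetical overlapping pair (the "(Amendment)" title is invented purely for illustration):

```python
import re

def replace(string, substitutions):
    # Longest keys first, so the alternation prefers the longer match
    substrings = sorted(substitutions, key=len, reverse=True)
    regex = re.compile('|'.join(map(re.escape, substrings)))
    return regex.sub(lambda match: substitutions[match.group(0)], string)

# Hypothetical overlapping keys: without the longest-first sort, the shorter
# alternative could win and leave ' (Amendment)' unmarked
subs = {
    'Fraud Act 2006': '<statute>Fraud Act 2006</statute>',
    'Fraud Act 2006 (Amendment)': '<statute>Fraud Act 2006 (Amendment)</statute>',
}
print(replace('See the Fraud Act 2006 (Amendment).', subs))
# → See the <statute>Fraud Act 2006 (Amendment)</statute>.
```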

Read a file

Reading the contents of a file in Python is straightforward and there are a couple of nice methods that cater for different use cases.


Suppose we want to read a file called my_text.txt. First, we open the file:

f = open('my_text.txt', 'r')

We now have the file as an object, f


For most use cases, it's enough to simply read the entire contents of the file into a string. We can do this by using Python's read() method. 

content = f.read()
print (content)


Sometimes, you're going to want to deal with the file you're working with at line level. Fortunately, Python's readlines() method is available: it stores each line of the file as an item in a list.

content = f.readlines()


There may be times where you want to read a specific line in the file, which is what the readline() method can be used for.

To access the first line in the file:

content = f.readline()

Each call to readline() returns the next line, so calling it a second time returns the second line. (Note that the optional argument to readline() is a size hint, not a line number.)

content = f.readline()
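A quick sketch of that sequential behaviour, using a throwaway temporary file:

```python
import os
import tempfile

# Write a two-line temporary file to demonstrate successive readline() calls
path = os.path.join(tempfile.mkdtemp(), 'my_text.txt')
with open(path, 'w') as out:
    out.write('first line\nsecond line\n')

f = open(path, 'r')
first = f.readline()   # returns 'first line\n'
second = f.readline()  # the next call returns 'second line\n'
f.close()
```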

Connect to FTP site with Python

I was recently having a complete nightmare connecting to an FTP site using the FileZilla client and wanted to write a quick Python script to test the connection myself.

Thanks to the ftplib module that comes with Python, a simple test was possible in only four lines of code.

Here's an example:

from ftplib import FTP

ftp = FTP('')
ftp.login(user='username', passwd='password')

# Get a listing of the FTP site's root directory to test the connection works
ftp.retrlines('LIST')

The console will then print the structure of the FTP site's root folder.


[StackOverflow] how can I generate bigrams for words using NLTK python library?

A question popped up on Stack Overflow today asking how to use the NLTK library to tokenise text into bigrams. The question was as follows:

Suppose I want to generate bigrams for the word single. Then the output should be a list ['si','in','ng','gl','le'].

I am new to language processing in python. Can someone guide me?

Tokenising text into n-grams using NLTK is pretty well documented and a whole raft of similar questions can be found on Stack Overflow. However, I think the question was marked as a duplicate a tad too hastily.

Virtually all of the answers to n-gram related questions are directed at tokenising a string consisting of multiple words, e.g.:

myString = "This is a string with nine words in it"

The string in the question consisted of only one word. The question was really about producing bigrams from the characters that make up a single word, which is a bit different. 

Here's one (not necessarily elegant) answer to the question:

import nltk

myString = 'single'

# Insert a space in between each character in myString

spaced = ''
for ch in myString:
    spaced = spaced + ch + ' '

# Generate bigrams out of the new spaced string

tokenized = spaced.split(" ")
myList = list(nltk.bigrams(tokenized))

# Join the items in each tuple in myList together and put them in a new list

Bigrams = []

for i in myList:
    Bigrams.append((''.join([w + ' ' for w in i])).strip())

print (Bigrams)

This will output:

['s i', 'i n', 'n g', 'g l', 'l e', 'e']
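For what it's worth, the asker's exact expected output (without the trailing 'e') can be produced in plain Python by pairing the string with itself offset by one, a sketch that sidesteps NLTK entirely:

```python
# Pair each character with its successor; zip stops at the shorter
# sequence, so there is no stray trailing item
myString = 'single'
bigrams = [a + b for a, b in zip(myString, myString[1:])]
print(bigrams)
# → ['si', 'in', 'ng', 'gl', 'le']
```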

Loading own text data into Scikit

A quick note on how to load a custom text data set into Scikit-Learn. 

import sklearn
from sklearn import datasets
from pprint import pprint 

docs_to_train = sklearn.datasets.load_files("path/to/docs/to/train", description=None, categories=None, load_content=True, shuffle=True, encoding='utf-8', decode_error='strict', random_state=0)
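load_files treats each subfolder of the container directory as a category label and each file within it as a document. The expected layout looks something like this (folder and file names here are hypothetical):

```
path/to/docs/to/train/
    category_one/
        doc_1.txt
        doc_2.txt
    category_two/
        doc_3.txt
```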


Some useful links:


Converting XML to CSV

XMLutils is a neat little Python package for converting XML to various file formats like CSV and JSON. The particularly useful thing about it is that it can be executed from the command line, which makes it quick and easy to start using.


I installed XMLutils at the command line using:

sudo easy_install XMLutils

Using XMLutils at the command line

I had a sample XML file that I wanted to convert to CSV format. There was a lot in the XML file that I didn't really want going into the output CSV file, so I executed the following command:

$ xml2csv --input "/Users/danielhoadley/Library/Mobile Documents/com~apple~CloudDocs/Documents/Development/Stuff/Dockets/10519[DK]Davies_v_Davies_(msb)(sub-bpw).xml" --output "test.csv" --tag "CaseInfo" --ignore "CaseMain" "AllNCit" "AllECLI" "TempIxCardNo" "FullReportName" "AltName" "CaseJoint_IxCardNo_TempIxCardNo_FullReportName_CaseName" "LegalTopics" "Reportability"
  • xml2csv invokes the converter
  • Declare the input XML file with --input followed by the path to the file
  • Declare the output CSV file with --output followed by the path to output file
  • Declare the XML node that represents a record in the input file with --tag followed by the node name (in my case, the node was CaseInfo)

Running a command like this will be sufficient to do a straight conversion to CSV:

$ xml2csv --input "/Users/danielhoadley/Library/Mobile Documents/com~apple~CloudDocs/Documents/Development/Stuff/Dockets/10519[DK]Davies_v_Davies_(msb)(sub-bpw).xml" --output "test.csv" --tag "CaseInfo"

However, as I've said, there was quite a lot in the input file that I wanted to ignore. Ignoring tags is pretty straightforward: simply declare the tags you want to ignore after the --ignore flag, e.g:

--ignore "CaseMain" "AllNCit" "AllECLI" "TempIxCardNo" "FullReportName" "AltName" "CaseJoint_IxCardNo_TempIxCardNo_FullReportName_CaseName" "LegalTopics" "Reportability"


Remember to enclose the names of tags in quotes!

Convert text files to lower case in Python

When mining text it often helps to convert the entire text to lower case as part of your pre-processing stage. Here's an example of a Python script that does just that with a directory of files consisting of one or many text files:

import os

directory = '/path/to/the/directory'

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        path = os.path.join(directory, filename)
        with open(path, 'r') as f:
            text = f.read()
        with open(path, 'w') as out:
            out.write(text.lower())

Iterating over files with Python

A short block of code to demonstrate how to iterate over files in a directory and do some action with them.

import os
directory = 'the/directory/you/want/to/use'

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        f = open(filename)
        lines = f.readlines()
        print (lines[10])

For example,

import gensim
import os
import logging
from gensim.summarization import summarize

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 

directory = 'the/directory/you/want/to/use'

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        f = open(filename, 'r')
        text = f.read()
        print (summarize(text, word_count=20))

Note: something I've found with this is that you need to run the program from the directory defined in the directory variable (which is something I need to fix).
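The fix alluded to in that note is to join the directory onto each filename with os.path.join, so the script no longer depends on the current working directory. A self-contained sketch using a temporary directory:

```python
import os
import tempfile

# Build a throwaway directory containing one .txt file
directory = tempfile.mkdtemp()
with open(os.path.join(directory, 'example.txt'), 'w') as out:
    out.write('some text\n')

# os.path.join makes the open() call independent of the working directory
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        with open(os.path.join(directory, filename)) as f:
            contents = f.read()
print(contents)
```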

Overriding SSL verification in Python Launcher

I was running into a frustrating issue when trying to download a bunch of nltk test corpora in Python Launcher: the launcher kept saying it could not verify my SSL certificate, which meant that I was unable to download the materials I wanted.

It turns out that the way around this problem is as follows:

1. Create an unverified SSL context

>>> import ssl
>>> ssl._create_default_https_context = ssl._create_unverified_context

2. Then run the download

>>> import nltk
>>> nltk.download()

3. Download the packages in Launcher



Text Summarisation with Gensim

A quick rundown of summarising texts with Gensim in Python3. 

Ensure the gensim module is installed. Here's the code to summarise a single text file:

from gensim.summarization import summarize 
import logging 

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 

f = open("telegraph.txt","r") 
text = f.read()
print(summarize(text, word_count=100)) 
print(summarize(text, ratio=0.5))

Test output for Donoghue v Stevenson with the output summary constrained to 100 words (I personally think the summariser has done an excellent job: it's calculated that the final paragraph of Lord Atkin's speech provides the best summary of the judgment!):

My Lords, if your Lordships accept the  view that this pleading discloses a relevant cause of action you will be affirming the proposition that by Scots and English law alike a manufacturer of products, which he sells in such a form as to show that he intends them to reach the ultimate consumer in the form in which they left him with no reasonable possibility of intermediate examination, and with the knowledge that the absence of reasonable care in the preparation or putting up of the products will result in an injury to the consumer's life or property, owes a duty to the consumer to take that reasonable care.


I chained this summary into RAKE to run a quick keyword extraction over the summary. The RAKE parameters were as follows:

rake_object = rake.Rake("smartstoplist.txt", 5, 3, 4)

The output was a spot on extraction:

[('reasonable care', 4.0), ('consumer', 1.3333333333333333), ('products', 1.0)]

Newspaper on Python

Newspaper is an excellent Python module used for extracting and parsing newspaper articles. I took the module for a very quick test drive today and wanted to document my initial findings, primarily as an aide-mémoire.

Import Newspaper

Assuming Newspaper is installed as a Python module (in my case I'm using Newspaper3k on Python3), start off by importing the module:

import newspaper

Set the target paper

In my test, I wanted to look at articles published in the Law section of the Guardian. The first step was to build the newspaper object, like so:

law = newspaper.build('')

To check the target was working, I called law.size(), which gave an output of 438 articles.

Extract an article

For test purposes, I just wanted to extract a recent article, using the following code (this technically pulls down the second most recent article rather than the first, but somewhat confusingly, the result appears to be the most recent piece anyway!) :

first_article = law.articles[1]
first_article.download()

The first line stores the first article in a variable called first_article. The second line downloads the article stored in that variable. 

Printing the result with print(first_article.html) just spews out the entire HTML to the console, which isn't very helpful. But, the brilliant thing about Newspaper is that it allows us to parse the article and then run some simple natural language processing against it.

Parse the article

Now that we've downloaded the article, we're in a position to parse it:


This in turn enables us to target specific sections of the article, like the body text, title or author. Here's how to scrape body text:
print(first_article.text)


This will print only the body text to the console. 

Write the body text to a file

The body text isn't that helpful to us sitting there in the console output, so let's write the output of first_article.text to a file:

Open a file for writing and write the body text to it:

f = open('article.txt', 'w')
f.write(first_article.text)
f.close()