Marking up references to UK legislation in unstructured text

This is a quick example of a reliable (but not a particularly extensible) implementation for marking up references to UK primary legislation in an unstructured input text.

The implementation is reliable, because it uses a list of statute short titles derived from legislation.gov.uk and passes these into the source text as straightforward regular expression search patterns. This, however, makes the implementation a bit of a bore to easily extend, because the code does not account for new pieces of legislation added to the statute book since the code was written!

Set up enironment

import bs4 as BeautifulSoup
import urllib3
import pandas as pd
import csv
import re

http = urllib3.PoolManager()

Core objects and URL creation

Create objects that will be used throughout the script and build a list of target legislation.gov.uk URLs.

TARGETS = []
STATUTES = []

#This is a little bit hacky. I'm essentially running an empty search on
#legistlation.gov.uk and iterating over each page in the results.
scope = range(1,289)
base_url = "http://www.legislation.gov.uk/primary?"

for year in scope:
    target_url = base_url + "page=" + str(year)
    TARGETS.append(target_url )

Perform the scrape

Scrape each target, pulling in the required text content from the legislation.gov.uk results table.

for target in TARGETS:
    response = http.request('GET', target)
    soup = BeautifulSoup.BeautifulSoup(response.data, "html.parser")

    td = soup.find_all('td')
    for i in td:

        children = i.findChildren("a" , recursive=True)
        for child in children:
            statute_name = child.text
            STATUTES.append(statute_name)
STATUTES[:5]

Output >>>

['Domestic Gas and Electricity (Tariff Cap) Act 2018',
 '2018\xa0c. 21',
 'Northern Ireland Budget Act 2018',
 '2018\xa0c. 20',
 'Haulage Permits and Trailer Registration Act 2018']

Clean the captured data and store it

The scrape pulls in unwanted material in the form of chapter numbers owing to lack of precision in the source markup. Unwanted captures are dropped using a regular expression and the data is stored in a pd.DataFrame, df.

df = pd.DataFrame()
df['Statute_Name'] = STATUTES
df = df[df['Statute_Name'].str.contains('\d{4}\s+([a-z]|[A-Z])') == False]
df.to_csv('statutes.csv')
df.head()

Statute_Name
0 Domestic Gas and Electricity (Tariff Cap) Act ...
2 Northern Ireland Budget Act 2018
4 Haulage Permits and Trailer Registration Act 2018
6 Automated and Electric Vehicles Act 2018
8 Supply and Appropriation (Main Estimates) Act ...

Sample text to apply the legislation extractor against

text_block = """

Section 101 of the Criminal Justice Act 2003 is interesting, much like section 3 of the Fraud Act 2006.
The Police and Criminal Evidence Act 1984 is also a real favourite of mine.

"""

Get matching statutues

To identify matching statutes, the list of statutes created from the scrape is interated over. The name of each statute form the basis of the expression, with matches stored in a list, MATCHES.

MATCHES = []

for statute in df['Statute_Name']:
    my_regex = re.escape(statute)
    match = re.search(my_regex, text_block)

    if match is not None:
        MATCHES.append(match[0])
        print (match[0])
Fraud Act 2006
Criminal Justice Act 2003
Police and Criminal Evidence Act 1984

Markup the matched statutes in the source text

The aim here is to enclose the captured statutes in <statute> tags. To do this, we need to make multiple substitutions in a single string on a single pass.

The first step is to cast the matches into a dictionary object, d, where the key is the match and the value is the replacement string.

d = {}

for match in MATCHES:
    opener = '<statute>'
    closer = '</statute>'
    replacement = opener + match + closer
    d[match] = replacement

print (d)
{'Fraud Act 2006': '<statute>Fraud Act 2006</statute>', 'Criminal Justice Act 2003': '<statute>Criminal Justice Act 2003</statute>', 'Police and Criminal Evidence Act 1984': '<statute>Police and Criminal Evidence Act 1984</statute>'}

The single pass substitution is handled in the following function, replace(). replace() takes two arguments: the source text and the dictionary of substitutions.

def replace(string, substitutions):

    substrings = sorted(substitutions, key=len, reverse=True)
    regex = re.compile('|'.join(map(re.escape, substrings)))
    return regex.sub(lambda match: substitutions[match.group(0)], string)

output = replace(text_block, d)
str(output).strip('\n')
'Section 101 of the <statute>Criminal Justice Act 2003</statute> is interesting, much like section 3 of the <statute>Fraud Act 2006</statute>.\nThe <statute>Police and Criminal Evidence Act 1984</statute> is also a real favourite of mine.'