Regex

Marking up references to UK legislation in unstructured text

This is a quick example of a reliable (but not a particularly extensible) implementation for marking up references to UK primary legislation in an unstructured input text.

The implementation is reliable because it uses a list of statute short titles derived from legislation.gov.uk and runs each of these against the source text as a straightforward regular expression search pattern. This, however, makes the implementation a bit of a bore to extend, because the code does not account for new pieces of legislation added to the statute book since the code was written!

Set up environment

from bs4 import BeautifulSoup
import urllib3
import pandas as pd
import csv
import re

http = urllib3.PoolManager()

Core objects and URL creation

Create objects that will be used throughout the script and build a list of target legislation.gov.uk URLs.

TARGETS = []
STATUTES = []

# This is a little bit hacky. I'm essentially running an empty search on
# legislation.gov.uk and iterating over each page in the results (see the
# sketch after this code block for one way to avoid hard-coding the page count).
scope = range(1, 289)
base_url = "http://www.legislation.gov.uk/primary?"

for page in scope:
    target_url = base_url + "page=" + str(page)
    TARGETS.append(target_url)
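
As a rough sketch of how the hard-coded page range might be avoided, the highest page number could be read out of the pagination links on the first results page and used to build scope. The assumption that those links carry a page= query parameter (and the helper names below) are mine and would need checking against the live site.

# Sketch: derive the page count from the pagination links on the first results
# page instead of hard-coding it. Assumes the links expose a "page=" query
# parameter, which has not been verified against the live markup.
first_page = http.request('GET', base_url + "page=1")
first_soup = BeautifulSoup(first_page.data, "html.parser")

page_numbers = []
for link in first_soup.find_all('a', href=True):
    page_match = re.search(r'page=(\d+)', link['href'])
    if page_match:
        page_numbers.append(int(page_match.group(1)))

if page_numbers:
    # This could then replace the hard-coded range(1, 289) above.
    scope = range(1, max(page_numbers) + 1)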

Perform the scrape

Scrape each target, pulling in the required text content from the legislation.gov.uk results table.

for target in TARGETS:
    response = http.request('GET', target)
    soup = BeautifulSoup(response.data, "html.parser")

    td = soup.find_all('td')
    for i in td:

        children = i.find_all("a", recursive=True)
        for child in children:
            statute_name = child.text
            STATUTES.append(statute_name)
STATUTES[:5]

Output >>>

['Domestic Gas and Electricity (Tariff Cap) Act 2018',
 '2018\xa0c. 21',
 'Northern Ireland Budget Act 2018',
 '2018\xa0c. 20',
 'Haulage Permits and Trailer Registration Act 2018']

Clean the captured data and store it

The scrape pulls in unwanted material in the form of chapter numbers, owing to a lack of precision in the source markup. Unwanted captures are dropped using a regular expression and the data is stored in a pd.DataFrame, df.

df = pd.DataFrame()
df['Statute_Name'] = STATUTES
df = df[df['Statute_Name'].str.contains(r'\d{4}\s+[A-Za-z]') == False]
df.to_csv('statutes.csv')
df.head()

Statute_Name
0 Domestic Gas and Electricity (Tariff Cap) Act ...
2 Northern Ireland Budget Act 2018
4 Haulage Permits and Trailer Registration Act 2018
6 Automated and Electric Vehicles Act 2018
8 Supply and Appropriation (Main Estimates) Act ...

Sample text to apply the legislation extractor against

text_block = """

Section 101 of the Criminal Justice Act 2003 is interesting, much like section 3 of the Fraud Act 2006.
The Police and Criminal Evidence Act 1984 is also a real favourite of mine.

"""

Get matching statutes

To identify matching statutes, the list of statutes created from the scrape is iterated over. The name of each statute forms the basis of the search expression, with matches stored in a list, MATCHES.

MATCHES = []

for statute in df['Statute_Name']:
    my_regex = re.escape(statute)
    match = re.search(my_regex, text_block)

    if match is not None:
        MATCHES.append(match[0])
        print (match[0])
Fraud Act 2006
Criminal Justice Act 2003
Police and Criminal Evidence Act 1984

Mark up the matched statutes in the source text

The aim here is to enclose the captured statutes in <statute> tags. To do this, we need to make multiple substitutions in a single string on a single pass.

The first step is to cast the matches into a dictionary object, d, where the key is the match and the value is the replacement string.

d = {}

for match in MATCHES:
    opener = '<statute>'
    closer = '</statute>'
    replacement = opener + match + closer
    d[match] = replacement

print (d)
{'Fraud Act 2006': '<statute>Fraud Act 2006</statute>', 'Criminal Justice Act 2003': '<statute>Criminal Justice Act 2003</statute>', 'Police and Criminal Evidence Act 1984': '<statute>Police and Criminal Evidence Act 1984</statute>'}

The single pass substitution is handled in the following function, replace(). replace() takes two arguments: the source text and the dictionary of substitutions.

def replace(string, substitutions):

    # Sort the match strings longest first so that longer statute names are
    # substituted before any shorter names they might contain.
    substrings = sorted(substitutions, key=len, reverse=True)
    # Build one alternation pattern from the escaped match strings and swap in
    # each tagged replacement in a single pass over the source text.
    regex = re.compile('|'.join(map(re.escape, substrings)))
    return regex.sub(lambda match: substitutions[match.group(0)], string)

output = replace(text_block, d)
str(output).strip('\n')
'Section 101 of the <statute>Criminal Justice Act 2003</statute> is interesting, much like section 3 of the <statute>Fraud Act 2006</statute>.\nThe <statute>Police and Criminal Evidence Act 1984</statute> is also a real favourite of mine.'

Find rows containing specific values in a Pandas Dataframe

Suppose we have a pandas DataFrame, df, with a column called Name.

The data in the column looks like this:

Name
----
Statute of Westminster, The First (1275)
1275 c. 5
Statute of Marlborough 1267 [Waste]
1267 c. 23
The Statute of Marlborough 1267 [Distress]
1267 c. 1

If we want to exclude all rows that contain something like 1275 c. 5 from the dataframe, we can use the pandas str.contains() method in combination with a regular expression, like so:

df = df[df['Name'].str.contains(r'\d{4}\s+[A-Za-z]') == False]

This is basically saying that df is now equal to all of the rows in df where a match on our regular expression returns False.

Setting == True would have the opposite effect, retaining rows that match the expression and dropping those that do not.
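
For example, flipping the comparison keeps only the rows that look like chapter references and drops the statute titles:

# Keep only the chapter-number rows (e.g. "1275 c. 5")
chapters_only = df[df['Name'].str.contains(r'\d{4}\s+[A-Za-z]') == True]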

Regex to find references to legislation in a block of text

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re
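
# Broadly, the pattern below looks for two, three or four capitalised words
# followed by "Act" or "Order", requiring a four-digit year in the two- and
# four-word cases. The whole expression sits inside a lookahead, so each match
# is zero-width and the candidate reference is captured in group 1.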

regex = r"(?=((?<![A-Z][a-z])(([A-Z][a-z]+[\s-][A-Z][a-z]*\s)(Act|Order))(\s\d{4})|(([A-Z][a-z]+[\s-][A-Z][a-z]+[\s-][A-Z][a-z]*\s)(Act|Order))|(([A-Z][a-z]+[\s-][A-Z][a-z]+[\s-][A-Z][a-z]+[\s-][A-Z][a-z]*\s)(Act|Order))\s\d{4}))"

test_str = "The claimant was released on licence after serving part of an extended sentence of imprisonment. Subsequently the Secretary of State revoked the claimant’s licence and recalled him to prison pursuant to section 254 of the Criminal Justice Act 2003[1] on the grounds that he had breached two of the conditions of his licence.  Police And Criminal Evidence Act 1984The Secretary of State referred the matter to the Parole Board, providing it with a dossier which contained among other things material which had been prepared for, but not used in, the claimant’s trial in the Crown Court. The material contained allegations of a number of further offences in relation to which the claimant had not been convicted, no indictment in relation to them having ever been pursued. The claimant, relying upon the guidance contained in paragraph 2 of Appendix Q to Chapter 8 of Prison Service Order 6000[2], submitted to the board that since the material contained pre-trial prosecution evidence it ought to be excluded from the dossier placed before the panel of the board responsible for considering his release. The board determined that it had no power to exclude the material and that it would be for the panel to determine questions of relevance and weight in relation to it. The claimant sought judicial review of the board’s decision and the Human Rights Act 1998. "

matches = re.finditer(regex, test_str)

for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

Regex to match upper case words

Here's a short demonstration of how to use a regular expression to identify UPPERCASE words in a bunch of text files. 

The goal in this particular snip is to open and read all of the .rtf files in a given directory and identify only the UPPERCASE words appearing in the file.

import os
import re

directory = '/path/to/files'

regex = r"\b[A-Z][A-Z]+\b"

for filename in os.listdir(directory):
    if filename.endswith(".rtf"):
        # os.listdir() returns bare filenames, so join them back onto the directory path
        with open(os.path.join(directory, filename), 'r') as f:
            transcript = f.read()
            matches = re.finditer(regex, transcript)
            for match in matches:
                print (match[0])
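
If each uppercase word is only needed once per file, a small variation (reusing the regex and transcript variables from the snippet above) is to swap re.finditer() for re.findall() and deduplicate with a set:

# Collect one entry per distinct uppercase word in the transcript
unique_words = sorted(set(re.findall(regex, transcript)))
print (unique_words)

Because the pattern contains no capture groups, re.findall() returns the matched strings themselves, so the set leaves a single entry for each distinct word.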