Text processing

Marking up references to UK legislation in unstructured text

This is a quick example of a reliable (but not particularly extensible) implementation for marking up references to UK primary legislation in an unstructured input text.

The implementation is reliable because it uses a list of statute short titles derived from legislation.gov.uk and searches the source text for each title as a literal regular expression pattern. This, however, makes the implementation a bit of a bore to extend, because the list does not account for new pieces of legislation added to the statute book since the code was written!

Set up environment

import bs4 as BeautifulSoup
import urllib3
import pandas as pd
import csv
import re

http = urllib3.PoolManager()

Core objects and URL creation

Create objects that will be used throughout the script and build a list of target legislation.gov.uk URLs.


# Container for the URLs of the search result pages
TARGETS = []

# This is a little bit hacky. I'm essentially running an empty search on
# legislation.gov.uk and iterating over each page in the results.
scope = range(1, 289)
base_url = "http://www.legislation.gov.uk/primary?"

for page in scope:
    target_url = base_url + "page=" + str(page)
    TARGETS.append(target_url)

Perform the scrape

Scrape each target, pulling in the required text content from the legislation.gov.uk results table.

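# Container for the text captured from the results tables
STATUTES = []

for target in TARGETS:
    response = http.request('GET', target)
    soup = BeautifulSoup.BeautifulSoup(response.data, "html.parser")

    # The links we want sit inside the table cells
    td = soup.find_all('td')
    for i in td:
        children = i.find_all("a", recursive=True)
        for child in children:
            statute_name = child.text
            STATUTES.append(statute_name)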

Output >>>

['Domestic Gas and Electricity (Tariff Cap) Act 2018',
 '2018\xa0c. 21',
 'Northern Ireland Budget Act 2018',
 '2018\xa0c. 20',
 'Haulage Permits and Trailer Registration Act 2018']

Clean the captured data and store it

The scrape pulls in unwanted material in the form of chapter numbers owing to lack of precision in the source markup. Unwanted captures are dropped using a regular expression and the data is stored in a pd.DataFrame, df.

df = pd.DataFrame()
df['Statute_Name'] = STATUTES
df = df[~df['Statute_Name'].str.contains(r'\d{4}\s+[A-Za-z]')]

0 Domestic Gas and Electricity (Tariff Cap) Act ...
2 Northern Ireland Budget Act 2018
4 Haulage Permits and Trailer Registration Act 2018
6 Automated and Electric Vehicles Act 2018
8 Supply and Appropriation (Main Estimates) Act ...
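
The filter can be sanity-checked against the sample output shown earlier without pandas. This is only a sketch: the list below is copied from that output, and the names sample, chapter_ref and titles are made up for the illustration.

```python
import re

# Sample rows copied from the scrape output above
sample = [
    'Domestic Gas and Electricity (Tariff Cap) Act 2018',
    '2018\xa0c. 21',
    'Northern Ireland Budget Act 2018',
    '2018\xa0c. 20',
    'Haulage Permits and Trailer Registration Act 2018',
]

# A year followed by whitespace and a letter marks a chapter number,
# e.g. '2018 c. 21'; \s also matches the non-breaking space \xa0
chapter_ref = re.compile(r'\d{4}\s+[A-Za-z]')

titles = [row for row in sample if not chapter_ref.search(row)]
print(titles)
```

Short titles survive because the year is the last thing in the string, so nothing follows it for the pattern to match.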

Sample text to apply the legislation extractor against

text_block = """Section 101 of the Criminal Justice Act 2003 is interesting, much like section 3 of the Fraud Act 2006.
The Police and Criminal Evidence Act 1984 is also a real favourite of mine."""


Get matching statutes

To identify matching statutes, the list of statutes created from the scrape is iterated over. The name of each statute forms the basis of the expression, with matches stored in a list, MATCHES.


MATCHES = []

for statute in df['Statute_Name']:
    # Escape the title so that brackets and full stops are matched literally
    my_regex = re.escape(statute)
    match = re.search(my_regex, text_block)

    if match is not None:
        MATCHES.append(match[0])
        print(match[0])

Fraud Act 2006
Criminal Justice Act 2003
Police and Criminal Evidence Act 1984

Mark up the matched statutes in the source text

The aim here is to enclose the captured statutes in <statute> tags. To do this, we need to make multiple substitutions in a single string on a single pass.

The first step is to cast the matches into a dictionary object, d, where the key is the match and the value is the replacement string.

d = {}

for match in MATCHES:
    opener = '<statute>'
    closer = '</statute>'
    replacement = opener + match + closer
    d[match] = replacement

print (d)
{'Fraud Act 2006': '<statute>Fraud Act 2006</statute>', 'Criminal Justice Act 2003': '<statute>Criminal Justice Act 2003</statute>', 'Police and Criminal Evidence Act 1984': '<statute>Police and Criminal Evidence Act 1984</statute>'}

The single pass substitution is handled in the following function, replace(). replace() takes two arguments: the source text and the dictionary of substitutions.

def replace(string, substitutions):

    substrings = sorted(substitutions, key=len, reverse=True)
    regex = re.compile('|'.join(map(re.escape, substrings)))
    return regex.sub(lambda match: substitutions[match.group(0)], string)

output = replace(text_block, d)
'Section 101 of the <statute>Criminal Justice Act 2003</statute> is interesting, much like section 3 of the <statute>Fraud Act 2006</statute>.\nThe <statute>Police and Criminal Evidence Act 1984</statute> is also a real favourite of mine.'
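
One detail of replace() worth a second look is the sort by length. Python's regular expression alternation tries the alternatives from left to right and keeps the first one that matches, so if one key is a prefix of another, the shorter key can win. The sketch below illustrates this with two hypothetical keys (real short titles end in a year, so "Fraud Act" on its own is an assumption made purely for the example).

```python
import re

text = "See section 3 of the Fraud Act 2006."

# Hypothetical keys: the first is a prefix of the second
keys = ["Fraud Act", "Fraud Act 2006"]

# Alternatives in list order: the shorter key is tried first and wins
naive = re.compile("|".join(map(re.escape, keys)))
naive_hit = naive.search(text).group(0)

# Longest first, as in replace(): the full short title wins
ordered = re.compile("|".join(map(re.escape, sorted(keys, key=len, reverse=True))))
ordered_hit = ordered.search(text).group(0)

print(naive_hit)    # Fraud Act
print(ordered_hit)  # Fraud Act 2006
```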

Use a variable in a regular expression

Sometimes you want to pass an object into a regular expression rather than explicitly state the pattern you're looking to match against.

An example of when you might want to do this is when you have a list of words and you want to iterate over a text and look for matches against the words in that list. 

Here's how this is done:

import re

subject = "In the room women come and go, talking of Michelangelo."

words = ['room', 'talking', 'Michelangelo']

for word in words:
    my_regex = r"\b(?=\w)" + re.escape(word) + r"\b(?!\w)"
    if re.search(my_regex, subject, re.IGNORECASE):
        print (word, ' found in the subject')
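
If the list is long, one possible variation is to build a single pattern from all of the words and scan the text once. This is a sketch rather than a drop-in replacement: it assumes every word starts and ends with a letter or digit (so a plain \b boundary is enough), and 'teacup' is a made-up extra word included to show a miss.

```python
import re

subject = "In the room women come and go, talking of Michelangelo."

words = ['room', 'talking', 'Michelangelo', 'teacup']

# One alternation built from the escaped words, wrapped in word boundaries
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE)

# findall() returns every match in the order it appears in the subject
found = pattern.findall(subject)
print(found)  # ['room', 'talking', 'Michelangelo']
```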

Find most common words in a corpus

This little sample demonstrates several basic text processing steps with a corpus of text files stored in a local directory. 

  • First, we read the corpus of text files into a list
  • Second, we knock out unwanted stuff, like things that aren't actually words and words that only consist of a single character
  • Third, we use standard NLTK stopwords and a list of custom stopwords, to strip out noise from the corpus
  • Finally, we use NLTK to calculate the most common words in each file in the corpus

import glob
import os
import re

from nltk import FreqDist
from nltk.corpus import stopwords

# Bring in the default English NLTK stop words
stoplist = stopwords.words('english')

# Define additional stopwords in a string
additional_stopwords = """case law lawful judge judgment court mr justice would evidence mr order 
defendant act decision make two london appeal section lord one applicant mr. may could said also application whether 
made time first r miss give appellant november give fact within point sentence question upon matter 
leave part given must notice public state taken course cannot circumstances j that, offence set 
behalf however plaintiff see set say secretary regard - v claim right appeared second put e way material
view relation effect take might particular however, present court) october b reasons basis far 
referred trial found lord, land consider authority subject necessary considered 0171 see,s 
council think legal shall respect ground three case, crown without 2 relevant and, special business told clear
paragraph person account letter therefore jury th solicitor use years mrs mr provision discretion
matters respondent concerned cases defence reason issue well count argument facts gave proceedings 
position period needs approved used power us limited even either exercise counsel applicants submission
although counsel submitted st need appellants plaintiffs policy thomas making tribunal action entitled affadavit
december strand daniel transcript smith purpose refused offence offences general counts terms grounds conclusion number reasonable 
prosecution home hearing seems defendants educational clarke solicitors criminal following accept place come
already accepted required words local l;ater january provided stage report street september day sought greenwood
rather service accounts page hobhouse courts march third wilcock mind result months came learned appropriate date instructed
form division notes july went bernal official review principle consideration affidavit held lordship another dr different
notes quite royal possible instructed shorthand development amount has months wc respondents took clearly since find
satisfied members later fleet took interest parties name change information co sum ec done provisions party hd paid"""

# Split the additional stopwords string on each word and then add
# those words to the NLTK stopwords list
stoplist += additional_stopwords.split()

# Define the files that make up the corpus to be modelled

file_list = glob.glob(os.path.join('/Users/danielhoadley/PycharmProjects/topicvis', '*.txt'))

# Construct an empty list into which the content of each file will be stored as an item

corpus = []

# Read the files

for file_path in file_list:
    with open(file_path) as f_input:
        content = f_input.read()
        only_words = re.sub("[^a-zA-Z]", " ", content) # Remove anything that isn't a 'word'
        no_single = re.sub(r'(?:^| )\w(?:$| )', ' ', only_words).strip() # Remove any words consisting of a single character
        corpus.append(no_single) # Store the cleaned text as one item in the corpus

# Remove stopwords

texts = [[word for word in document.lower().split() if word not in stoplist] for document in corpus]

# Get the most common words in each text

for text in texts:
    fdist = FreqDist(text)
    print (fdist.most_common(2))
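
To try the same steps without a directory of files or the NLTK data, here is a minimal stand-alone sketch. It swaps FreqDist for collections.Counter from the standard library (FreqDist offers the same most_common() method), and the two documents, the tiny stoplist and the top_words() helper are all invented for the illustration.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the files in the corpus
documents = [
    "The court held that the contract was void. The contract was never signed.",
    "A tenant may assign a lease; the lease binds the assignee.",
]

# Hypothetical, tiny stoplist; the script above uses the NLTK list plus custom words
stoplist = {"the", "that", "was", "a", "may", "never", "court", "held"}

def top_words(document, n=2):
    only_words = re.sub("[^a-zA-Z]", " ", document)   # keep letters only
    tokens = [word for word in only_words.lower().split()
              if len(word) > 1 and word not in stoplist]  # drop single characters and stopwords
    return Counter(tokens).most_common(n)

results = [top_words(document) for document in documents]
for result in results:
    print(result)
```

The most frequent word comes out as 'contract' for the first document and 'lease' for the second, each with a count of 2.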

Strip XML tags out of file

This is a quick and dirty example of using a regular expression to remove XML tags from an XML file.

Suppose we have the following XML, sample.xml:

<body>Don't forget me this weekend!</body>

What we want to do is use Python to strip out the XML element tags, so that we're left with something like this:

Don't forget me this weekend!

Here's how to do it:

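import re

# Read the XML, then drop anything that looks like a tag
with open("sample.xml") as source:
    text = re.sub('<[^<]+>', "", source.read())

# Write the remaining text to a new file
with open("output.txt", "w") as f:
    f.write(text)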
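
The regular expression is fine for simple, well-behaved markup, but it knows nothing about XML itself (entities, comments, CDATA and so on). Where that matters, one alternative is to let a parser do the work. The sketch below uses xml.etree.ElementTree from the standard library; the nested sample string is a made-up variation on sample.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with a nested element
xml_source = "<note><body>Don't forget <b>me</b> this weekend!</body></note>"

# Parse the markup, then join every piece of text in document order
root = ET.fromstring(xml_source)
text = "".join(root.itertext())

print(text)  # Don't forget me this weekend!
```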

Using NLTK to remove stopwords from a text file

Text processing invariably requires that some words in the source corpus be removed before moving on to more complex tasks (such as keyword extraction, summarisation and topic modelling).

The sorts of words to be removed will typically include words that do not of themselves confer much semantic value (e.g. the, it, a, etc). The task in hand may also require additional, specialist words to be removed. This example uses NLTK to bring in a list of core English stopwords and then adds additional custom stopwords to the list. 

from nltk.corpus import stopwords

# Bring in the default English NLTK stop words
stoplist = stopwords.words('english')

# Define additional stopwords in a string
additional_stopwords = """case judge judgment court"""

# Split the additional stopwords string on each word and then add
# those words to the NLTK stopwords list
stoplist += additional_stopwords.split()

# Open a file and read it into memory
with open('sample.txt') as file:
    text = file.read()

# Apply the stoplist to the text
clean = [word for word in text.split() if word not in stoplist]

It's worth looking at a couple of discrete aspects of this code to see what's going on.

The stoplist object is storing the NLTK English stopwords as a list:

stoplist = stopwords.words('english')

print (stoplist)

>>> ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your'...]

Then, we're adding our additional stopwords as individual tokens in a string object, additional_stopwords, and using split() to break that string down into individual tokens in a list object:

stoplist += additional_stopwords.split()

The above line of code updates the original stoplist object with the additional stopwords.

The text being passed in is a simple text file, which reads:

this is a case in which a judge sat down on a chair

When we pass the text through our list comprehension, our output is:

print (clean)

>>> ['sat', 'chair']
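
One thing to watch for: the lookup is an exact string comparison, so capital letters and trailing punctuation will let stopwords slip through. The sketch below shows the effect and one possible way around it. The small stoplist is a hypothetical stand-in for the NLTK list plus the custom words, so that the example runs without the NLTK data.

```python
import string

# Hypothetical stoplist standing in for the NLTK list plus the custom words
stoplist = {"this", "is", "a", "in", "which", "down", "on", "case", "judge"}

text = "This is a case in which a Judge sat down on a chair."

# Plain split(): 'This', 'Judge' and 'chair.' do not match the lower-case entries
naive = [word for word in text.split() if word not in stoplist]

# Lower-case each token and trim surrounding punctuation before the lookup
tokens = [word.strip(string.punctuation).lower() for word in text.split()]
clean = [word for word in tokens if word and word not in stoplist]

print(naive)  # ['This', 'Judge', 'sat', 'chair.']
print(clean)  # ['sat', 'chair']
```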





Calculating cosine similarity between documents

This script calculates the cosine similarity between several text documents. At scale, this method can be used to identify similar documents within a larger corpus.

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

# Bring in standard stopwords
stopWords = stopwords.words('english')

print ("\nCalculating document similarity scores...")

# Open and read a bunch of files
with open('/Users/Daniel/Documents/Development/Transcript-Data/Test/admin/0102HD41.txt') as f:
    doc1 = f.read()

with open('/Users/Daniel/Documents/Development/Transcript-Data/Test/admin/0107HD42.txt') as f:
    doc2 = f.read()

with open('/Users/Daniel/Documents/Development/Transcript-Data/Test/admin/0107HD40.txt') as f:
    doc3 = f.read()

# Create a string to use to test the similarity scoring

train_string = 'By these proceedings for judicial review the Claimant seeks to challenge the decision of the Defendant dated the 23rd of May 2014 refusing the Claimant’s application of the 3rd of January 2012 for naturalisation as a British citizen'

# Construct the training set as a list
train_set = [train_string, doc1, doc2, doc3]

# Set up the vectoriser, passing in the stop words
tfidf_vectorizer = TfidfVectorizer(stop_words=stopWords)

# Apply the vectoriser to the training set
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)

# Print the score
print ("\nSimilarity Score [*] ",cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train))
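
To see what the score actually measures, here is a hand-rolled sketch of cosine similarity: the dot product of two term vectors divided by the product of their lengths. It deliberately uses raw word counts rather than TF-IDF weights, so the numbers will not match scikit-learn's output, and the cosine() helper and the miniature documents are invented for the illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors held as Counters."""
    # Only terms present in both documents contribute to the dot product
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical miniature documents
doc_a = Counter("judicial review of the decision".split())
doc_b = Counter("review of the decision refused".split())
doc_c = Counter("apples and oranges".split())

score_ab = cosine(doc_a, doc_b)  # four of the five words are shared
score_ac = cosine(doc_a, doc_c)  # no shared words

print(score_ab, score_ac)
```

Documents with nothing in common score 0, and the score approaches 1 as the documents use the same words in the same proportions.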

Create inverted index of sentences in text files

The purpose of this script is to identify instances in which any given sentence in any given document within a corpus appears in other documents within the corpus.

For example, suppose we have a corpus of three text documents (document_x, document_y and document_z). Each document consists of one or many sentences.

Document_x consists of multiple sentences, one of which reads:

This sentence is about apples.

Another sentence in document_x reads:

However, unlike the other quoted sentence, this sentence is about oranges.

It just so happens that document_y contains the following sentence, amongst others:

This sentence is about apples.

We could construct an index where the sentence is the key and the list of files in which it appears is the value, for example:

Sentence | Files
This sentence is about apples. | document_x, document_y
However, unlike the other quoted sentence, this sentence is about oranges. | document_x

The following code example performs this indexing task. For an added twist, the resulting dictionaries are written to a MongoDB database on my local machine.

## (c) 2017. Daniel Hoadley
## Tokenize text files into sentences and writes the output to MongoDB as key(sentence)/value(source filename) pairs.
## Start mongodb instance first with ./mongod

import json
from pymongo import MongoClient
import nltk.data
import codecs
import os
from nltk.tokenize import sent_tokenize

## Run clean.py before executing this script.

## Connect to MongoDB instance and create new database/collection

client = MongoClient('localhost', 27017)
db = client['test-database']
collection = db['sentences']

# Create empty dictionary object

d = {}

# Read the source files

directory = '/Users/danielhoadley/Documents/Development/Python/regex'

for filename in os.listdir(directory):
    if filename.endswith('.cln'):
        # Build the full path so the script works from any working directory
        source = codecs.open(os.path.join(directory, filename), 'r', 'utf-8')
        content = source.read()
        name = filename
        source.close()

        # Tokenise the source file into sentences

        sents = sent_tokenize(content)
        print(content)
        print(sents)

        # Deduplicate the list of sentences to remove instances where a sentence appears multiple times in the same case
        deduped_sents = list(set(sents))

        # Clean the list because mongo doesn't like fullpoints in the key

        clean_sents = map(lambda each:each.strip(u'.'),deduped_sents)
        fresh_sents = map(lambda each:each.strip(),clean_sents)
        cleaned = [word.replace(':', '.') for word in fresh_sents]

        # Populate the empty dictionary with the sentences as keys and the filename as a value

        for i in cleaned:
            d.setdefault(i, []).append(name)

# Remove keys that are 50 characters or fewer in length
# (iterate over a copy of the keys, because the dictionary changes size)

for k in list(d.keys()):
    if len(k) <= 50:
        del d[k]

# Iterate over the dictionary and write each key/value pair to MongoDB as an object

for key, value in d.items():
    sentence_id = db.sentences.insert_one({'sentence': key, 'files': value})

# Dump the output to the console so I can eyeball it

print(json.dumps(d, sort_keys=True, indent=4)) # output the dictionary as prettified json

print('\nSentences extracted and written to MongoDB!\n')
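
The indexing idea itself can be tried without MongoDB or NLTK. The sketch below is a stripped-down illustration under a few assumptions: the corpus is an in-memory dictionary rather than a directory of .cln files, and sentences are split with a naive regular expression instead of sent_tokenize. The names corpus, index and shared are made up for the example.

```python
import re

# Hypothetical in-memory corpus standing in for the .cln files
corpus = {
    "document_x": "This sentence is about apples. However, this sentence is about oranges.",
    "document_y": "Something else entirely. This sentence is about apples.",
}

index = {}
for name, content in corpus.items():
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", content.strip())
    # set() means a sentence repeated within one file is only recorded once
    for sentence in set(sentences):
        index.setdefault(sentence, []).append(name)

# Keep only the sentences that turn up in more than one file
shared = {sentence: files for sentence, files in index.items() if len(files) > 1}
print(shared)  # {'This sentence is about apples.': ['document_x', 'document_y']}
```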

Convert text files to lower case in Python

When mining text it often helps to convert the entire text to lower case as part of your pre-processing stage. Here's an example of a Python script that does just that for every text file in a directory:

import os

directory = '/path/to/the/directory'

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        # Build the full path so the script works from any working directory
        path = os.path.join(directory, filename)
        with open(path, 'r') as f:
            text = f.read()
        # Overwrite the file with its lower-case text
        with open(path, 'w') as out:
            out.write(text.lower())