Data

Delete (drop) a column from a Pandas dataframe

Here's how to delete a column from a Pandas dataframe (assume we already have a dataframe, df):

df.drop('name of column', axis=1)

The axis=1 argument simply signals that we want to delete a column as opposed to a row.

To delete multiple columns in one go, just pass a list in, like this:

df.drop(['name of column', 'name of another column'], axis=1)
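If you're on a reasonably recent version of Pandas, you can skip the axis argument altogether and use the columns keyword instead. A minimal sketch (remember too that drop() returns a new dataframe rather than modifying df in place, unless you pass inplace=True):

df = df.drop(columns=['name of column', 'name of another column'])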

Marking up references to UK legislation in unstructured text

This is a quick example of a reliable (but not a particularly extensible) implementation for marking up references to UK primary legislation in an unstructured input text.

The implementation is reliable, because it uses a list of statute short titles derived from legislation.gov.uk and passes these into the source text as straightforward regular expression search patterns. This, however, makes the implementation a bit of a bore to extend, because the code does not account for new pieces of legislation added to the statute book since the code was written!

Set up environment

from bs4 import BeautifulSoup
import urllib3
import pandas as pd
import csv
import re

http = urllib3.PoolManager()

Core objects and URL creation

Create objects that will be used throughout the script and build a list of target legislation.gov.uk URLs.

TARGETS = []
STATUTES = []

#This is a little bit hacky. I'm essentially running an empty search on
#legislation.gov.uk and iterating over each page in the results.
scope = range(1,289)
base_url = "http://www.legislation.gov.uk/primary?"

for page in scope:
    target_url = base_url + "page=" + str(page)
    TARGETS.append(target_url)

Perform the scrape

Scrape each target, pulling in the required text content from the legislation.gov.uk results table.

for target in TARGETS:
    response = http.request('GET', target)
    soup = BeautifulSoup(response.data, "html.parser")

    td = soup.find_all('td')
    for i in td:

        children = i.findChildren("a", recursive=True)
        for child in children:
            statute_name = child.text
            STATUTES.append(statute_name)
STATUTES[:5]

Output >>>

['Domestic Gas and Electricity (Tariff Cap) Act 2018',
 '2018\xa0c. 21',
 'Northern Ireland Budget Act 2018',
 '2018\xa0c. 20',
 'Haulage Permits and Trailer Registration Act 2018']

Clean the captured data and store it

The scrape pulls in unwanted material in the form of chapter numbers owing to lack of precision in the source markup. Unwanted captures are dropped using a regular expression and the data is stored in a pd.DataFrame, df.

df = pd.DataFrame()
df['Statute_Name'] = STATUTES
df = df[df['Statute_Name'].str.contains(r'\d{4}\s+[A-Za-z]') == False]
df.to_csv('statutes.csv')
df.head()

Statute_Name
0 Domestic Gas and Electricity (Tariff Cap) Act ...
2 Northern Ireland Budget Act 2018
4 Haulage Permits and Trailer Registration Act 2018
6 Automated and Electric Vehicles Act 2018
8 Supply and Appropriation (Main Estimates) Act ...

Sample text to apply the legislation extractor against

text_block = """

Section 101 of the Criminal Justice Act 2003 is interesting, much like section 3 of the Fraud Act 2006.
The Police and Criminal Evidence Act 1984 is also a real favourite of mine.

"""

Get matching statutes

To identify matching statutes, the list of statutes created from the scrape is iterated over. The name of each statute forms the basis of a regular expression search pattern, with any matches stored in a list, MATCHES.

MATCHES = []

for statute in df['Statute_Name']:
    my_regex = re.escape(statute)
    match = re.search(my_regex, text_block)

    if match is not None:
        MATCHES.append(match[0])
        print (match[0])
Fraud Act 2006
Criminal Justice Act 2003
Police and Criminal Evidence Act 1984

Mark up the matched statutes in the source text

The aim here is to enclose the captured statutes in <statute> tags. To do this, we need to make multiple substitutions in a single string on a single pass.

The first step is to cast the matches into a dictionary object, d, where the key is the match and the value is the replacement string.

d = {}

for match in MATCHES:
    opener = '<statute>'
    closer = '</statute>'
    replacement = opener + match + closer
    d[match] = replacement

print (d)
{'Fraud Act 2006': '<statute>Fraud Act 2006</statute>', 'Criminal Justice Act 2003': '<statute>Criminal Justice Act 2003</statute>', 'Police and Criminal Evidence Act 1984': '<statute>Police and Criminal Evidence Act 1984</statute>'}

The single pass substitution is handled in the following function, replace(). replace() takes two arguments: the source text and the dictionary of substitutions.

def replace(string, substitutions):

    substrings = sorted(substitutions, key=len, reverse=True)
    regex = re.compile('|'.join(map(re.escape, substrings)))
    return regex.sub(lambda match: substitutions[match.group(0)], string)

output = replace(text_block, d)
str(output).strip('\n')
'Section 101 of the <statute>Criminal Justice Act 2003</statute> is interesting, much like section 3 of the <statute>Fraud Act 2006</statute>.\nThe <statute>Police and Criminal Evidence Act 1984</statute> is also a real favourite of mine.'

Find child elements using BeautifulSoup

Suppose you're attempting to scrape a slab of HTML that looks a bit like this:

<tr class="oddRow">
  <td>
  <a href="/ukpga/2018/21/contents/enacted">Domestic Gas and Electricity (Tariff Cap) Act 2018</a>
  </td>
  <td>
    <a href="/ukpga/2018/21/contents/enacted">2018 c. 21</a>
  </td>
  <td>UK Public General Acts</td>
</tr>
<tr>
  <td>
    <a href="/ukpga/2018/20/contents/enacted">Northern Ireland Budget Act 2018</a>
  </td>
  <td>
    <a href="/ukpga/2018/20/contents/enacted">2018 c. 20</a>
  </td>

The bit you're looking to scrape is contained in an <a> tag that sits as a child of a <td> tag, i.e. Northern Ireland Budget Act 2018.

Now, for all you know, there are going to be <a> elements all over the page, many of which you have no interest in. Because of this, something like stuff = soup.find_all('a') is no good.

What you really need to do is limit your scrape to only those <a> tags that have a <td> tag as their parent.

Here's how you do it:

td = soup.find_all('td') # Find all the td elements on the page

for i in td:

    # Call .findChildren() on each item in the td list
    children = i.findChildren("a", recursive=True)

    # Iterate over the list of children, accessing the .text attribute on each child
    for child in children:
        what_i_want = child.text
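If you prefer CSS selectors, BeautifulSoup's select() method gets to the same place in one call; a minimal sketch along the same lines:

# 'td a' matches every <a> element sitting anywhere inside a <td>
for child in soup.select('td a'):
    what_i_want = child.text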

Query Neo4j graph database with Requests

This snippet provides a quick example of how to send a simple Cypher query to Neo4j with Python:

import requests

url = "http://localhost:7474/db/data/cypher"

payload = "{ \"query\" : \"MATCH p=()-[r:CONSIDERED_BY]->() RETURN p LIMIT 25\",\n  \"params\" : { } }"
headers = {
    'authorization': "Basic bmVvNGo6Zm9vYmFy",
    'cache-control': "no-cache"
    }

response = requests.request("POST", url, data=payload, headers=headers)

print(response.text)

Note that the 'authorization': "Basic bmVvNGo6Zm9vYmFy" bit in the headers {} section is a Base64 encoded representation of the username and password of my local Neo4j instance: neo4j:foobar

You can encode your own Neo4j username:password combination here: https://www.base64encode.org/
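If you'd rather not rely on a website, you can build the Base64 string in Python itself with the standard library. A minimal sketch, using the same neo4j:foobar credentials as above:

import base64

credentials = base64.b64encode(b"neo4j:foobar").decode("ascii")

headers = {
    'authorization': "Basic " + credentials,
    'cache-control': "no-cache"
    }

Alternatively, Requests will construct the header for you if you pass auth=('neo4j', 'foobar') as an argument to requests.request().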

Regex to find references to legislation in a block of text

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(?=((?<![A-Z][a-z])(([A-Z][a-z]+[\s-][A-Z][a-z]*\s)(Act|Order))(\s\d{4})|(([A-Z][a-z]+[\s-][A-Z][a-z]+[\s-][A-Z][a-z]*\s)(Act|Order))|(([A-Z][a-z]+[\s-][A-Z][a-z]+[\s-][A-Z][a-z]+[\s-][A-Z][a-z]*\s)(Act|Order))\s\d{4}))"

test_str = "The claimant was released on licence after serving part of an extended sentence of imprisonment. Subsequently the Secretary of State revoked the claimant’s licence and recalled him to prison pursuant to section 254 of the Criminal Justice Act 2003[1] on the grounds that he had breached two of the conditions of his licence.  Police And Criminal Evidence Act 1984The Secretary of State referred the matter to the Parole Board, providing it with a dossier which contained among other things material which had been prepared for, but not used in, the claimant’s trial in the Crown Court. The material contained allegations of a number of further offences in relation to which the claimant had not been convicted, no indictment in relation to them having ever been pursued. The claimant, relying upon the guidance contained in paragraph 2 of Appendix Q to Chapter 8 of Prison Service Order 6000[2], submitted to the board that since the material contained pre-trial prosecution evidence it ought to be excluded from the dossier placed before the panel of the board responsible for considering his release. The board determined that it had no power to exclude the material and that it would be for the panel to determine questions of relevance and weight in relation to it. The claimant sought judicial review of the board’s decision and the Human Rights Act 1998. "

matches = re.finditer(regex, test_str)

for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

Create a Gensim Corpus for text files in a local directory

This snippet creates a Gensim corpus from text files stored in a local directory:

import os, gensim

def iter_documents(top_directory):
    """Iterate over all documents, yielding a document (=list of utf8 tokens) at a time."""
    for root, dirs, files in os.walk(top_directory):
        for file in filter(lambda file: file.endswith('.txt'), files):
            document = open(os.path.join(root, file)).read() # read the entire document, as one big string
            yield gensim.utils.tokenize(document, lower=True) # or whatever tokenization suits you

class MyCorpus(object):
    def __init__(self, top_dir):
        self.top_dir = top_dir
        self.dictionary = gensim.corpora.Dictionary(iter_documents(top_dir))
        self.dictionary.filter_extremes(no_below=1, keep_n=30000) # check API docs for pruning params

    def __iter__(self):
        for tokens in iter_documents(self.top_dir):
            yield self.dictionary.doc2bow(tokens)

corpus = MyCorpus('/path/to/files') # create the corpus (the dictionary is built in __init__)
for vector in corpus: # convert each document to a bag-of-word vector
    print (vector)

Strip XML tags out of file

This is a quick and dirty example of using a regular expression to remove XML tags from an XML file.

Suppose we have the following XML, sample.xml:

<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

What we want to do is use Python to strip out the XML element tags, so that we're left with something like this:

Tove
Jani
Reminder
Don't forget me this weekend!

Here's how to do it:

import re

text = re.sub('<[^<]+>', "", open("sample.xml").read())
with open("output.txt", "w") as f:
    f.write(text)
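The regular expression approach is fine for a small, well-formed file like this. If you want something a little more robust, the standard library's ElementTree will pull the text out for you; a quick sketch along the same lines:

import xml.etree.ElementTree as ET

root = ET.parse("sample.xml").getroot()

# itertext() yields the text content of every element in document order
text = "\n".join(t.strip() for t in root.itertext() if t.strip())

with open("output.txt", "w") as f:
    f.write(text)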

Using NLTK to remove stopwords from a text file

Text processing invariably requires that some words in the source corpus be removed before moving on to more complex tasks (such as keyword extraction, summarisation and topic modelling).

The sorts of words to be removed will typically include words that do not of themselves confer much semantic value (e.g. the, it, a, etc). The task in hand may also require additional, specialist words to be removed. This example uses NLTK to bring in a list of core English stopwords and then adds additional custom stopwords to the list. 

from nltk.corpus import stopwords

# Bring in the default English NLTK stop words
stoplist = stopwords.words('english')

# Define additional stopwords in a string
additional_stopwords = """case judge judgment court"""

# Split the additional stopwords string on each word and then add
# those words to the NLTK stopwords list
stoplist += additional_stopwords.split()

# Open a file and read it into memory
file = open('sample.txt')
text = file.read()

# Apply the stoplist to the text
clean = [word for word in text.split() if word not in stoplist]

It's worth looking at a couple of discrete aspects of this code to see what's going on.

The stoplist object is storing the NLTK English stopwords as a list:

stoplist = stopwords.words('english')

print (stoplist)

>>> ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your'...]

Then we're storing our additional stopwords in a string object, additional_stopwords, and using split() to break that string down into a list of individual tokens:

stoplist += additional_stopwords.split()

The above line of code updates the original stoplist object with the additional stopwords.

The text being passed in is a simple text file, which reads:

this is a case in which a judge sat down on a chair

When we pass the text through our list comprehension, our output is:

print (clean)

>>> ['sat', 'chair']
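One thing to watch: the NLTK stopwords are all lowercase, so with real text (which will contain capitalised words like 'The') it's worth lowercasing each token before checking it against the stoplist. A small tweak to the comprehension above does the trick:

clean = [word for word in text.split() if word.lower() not in stoplist]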

Calculating cosine similarity between documents

This script calculates the cosine similarity between several text documents. At scale, this method can be used to identify similar documents within a larger corpus.

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

# Bring in standard stopwords
stopWords = stopwords.words('english')

print ("\nCalculating document similarity scores...")

# Open and read a bunch of files 
f = open('/Users/Daniel/Documents/Development/Transcript-Data/Test/admin/0102HD41.txt')
doc1 = str(f.read())

f = open('/Users/Daniel/Documents/Development/Transcript-Data/Test/admin/0107HD42.txt')
doc2 = str(f.read())

f = open('/Users/Daniel/Documents/Development/Transcript-Data/Test/admin/0107HD40.txt')
doc3 = str(f.read())

# Create a string to use to test the similarity scoring

train_string = 'By these proceedings for judicial review the Claimant seeks to challenge the decision of the Defendant dated the 23rd of May 2014 refusing the Claimant’s application of the 3rd of January 2012 for naturalisation as a British citizen'

# Construct the training set as a list
train_set = [train_string, doc1, doc2, doc3]

# Set up the vectoriser, passing in the stop words
tfidf_vectorizer = TfidfVectorizer(stop_words=stopWords)

# Apply the vectoriser to the training set
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)

# Print the score
print ("\nSimilarity Score [*] ",cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train))

Converting Dates in Python

Dates are one of those annoying things that can be expressed in many different formats. For example, the date at the time of writing is 28 June 2017. I can express today's date in any number of ways, including:

  • June 28, 2017 
  • 28/06/2017
  • 28/06/17
  • 28/6/17
  • 2017-06-28

When you're working with dates in your Python projects, chances are you'll eventually need to wrangle a date in one format into another, so here's an example I came across in my own code recently.

I was parsing RSS feeds published by the British and Irish Legal Information Institute. In the feeds, dates are expressed as, for example, '28 June 2017'. For my purposes, I needed to convert the date into YYYY-mm-dd format (e.g. 2017-06-28). Here's how I dealt with it:

I was capturing the date like so:

date_in_feed = '28 June 2017'

I then set up the converter, which uses time.strptime (so the time module needs to be imported), passing in the date I've captured as the first argument and its format as the second argument:

import time

converter = time.strptime(date_in_feed, '%d %B %Y')

Finally, to get the date the way I want, I use strftime, passing in the desired format as the first argument and my converter object (above) as the second argument.

converted_date = time.strftime('%Y-%m-%d', converter)

In light of the myriad alternative date formats you may be dealing with, I think you're best off looking at the datetime documentation, but I'll add more examples as and when I come across them.
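For what it's worth, the datetime module will do the same conversion in a single line via datetime.strptime(); a quick sketch on the same example date:

from datetime import datetime

converted_date = datetime.strptime('28 June 2017', '%d %B %Y').strftime('%Y-%m-%d')
print (converted_date) # 2017-06-28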

Building a Scatterplot with Pandas and Seaborn

Pandas and Seaborn go together like lemon and lime. In the code below, we're using Pandas to construct a dataframe from a CSV file and Seaborn (which sits on top of matplotlib and makes it look a million times better) to handle the visualisation end of things.

The dataframe consists of three columns, passiveYear, activeYear and Vala, where:

activeYear = the year of a case that considered an earlier case

passiveYear = the year of a case which has itself been considered

Vala = the type of consideration the active case meted out against the passive case.

Code

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Parse CSV, bring in the columns and drop null values
df = pd.read_csv('hoad.csv', usecols=['Vala', 'passiveYear', 'activeYear']).dropna()

# Build a grid consisting of a chart for each Vala type

grid = sns.FacetGrid(df, col="Vala", hue="Vala", col_wrap=3, size=3)

# Draw a horizontal line to show the rough midway point along the y axis 
grid.map(plt.axhline, y=1907, ls=":", c=".5")

# Plot where x=active year and y=passiveyear
grid.map(plt.scatter, "activeYear", "passiveYear", marker="o", alpha=0.5)

# Adjust the tick positions and labels
grid.set(xticks=[1800,2015], yticks=[1800,2015],
         xlim=(1955, 2015), ylim=(1800, 2015))

# Adjust the arrangement of the plots
grid.fig.tight_layout(w_pad=1)

plt.show()

This code yields the following visualisation:

Naive Bayes Document Classifier with Scikit-Learn

The following code demonstrates a relatively simple example of a Naive Bayes classifier applied to a small batch of case law.

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import numpy as np
from sklearn import datasets
from pprint import pprint
from sklearn.model_selection import train_test_split
from sklearn import svm

# Declare the categories
categories = ['Crime', 'Family']

# Load the dataset
docs_to_train = sklearn.datasets.load_files("/Users/danielhoadley/Documents/Development/Python/Test_Data", description=None, categories=categories,
                                            load_content=True, shuffle=True, encoding='utf-8', decode_error='strict', random_state=0)

train_X, test_X, train_y, test_y = train_test_split(docs_to_train.data,
                               docs_to_train.target,
                               test_size = 3)
print (len(docs_to_train.data))

print (train_X)

# Vectorise the dataset

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(docs_to_train.data)

# Fit the estimator and transform the vector to tf-idf

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape


tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

# Train the naive Bayes classifier

clf = MultinomialNB().fit(X_train_tfidf, docs_to_train.target)

docs_new = ['The defendant used a knife.', 'This court will protect vulnerable adults', 'The appellant was sentenced to seven years']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

# Print the results

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, docs_to_train.target_names[category]))

This renders the following output:

'The defendant used a knife.' => Crime
'This court will protect vulnerable adults' => Family
'The appellant was sentenced to seven years' => Crime
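As an aside, Pipeline is imported above but never used; the vectoriser, tf-idf transformer and classifier can be chained into a single object, which keeps the training and prediction steps tidier. A rough sketch of the same steps, assuming the same docs_to_train and docs_new as above:

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_clf.fit(docs_to_train.data, docs_to_train.target)
predicted = text_clf.predict(docs_new)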

Tokenize text file into sentences with Python

I recently needed to split a document into sentences in a way that handled most, if not all, of the annoying edge cases. After a frustrating period trying to get a snippet I found on Stackoverflow to work, I finally figured it out:

import nltk.data
import codecs
import os

doc = codecs.open('path/to/text/file/text.txt', 'r', 'utf-8')
content = doc.read()

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

print ('\n-----\n'.join(tokenizer.tokenize(content)))
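Note that the punkt model isn't bundled with NLTK by default, so if nltk.data.load() complains that it can't find tokenizers/punkt/english.pickle, download it first:

import nltk
nltk.download('punkt')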

Using grep to search in files

I recently needed to find out which files in a directory contained a particular string of text. After fiddling around with a Python script for far longer than I ought to have done, I realised I could achieve my goal using grep at the command line. 

For example, say I have a folder full of text files and I want to know which of them contain the string 'CRIMINAL DIVISION', the following command will bring back the list I need at the console:

grep -R "CRIMINAL DIVISION" *.txt 

Loading own text data into Scikit

A quick note on how to load a custom text data set into Scikit-Learn. 

import sklearn
from sklearn import datasets
from pprint import pprint 

docs_to_train = sklearn.datasets.load_files("path/to/docs/to/train", description=None, categories=None, load_content=True, shuffle=True, encoding='utf-8', decode_error='strict', random_state=0)

pprint(list(docs_to_train.target_names))
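It's worth knowing that load_files() expects the directory you point it at to contain one sub-folder per category, with each document sitting inside the relevant sub-folder; the folder names become the target names. So the layout might look something like this (hypothetical folder and file names):

path/to/docs/to/train/
    crime/
        case_1.txt
        case_2.txt
    family/
        case_3.txt
        case_4.txt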

Some useful links:

http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html

http://scikit-learn.org/stable/datasets/twenty_newsgroups.html