Delete (drop) a column from a Pandas dataframe

Here's how to delete a column from a Pandas dataframe (assume we already have a dataframe, df):

df.drop('name of column', axis=1)

The axis=1 argument simply signals that we want to delete a column as opposed to a row.

To delete multiple columns in one go, just pass a list in, like this:

df.drop(['name of column', 'name of another column'], axis=1)
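
As a quick illustration (the dataframe and column names here are made up), note that drop() returns a new dataframe rather than modifying df in place, so you'll usually want to assign the result back (or pass inplace=True):

import pandas as pd

# A small, made-up dataframe to demonstrate with
df = pd.DataFrame({'animal': ['Wolf', 'Fox'], 'legs': [4, 4], 'fins': [0, 0]})

# drop() returns a new dataframe, so assign the result back to keep the change
df = df.drop('fins', axis=1)

# Drop several columns in one go by passing a list
df = df.drop(['legs'], axis=1)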

Marking up references to UK legislation in unstructured text

This is a quick example of a reliable (but not particularly extensible) implementation for marking up references to UK primary legislation in an unstructured input text.

The implementation is reliable because it uses a list of statute short titles derived from legislation.gov.uk and passes these into the source text as straightforward regular expression search patterns. This, however, makes the implementation a bit of a bore to extend, because the code does not account for new pieces of legislation added to the statute book since the code was written!

Set up environment

from bs4 import BeautifulSoup
import urllib3
import pandas as pd
import csv
import re

http = urllib3.PoolManager()

Core objects and URL creation

Create objects that will be used throughout the script and build a list of target legislation.gov.uk URLs.

TARGETS = []
STATUTES = []

#This is a little bit hacky. I'm essentially running an empty search on
#legislation.gov.uk and iterating over each page of the results.
scope = range(1, 289)
base_url = "http://www.legislation.gov.uk/primary?"

for page in scope:
    target_url = base_url + "page=" + str(page)
    TARGETS.append(target_url)

Perform the scrape

Scrape each target, pulling in the required text content from the legislation.gov.uk results table.

for target in TARGETS:
    response = http.request('GET', target)
    soup = BeautifulSoup(response.data, "html.parser")

    td = soup.find_all('td')
    for i in td:

        children = i.findChildren("a" , recursive=True)
        for child in children:
            statute_name = child.text
            STATUTES.append(statute_name)
STATUTES[:5]

Output >>>

['Domestic Gas and Electricity (Tariff Cap) Act 2018',
 '2018\xa0c. 21',
 'Northern Ireland Budget Act 2018',
 '2018\xa0c. 20',
 'Haulage Permits and Trailer Registration Act 2018']

Clean the captured data and store it

The scrape pulls in unwanted material in the form of chapter numbers owing to lack of precision in the source markup. Unwanted captures are dropped using a regular expression and the data is stored in a pd.DataFrame, df.

df = pd.DataFrame()
df['Statute_Name'] = STATUTES
df = df[df['Statute_Name'].str.contains(r'\d{4}\s+([a-z]|[A-Z])') == False]
df.to_csv('statutes.csv')
df.head()

Statute_Name
0 Domestic Gas and Electricity (Tariff Cap) Act ...
2 Northern Ireland Budget Act 2018
4 Haulage Permits and Trailer Registration Act 2018
6 Automated and Electric Vehicles Act 2018
8 Supply and Appropriation (Main Estimates) Act ...

Sample text to apply the legislation extractor against

text_block = """

Section 101 of the Criminal Justice Act 2003 is interesting, much like section 3 of the Fraud Act 2006.
The Police and Criminal Evidence Act 1984 is also a real favourite of mine.

"""

Get matching statutes

To identify matching statutes, the list of statutes created from the scrape is iterated over. The name of each statute forms the basis of the search expression, with matches stored in a list, MATCHES.

MATCHES = []

for statute in df['Statute_Name']:
    my_regex = re.escape(statute)
    match = re.search(my_regex, text_block)

    if match is not None:
        MATCHES.append(match[0])
        print (match[0])
Fraud Act 2006
Criminal Justice Act 2003
Police and Criminal Evidence Act 1984

Mark up the matched statutes in the source text

The aim here is to enclose the captured statutes in <statute> tags. To do this, we need to make multiple substitutions in a single string on a single pass.

The first step is to cast the matches into a dictionary object, d, where the key is the match and the value is the replacement string.

d = {}

for match in MATCHES:
    opener = '<statute>'
    closer = '</statute>'
    replacement = opener + match + closer
    d[match] = replacement

print (d)
{'Fraud Act 2006': '<statute>Fraud Act 2006</statute>', 'Criminal Justice Act 2003': '<statute>Criminal Justice Act 2003</statute>', 'Police and Criminal Evidence Act 1984': '<statute>Police and Criminal Evidence Act 1984</statute>'}

The single pass substitution is handled in the following function, replace(). replace() takes two arguments: the source text and the dictionary of substitutions.

def replace(string, substitutions):

    substrings = sorted(substitutions, key=len, reverse=True)
    regex = re.compile('|'.join(map(re.escape, substrings)))
    return regex.sub(lambda match: substitutions[match.group(0)], string)

output = replace(text_block, d)
str(output).strip('\n')
'Section 101 of the <statute>Criminal Justice Act 2003</statute> is interesting, much like section 3 of the <statute>Fraud Act 2006</statute>.\nThe <statute>Police and Criminal Evidence Act 1984</statute> is also a real favourite of mine.'

Find child elements using BeautifulSoup

Suppose you're attempting to scrape a slab of HTML that looks a bit like this:

<tr class="oddRow">
  <td>
  <a href="/ukpga/2018/21/contents/enacted">Domestic Gas and Electricity (Tariff Cap) Act 2018</a>
  </td>
  <td>
    <a href="/ukpga/2018/21/contents/enacted">2018 c. 21</a>
  </td>
  <td>UK Public General Acts</td>
</tr>
<tr>
  <td>
    <a href="/ukpga/2018/20/contents/enacted">Northern Ireland Budget Act 2018</a>
  </td>
  <td>
    <a href="/ukpga/2018/20/contents/enacted">2018 c. 20</a>
  </td>

The bit you're looking to scrape is contained in an <a> tag that sits as a child of a <td> tag, e.g. Northern Ireland Budget Act 2018.

Now, for all you know, there are going to be <a> elements all over the page, many of which you have no interest in. Because of this, something like stuff = soup.find_all('a') is no good.

What you really need to do is limit your scrape to only those <a> tags that have a <td> tag as their parent.

Here's how you do it:

td = soup.find_all('td') # Find all the td elements on the page

for i in td:

    # Call .findChildren() on each item in the td list
    children = i.findChildren("a", recursive=True)

    # Iterate over the list of children, accessing the .text attribute on each child
    for child in children:
        what_i_want = child.text
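
An alternative (shown here as a rough sketch, assuming the same soup object as above) is to use BeautifulSoup's select() method with a CSS selector, which expresses the same parent/child constraint more compactly:

# "td a" matches every <a> element that sits inside a <td> element
for child in soup.select("td a"):
    what_i_want = child.text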

Find rows containing specific values in a Pandas Dataframe

Suppose we have a Pandas dataframe, df, with a column called Name.

The data in the column looks like this:

Name
----
Statute of Westminster, The First (1275)
1275 c. 5
Statute of Marlborough 1267 [Waste]
1267 c. 23
The Statute of Marlborough 1267 [Distress]
1267 c. 1

If we wanted to exclude all rows that contain something like 1275 c. 5 from the dataframe, we can use the Pandas str.contains() method in combination with a regular expression, like so:

df = df[df['Name'].str.contains(r'\d{4}\s+([a-z]|[A-Z])') == False]

This is basically saying that df is now equal to all of the rows in df where a match on our regular expression returns False.

Setting == True would have the opposite effect: retaining rows that match the expression and dropping those that do not.
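
Here's a minimal, self-contained sketch of the same idea using the example rows above (the pattern is written as a raw string, and [a-zA-Z] is used in place of the capture group, which matches the same text but avoids a pandas warning about capture groups):

import pandas as pd

df = pd.DataFrame({'Name': [
    'Statute of Westminster, The First (1275)',
    '1275 c. 5',
    'Statute of Marlborough 1267 [Waste]',
    '1267 c. 23',
]})

# Keep only the rows that do NOT match "four digits, whitespace, then a letter"
df = df[df['Name'].str.contains(r'\d{4}\s+[a-zA-Z]') == False]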

List Comprehensions

One particularly nifty feature of the Python language is list comprehensions. List comprehensions provide a really concise way of constructing lists.

One perfectly acceptable way to construct a new list in Python would be as follows:

x = range(10)
myList = [] # this is an empty list

for i in x:
    myList.append(i)

List comprehensions give us a way to achieve the same result with far fewer lines of code, like so:

x = range(10)
myList = [i for i in x]

The use of a list comprehension in this simple example removes the need to (a) create an empty list object and (b) construct a for loop block to iterate over the object whose items we are seeking to add to the empty list.
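
List comprehensions can also carry a condition, which removes the need for an if block inside the loop. For example, to keep only the even numbers:

x = range(10)
evens = [i for i in x if i % 2 == 0]

# evens is now [0, 2, 4, 6, 8]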

Write contents of list to column in Pandas dataframe

Let's say we have the following list:

myList = ['Wolf', 'Fox', 'Shark']

To write the contents of myList to a column in a Pandas dataframe, we'd do the following:

import pandas as pd # Bring in Pandas

myList = ['Wolf', 'Fox', 'Shark'] 

# Create empty dataframe with a column named 'Animals'

df = pd.DataFrame(columns=['Animals'])

# Write myList to the Animals column

df['Animals'] = myList
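
One thing to bear in mind (a general pandas constraint rather than anything specific to this example): if the dataframe already contains rows, the list being assigned must have the same length as the dataframe's index. The extra column and values below are purely illustrative:

df = pd.DataFrame({'Animals': ['Wolf', 'Fox', 'Shark']})

# This works because the list has as many items as the dataframe has rows
df['Sounds'] = ['Howl', 'Bark', 'Silence']

# Assigning a list of a different length would raise a ValueError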

Query Neo4j graph database with Requests

This snippet provides a quick example of how to send a simple Cypher query to Neo4j with Python:

import requests

url = "http://localhost:7474/db/data/cypher"

payload = "{ \"query\" : \"MATCH p=()-[r:CONSIDERED_BY]->() RETURN p LIMIT 25\",\n  \"params\" : { } }"
headers = {
    'authorization': "Basic bmVvNGo6Zm9vYmFy",
    'cache-control': "no-cache"
    }

response = requests.request("POST", url, data=payload, headers=headers)

print(response.text)

Note that the 'authorization': "Basic bmVvNGo6Zm9vYmFy" bit in the headers {} section is a Base64 encoded representation of the username and password of my local Neo4j instance: neo4j:foobar

You can encode your own Neo4j username:password combination here: https://www.base64encode.org/
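
If you'd rather not rely on a website, the same value can be produced with Python's built-in base64 module (the credentials below are just the example username:password pair above):

import base64

credentials = "neo4j:foobar"
encoded = base64.b64encode(credentials.encode("utf-8")).decode("ascii")

print(encoded)  # bmVvNGo6Zm9vYmFy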

Regex to find references to legislation in a block of text

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(?=((?<![A-Z][a-z])(([A-Z][a-z]+[\s-][A-Z][a-z]*\s)(Act|Order))(\s\d{4})|(([A-Z][a-z]+[\s-][A-Z][a-z]+[\s-][A-Z][a-z]*\s)(Act|Order))|(([A-Z][a-z]+[\s-][A-Z][a-z]+[\s-][A-Z][a-z]+[\s-][A-Z][a-z]*\s)(Act|Order))\s\d{4}))"

test_str = "The claimant was released on licence after serving part of an extended sentence of imprisonment. Subsequently the Secretary of State revoked the claimant’s licence and recalled him to prison pursuant to section 254 of the Criminal Justice Act 2003[1] on the grounds that he had breached two of the conditions of his licence.  Police And Criminal Evidence Act 1984The Secretary of State referred the matter to the Parole Board, providing it with a dossier which contained among other things material which had been prepared for, but not used in, the claimant’s trial in the Crown Court. The material contained allegations of a number of further offences in relation to which the claimant had not been convicted, no indictment in relation to them having ever been pursued. The claimant, relying upon the guidance contained in paragraph 2 of Appendix Q to Chapter 8 of Prison Service Order 6000[2], submitted to the board that since the material contained pre-trial prosecution evidence it ought to be excluded from the dossier placed before the panel of the board responsible for considering his release. The board determined that it had no power to exclude the material and that it would be for the panel to determine questions of relevance and weight in relation to it. The claimant sought judicial review of the board’s decision and the Human Rights Act 1998. "

matches = re.finditer(regex, test_str)

for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

Change a character in a string

Let's say you have the following string (which just so happens to be a URL, which isn't important):

https://www.iclr.co.ui/ic/1981007549

There's a problem with this URL: the .uk component of the address has mistakenly been entered as .ui. Let's fix this with Python.

First, we take the URL and pass it as a string into the list() function.

s = list('https://www.iclr.co.ui/ic/1981007549')

Then, we identify where the offending character (in this case, i) appears (it's the 22nd character). Bearing in mind that Python is zero-indexed, we can target it at index 21 in the list and change the value to k like so:

s[21] = 'k'

Now that the character has been corrected, we can join() the list back together to form the corrected string:

s = "".join(s)

Load JSON files into Elasticsearch

The following example provides a simple guide for loading JSON files into Elasticsearch using the official elasticsearch Python client.

Import dependencies

import requests, json, os
from elasticsearch import Elasticsearch

Set the path to the directory containing the JSON files to be loaded

directory = '/path/to/files/'

Connect to the Elasticsearch server

By default, the Elasticsearch instance will listen on port 9200

res = requests.get('http://localhost:9200')
print (res.content)
es = Elasticsearch([{'host': 'localhost', 'port': '9200'}])

Create an index value object

Because I want Elasticsearch to use a bog-standard integer as the unique _id for each document being loaded, I'm setting this up now, outside the for loop I'm going to use to iterate over the JSON files for loading into Elasticsearch.

i = 1

Iterate over each JSON file and load it into Elasticsearch

for filename in os.listdir(directory):
    if filename.endswith(".json"):
        f = open(os.path.join(directory, filename))
        docket_content = f.read()

Load each file into an Elasticsearch index

The following line is the line that's actually sending the content into Elasticsearch.

        es.index(index='myindex', ignore=400, doc_type='docket', id=i, body=json.loads(docket_content))

        i = i + 1

There are a few things worth pointing out here:

  • index= is the name of the index we're creating; this can be anything you like
  • ignore=400 is flagging that I want the loader to ignore instances in which Elasticsearch complains about the format of any of the fields in the source JSON data (date fields, I get the feeling, are a common offender here). I just want Elasticsearch to receive the data as is without second-guessing it.
  • doc_type is just a label we're assigning to each document being loaded
  • id=i is the unique index value being assigned to each document as it's loaded. You can leave this out as Elasticsearch will apply its own id.
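
Putting the pieces above together, the whole loop looks something like this (a sketch that assumes the same directory, es connection, index name and doc_type used above):

i = 1

for filename in os.listdir(directory):
    if filename.endswith(".json"):
        with open(os.path.join(directory, filename)) as f:
            docket_content = f.read()

        # Send the parsed JSON content of each file into the 'myindex' index
        es.index(index='myindex', ignore=400, doc_type='docket',
                 id=i, body=json.loads(docket_content))

        i = i + 1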

Convert XML to JSON

I recently needed to convert a large collection of fairly complex XML files into JSON format to make them ready for loading into an Elasticsearch cluster.

After a bit of searching around, I found a really nice Python library called xmltodict, which worked perfectly.

You can download xmltodict using pip:

$ pip3 install xmltodict

xmltodict worked well with all the XML files I threw at it, so I'm not going to talk about the form and structure of the XML I was working with. The code below merely assumes that you have a directory of XML files and that you'd like to convert each of them to JSON format and write the result out to a new JSON file.

import xmltodict, json
import os

#path to the folder holding the XML
directory = '/path/to/folder'

#iterate over the XML files in the folder
#iterate over the XML files in the folder
for filename in os.listdir(directory):
    if filename.endswith(".xml"):
        with open(os.path.join(directory, filename)) as f:
            XML_content = f.read()

        #parse the content of each file using xmltodict
        x = xmltodict.parse(XML_content)
        j = json.dumps(x)

        print (filename)

        #write the JSON out to a new file alongside the source XML
        filename = filename.replace('.xml', '')
        with open(os.path.join(directory, filename + '.json'), 'w') as output_file:
            output_file.write(j)


Calculate mean, median and mode

Generate some random data to work with

First, we'll generate some random numeric data to work with in the form of a Numpy array

import numpy as np

data = np.random.randint(100, size=(3, 5))

Calculate the mean

We can use Numpy's np.mean() function to calculate the mean of the values in the data array:

mean = np.mean(data)
print ("The mean value of the dataset is", mean)
Out: The mean value of the dataset is 61.333333333333336

Calculate the median

Numpy also has an np.median() function, which is deployed like this:

median = np.median(data)
print ("The median value of the dataset is", median)
Out: The median value of the dataset is 80.0

Calculate the mode

Numpy doesn't have a built-in function to calculate the modal value within a range of values, so we use the stats module from the SciPy package.

from scipy import stats

mode = stats.mode(data)
print("The mode is {}".format(mode[0]))
Out: The mode is [[55  6 56 35  7]]

Generate some random data with Numpy

Numpy, Python's pre-eminent mathematics library, makes it easy to generate random numeric data in the form of two-dimensional arrays.

Here's one way of generating some random numeric data with Numpy.

Import Numpy

When importing Numpy, the standard practice is to import it as np.

import numpy as np

Build some random data in the form of a numpy array

Suppose I want to generate some random numeric data that consists of integers ranging from 0 to 99, in the form of an array that has three rows and five columns.

data = np.random.randint(100, size=(3, 5))

This will yield an array that looks something like this:

array([[82,  7, 92, 35,  7],
       [81, 71, 56, 80, 82],
       [55,  6, 96, 87, 83]])