A collection of rough ideas and development notes.
Suppose you have a Pandas dataframe, df, and in one of your columns, Are you a Cat?, you have a slew of NaN values that you'd like to replace with the string No. Here's how to deal with that:
df['Are you a Cat?'].fillna('No', inplace=True)
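If you'd rather avoid inplace=True (which can trip pandas' chained-assignment warnings when used on a single column), assigning the result back works just as well:
# Assign the filled column back instead of filling it in place
df['Are you a Cat?'] = df['Are you a Cat?'].fillna('No')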
Suppose you have a Pandas dataframe, df, and you want to rename the column named Cats to Dogs. Here's how to quickly deal with this:
df = df.rename(columns={'Cats': 'Dogs'})
Here's how to delete a column from a Pandas dataframe (assume we already have a dataframe, df):
df.drop('name of column', axis=1)
The axis=1 argument simply signals that we want to delete a column as opposed to a row.
To delete multiple columns in one go, just pass a list in, like this:
df.drop(['name of column', 'name of another column'], axis=1)
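Note that drop() returns a new dataframe rather than modifying df itself, so assign the result back (or pass inplace=True) if you want to keep the change:
# Either assign the result back...
df = df.drop(['name of column', 'name of another column'], axis=1)
# ...or drop a column in place
df.drop('name of column', axis=1, inplace=True)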
This is a quick example of a reliable (but not a particularly extensible) implementation for marking up references to UK primary legislation in an unstructured input text.
The implementation is reliable because it uses a list of statute short titles derived from legislation.gov.uk and matches these against the source text as straightforward regular expression search patterns. This, however, makes the implementation a bit of a bore to extend, because the code does not account for new pieces of legislation added to the statute book since the list was scraped!
from bs4 import BeautifulSoup
import urllib3
import pandas as pd
import csv
import re
http = urllib3.PoolManager()
Create objects that will be used throughout the script and build a list of target legislation.gov.uk URLs.
TARGETS = []
STATUTES = []
# This is a little bit hacky. I'm essentially running an empty search on
# legislation.gov.uk and iterating over each page in the results.
scope = range(1, 289)
base_url = "http://www.legislation.gov.uk/primary?"
for page in scope:
    target_url = base_url + "page=" + str(page)
    TARGETS.append(target_url)
Scrape each target, pulling in the required text content from the legislation.gov.uk results table.
for target in TARGETS:
    response = http.request('GET', target)
    soup = BeautifulSoup(response.data, "html.parser")
    td = soup.find_all('td')
    for i in td:
        children = i.findChildren("a", recursive=True)
        for child in children:
            statute_name = child.text
            STATUTES.append(statute_name)
STATUTES[:5]
Output >>>
['Domestic Gas and Electricity (Tariff Cap) Act 2018',
'2018\xa0c. 21',
'Northern Ireland Budget Act 2018',
'2018\xa0c. 20',
'Haulage Permits and Trailer Registration Act 2018']
The scrape pulls in unwanted material in the form of chapter numbers owing to a lack of precision in the source markup. Unwanted captures are dropped using a regular expression and the data is stored in a pd.DataFrame, df.
df = pd.DataFrame()
df['Statute_Name'] = STATUTES
df = df[df['Statute_Name'].str.contains(r'\d{4}\s+[A-Za-z]') == False]
df.to_csv('statutes.csv')
df.head()
|   | Statute_Name |
|---|---|
| 0 | Domestic Gas and Electricity (Tariff Cap) Act ... |
| 2 | Northern Ireland Budget Act 2018 |
| 4 | Haulage Permits and Trailer Registration Act 2018 |
| 6 | Automated and Electric Vehicles Act 2018 |
| 8 | Supply and Appropriation (Main Estimates) Act ... |
text_block = """
Section 101 of the Criminal Justice Act 2003 is interesting, much like section 3 of the Fraud Act 2006.
The Police and Criminal Evidence Act 1984 is also a real favourite of mine.
"""
To identify matching statutes, the list of statutes created from the scrape is iterated over. The name of each statute forms the basis of the search expression, with matches stored in a list, MATCHES.
MATCHES = []
for statute in df['Statute_Name']:
    my_regex = re.escape(statute)
    match = re.search(my_regex, text_block)
    if match is not None:
        MATCHES.append(match[0])
        print(match[0])
Fraud Act 2006
Criminal Justice Act 2003
Police and Criminal Evidence Act 1984
The aim here is to enclose the captured statutes in <statute> tags. To do this, we need to make multiple substitutions in a single string on a single pass.
The first step is to cast the matches into a dictionary object, d, where the key is the match and the value is the replacement string.
d = {}
for match in MATCHES:
    opener = '<statute>'
    closer = '</statute>'
    replacement = opener + match + closer
    d[match] = replacement
print(d)
{'Fraud Act 2006': '<statute>Fraud Act 2006</statute>', 'Criminal Justice Act 2003': '<statute>Criminal Justice Act 2003</statute>', 'Police and Criminal Evidence Act 1984': '<statute>Police and Criminal Evidence Act 1984</statute>'}
The single pass substitution is handled in the following function, replace(). replace() takes two arguments: the source text and the dictionary of substitutions.
def replace(string, substitutions):
    substrings = sorted(substitutions, key=len, reverse=True)
    regex = re.compile('|'.join(map(re.escape, substrings)))
    return regex.sub(lambda match: substitutions[match.group(0)], string)
output = replace(text_block, d)
str(output).strip('\n')
'Section 101 of the <statute>Criminal Justice Act 2003</statute> is interesting, much like section 3 of the <statute>Fraud Act 2006</statute>.\nThe <statute>Police and Criminal Evidence Act 1984</statute> is also a real favourite of mine.'
Here's how to read a CSV file into a list:
import csv

with open('statutes.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)
Suppose you're attempting to scrape a slab of HTML that looks a bit like this:
<tr class="oddRow">
  <td>
    <a href="/ukpga/2018/21/contents/enacted">Domestic Gas and Electricity (Tariff Cap) Act 2018</a>
  </td>
  <td>
    <a href="/ukpga/2018/21/contents/enacted">2018 c. 21</a>
  </td>
  <td>UK Public General Acts</td>
</tr>
<tr>
  <td>
    <a href="/ukpga/2018/20/contents/enacted">Northern Ireland Budget Act 2018</a>
  </td>
  <td>
    <a href="/ukpga/2018/20/contents/enacted">2018 c. 20</a>
  </td>
The bit you're looking to scrape is contained in an <a> tag that sits as a child of the <td> tag, i.e. Northern Ireland Budget Act 2018.
Now, for all you know, there are going to be <a> elements all over the page, many of which you have no interest in. Because of this, something like stuff = soup.find_all('a') is no good.
What you really need to do is limit your scrape to only those <a> tags that have a <td> tag as their parent.
Here's how you do it:
td = soup.find_all('td')  # Find all the td elements on the page
for i in td:
    # Call .findChildren() on each item in the td list
    children = i.findChildren("a", recursive=True)
    # Iterate over the list of children, accessing the .text attribute on each child
    for child in children:
        what_i_want = child.text
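If you prefer CSS selectors, BeautifulSoup's select() method gets to the same children in one go (a quick sketch using the same soup object):
# Equivalent approach: select <a> tags that are direct children of <td> tags
for child in soup.select('td > a'):
    what_i_want = child.text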
Suppose we have a Pandas dataframe, df, with a column called Name.
The data in the column looks like this:
Name
----
Statute of Westminster, The First (1275)
1275 c. 5
Statute of Marlborough 1267 [Waste]
1267 c. 23
The Statute of Marlborough 1267 [Distress]
1267 c. 1
If we wanted to exclude all rows that contain something like 1275 c. 5 from the dataframe, we can use the Pandas str.contains() method in combination with a regular expression, like so:
df = df[df['Name'].str.contains(r'\d{4}\s+[A-Za-z]') == False]
This is basically saying that df is now equal to all of the rows in df where a match on our regular expression returns False.
Setting == True would have the opposite effect: retaining rows that match the expression and dropping those that do not.
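For example, to keep only the chapter-number rows instead (a quick sketch against the same df and pattern):
# Keep only the rows that DO match the pattern, i.e. the chapter numbers
chapter_rows = df[df['Name'].str.contains(r'\d{4}\s+[A-Za-z]') == True]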
One particularly nifty feature of the Python language is list comprehensions. List comprehensions provide a really concise way of constructing lists.
One perfectly acceptable way to construct a new list in Python would be as follows:
x = range(10)
myList = []  # this is an empty list
for i in x:
    myList.append(i)
List comprehensions give us a way to achieve the same result with far fewer lines of code, like so:
x = range(10)
myList = [i for i in x]
The use of a list comprehension in this simple example removes the need to (a) create an empty list object and (b) construct a for loop block to iterate over the object whose items we are seeking to add to the empty list.
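List comprehensions also make filtering easy. For instance, to keep only the even numbers from the same range:
x = range(10)
evens = [i for i in x if i % 2 == 0]  # [0, 2, 4, 6, 8]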
This blog post by Angelo Basile provides a really clear and quick run through of creating virtual environments for use in a Jupyter notebook.
https://anbasile.github.io/programming/2017/06/25/jupyter-venv/
Let's say we have the following list:
myList = ['Wolf', 'Fox', 'Shark']
To write the contents of myList to a column in a Pandas dataframe, we'd do the following:
import pandas as pd  # Bring in Pandas

myList = ['Wolf', 'Fox', 'Shark']

# Create empty dataframe with a column named 'Animals'
df = pd.DataFrame(columns=['Animals'])

# Write myList to the Animals column
df['Animals'] = myList
To create a completely empty Pandas dataframe, we do the following:
import pandas as pd

MyEmptydf = pd.DataFrame()
This will create an empty dataframe with no columns or rows.
To create an empty dataframe with three empty columns (columns X, Y and Z), we do:
df = pd.DataFrame(columns=['X', 'Y', 'Z'])
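If you then want to add a row to this otherwise empty dataframe, one quick way is to assign to .loc (the values here are just made up for illustration):
# Add a single row of (made-up) values to the empty dataframe
df.loc[0] = [1, 2, 3]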
Let's say we have a list of stuff, e.g:
myList = ['apples', 'pears', 'bananas',]
We can see that the list currently has three items: apples, pears and bananas. If we want to count the number of items in the list programmatically (as opposed to manually counting the items, which would be daft), we can do this:
In [1]: len(myList)
Out[1]: 3
This snippet provides a very quick example of how to send a very simple Cypher query to Neo4j with Python:
import requests

url = "http://localhost:7474/db/data/cypher"

payload = "{ \"query\" : \"MATCH p=()-[r:CONSIDERED_BY]->() RETURN p LIMIT 25\",\n \"params\" : { } }"

headers = {
    'authorization': "Basic bmVvNGo6Zm9vYmFy",
    'cache-control': "no-cache"
}

response = requests.request("POST", url, data=payload, headers=headers)

print(response.text)
Note that the 'authorization': "Basic bmVvNGo6Zm9vYmFy" bit in the headers {} section is a Base64 encoded representation of the username and password of my local Neo4j instance: neo4j:foobar.
You can encode your own Neo4j username:password combination here: https://www.base64encode.org/
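Alternatively, you can build that header value directly in Python using the standard library's base64 module:
import base64

# Encode "username:password" for the Basic auth header
credentials = base64.b64encode(b"neo4j:foobar").decode("ascii")
print("Basic " + credentials)  # Basic bmVvNGo6Zm9vYmFy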
A few rough drafts I put together for an initial idea for a winter Tube station campaign for the Incorporated Council of Law Reporting.
Each image is a poster-style representation of a significant English case.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?=((?<![A-Z][a-z])(([A-Z][a-z]+[\s-][A-Z][a-z]*\s)(Act|Order))(\s\d{4})|(([A-Z][a-z]+[\s-][A-Z][a-z]+[\s-][A-Z][a-z]*\s)(Act|Order))|(([A-Z][a-z]+[\s-][A-Z][a-z]+[\s-][A-Z][a-z]+[\s-][A-Z][a-z]*\s)(Act|Order))\s\d{4}))"
test_str = "The claimant was released on licence after serving part of an extended sentence of imprisonment. Subsequently the Secretary of State revoked the claimant’s licence and recalled him to prison pursuant to section 254 of the Criminal Justice Act 2003[1] on the grounds that he had breached two of the conditions of his licence. Police And Criminal Evidence Act 1984The Secretary of State referred the matter to the Parole Board, providing it with a dossier which contained among other things material which had been prepared for, but not used in, the claimant’s trial in the Crown Court. The material contained allegations of a number of further offences in relation to which the claimant had not been convicted, no indictment in relation to them having ever been pursued. The claimant, relying upon the guidance contained in paragraph 2 of Appendix Q to Chapter 8 of Prison Service Order 6000[2], submitted to the board that since the material contained pre-trial prosecution evidence it ought to be excluded from the dossier placed before the panel of the board responsible for considering his release. The board determined that it had no power to exclude the material and that it would be for the panel to determine questions of relevance and weight in relation to it. The claimant sought judicial review of the board’s decision and the Human Rights Act 1998. "
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
Let's say you have the following string (which just so happens to be a URL, which isn't important):
https://www.iclr.co.ui/ic/1981007549
There's a problem with this URL: the .uk component of the address has mistakenly been entered as .ui. Let's fix this with Python.
First, we take the URL and pass it as a string into the list() function.
s = list('https://www.iclr.co.ui/ic/1981007549')
Then, we identify where the offending character (in this case, i) appears: it's the 22nd character. Bearing in mind that Python is zero-indexed, we can target it as the 21st item in the list and change its value to k, like so:
s[21] = 'k'
Now that the character has been corrected, we can join() the list back together to form the corrected string:
s = "".join(s)
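For a one-character slip like this, str.replace() is arguably simpler; here's the same correction as a one-liner:
url = 'https://www.iclr.co.ui/ic/1981007549'
url = url.replace('.ui/', '.uk/', 1)  # replace the first occurrence only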
The following example provides a simple guide to loading JSON files into Elasticsearch using the official elasticsearch Python client.
import requests, json, os
from elasticsearch import Elasticsearch
directory = '/path/to/files/'
By default, the Elasticsearch instance will listen on port 9200:
res = requests.get('http://localhost:9200')
print (res.content)
es = Elasticsearch([{'host': 'localhost', 'port': '9200'}])
Because I want Elasticsearch to use a bog-standard integer as the unique _id for each document being loaded, I'm setting this up now, outside the for loop I'm going to use to iterate over the JSON files for loading into Elasticsearch:
i = 1
for filename in os.listdir(directory):
    if filename.endswith(".json"):
        f = open(os.path.join(directory, filename))
        docket_content = f.read()
        # This is the line that's actually sending the content into Elasticsearch
        es.index(index='myindex', ignore=400, doc_type='docket', id=i, body=json.loads(docket_content))
        i = i + 1
There are a few things worth pointing out here:
- index= is the name of the index we're creating; this can be anything you like.
- ignore=400 flags that I want the loader to ignore instances in which Elasticsearch is complaining about the format of any of the fields in the source JSON data (date fields, I get the feeling, are a common offender here). I just want Elasticsearch to receive the data as is without second guessing it.
- doc_type is just a label we're assigning to each document being loaded.
- id=i is the unique index value being assigned to each document as it's loaded. You can leave this out, as Elasticsearch will apply its own id.
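As a quick sanity check once the load has finished, you can pull one of the documents back out by its id (this uses the same myindex and docket names as above):
# Fetch document 1 back out of Elasticsearch and print its source JSON
doc = es.get(index='myindex', doc_type='docket', id=1)
print(doc['_source'])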
I recently needed to convert a large collection of fairly complex XML files into JSON format to make them ready for loading into an Elasticsearch cluster.
After a bit of searching around, I found a really nice Python library called xmltodict, which worked perfectly.
You can download xmltodict using pip:
$ pip3 install xmltodict
xmltodict worked well with all the XML files I threw at it, so I'm not going to talk about the form and structure of the XML I was working with. The code below merely assumes that you have a directory of XML files and that you'd like to convert each of them to JSON format and write the result out to a new JSON file.
import xmltodict, json
import os

# path to the folder holding the XML
directory = '/path/to/folder'

# iterate over the XML files in the folder
for filename in os.listdir(directory):
    if filename.endswith(".xml"):
        f = open(os.path.join(directory, filename))
        XML_content = f.read()
        f.close()
        # parse the content of each file using xmltodict
        x = xmltodict.parse(XML_content)
        j = json.dumps(x)
        print(filename)
        filename = filename.replace('.xml', '')
        output_file = open(filename + '.json', 'w')
        output_file.write(j)
        output_file.close()
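To make the round trip concrete, here's a tiny self-contained sketch (the XML string is made up) showing what xmltodict.parse() hands back:
import xmltodict, json

sample = "<case><name>Donoghue v Stevenson</name><year>1932</year></case>"
parsed = xmltodict.parse(sample)     # a dict-like object keyed by element name
print(json.dumps(parsed, indent=2))  # {"case": {"name": "Donoghue v Stevenson", "year": "1932"}}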
First, we'll generate some random numeric data to work with, in the form of a Numpy array:
import numpy as np
data = np.random.randint(100, size=(3, 5))
mean
We can use Numpy's np.mean() function to calculate the mean of the values in the data array:
mean = np.mean(data)
print ("The mean value of the dataset is", mean)
Out: The mean value of the dataset is 61.333333333333336
median
Numpy also has an np.median() function, which is deployed like this:
median = np.median(data)
print ("The median value of the dataset is", median)
Out: The median value of the dataset is 80.0
mode
Numpy doesn't have a built-in function to calculate the modal value within a range of values, so we use the stats module from the scipy package.
from scipy import stats
mode = stats.mode(data)
print("The mode is {}".format(mode[0]))
Out: The mode is [[55 6 56 35 7]]
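If you'd rather not reach for scipy, np.unique offers a rough alternative; note that this gives a single mode for the flattened array rather than the per-column result above:
# Mode without scipy: count occurrences of every value across the whole array
values, counts = np.unique(data, return_counts=True)
print("The mode is", values[np.argmax(counts)])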
Numpy, Python's pre-eminent mathematics library, makes it easy to generate random numeric data in the form of two-dimensional arrays.
Here's one way of generating some random numeric data with Numpy.
Numpy
When importing numpy, the standard practice is to import it as np.
import numpy as np
numpy array
Suppose I want to generate some random numeric data that consists of integers ranging from 0 to 99 in the form of an array that has 3 rows and five columns.
data = np.random.randint(100, size=(3, 5))
This will yield an array that looks something like this:
array([[82, 7, 92, 35, 7],
[81, 71, 56, 80, 82],
[55, 6, 96, 87, 83]])
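If you want the "random" numbers to be reproducible between runs, seed the generator first:
import numpy as np

np.random.seed(42)  # same seed, same "random" array every time
data = np.random.randint(100, size=(3, 5))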