Python

Login and navigate a website automatically with Selenium

Selenium is an incredible Python package for automating tasks in the browser. Essentially, Selenium can be used to script interaction with a website by taking control of the browser using Python.

This example demonstrates how to complete a login form and navigate to various pages behind the login page using just a few of the many techniques available in the Selenium toolbox. 

This example assumes you already have the relevant browser driver (e.g. ChromeDriver) installed and Selenium itself installed (pip install selenium).

Dependencies

from selenium import webdriver

driver = webdriver.Chrome()

Target the first page 

Give the driver the starting URL and check you've landed where you should by running an assertion on the text in the title of the page:

driver.get("https://ilovefluffycats.com/authentication/signon")

assert "cats" in driver.title

Complete the username and password fields

Find the username field by its id in the HTML markup (e.g. id="uid") and the password field by its name attribute (e.g. name="pwd"):

username = driver.find_element_by_id("uid")
username.clear()
username.send_keys("mrcats")

password = driver.find_element_by_name("pwd")
password.clear()
password.send_keys("catskillz")

Click the login button

Now we need to submit the login credentials by clicking the submit button:

driver.find_element_by_name("submitButton").click()

Click a link on the page based on the link text

This is handy where you know the text of the link you want to target, but there's no unique identifier to reliably grip onto in the markup. Here, we're simply looking for a link with the text "Grumpy cats".

driver.find_element_by_link_text("Grumpy cats").click()

Python Technologies to Try out and Learn (February 2018)

Stuff I'm Currently Learning

NLTK - great library for dealing with text

Applying custom stopwords to a text file

Create inverted list of sentences and files

Splitting a single word into bigrams

Tokenising text files into sentences

Scikit-Learn - heavy duty machine learning library

Scikit-Learn: Cross-validated supervised text classification

Scikit-Learn: Document similarity

Scikit-Learn: Load your own data

Scikit-Learn: Supervised text classification

Django - rapid development web framework (I'm struggling with this one)

Selenium - control a browser with Python

Login and navigate a website with Selenium

Stuff that's on my list of things to learn

  • Pendulum - datetime parsing (the website for this library is gorgeous)
  • AWS Lambda - serverless computation service 

Reverse a string

This is one of those things that sounds quite simple, but seems to generate quite a lot of discussion on the best way to do it. If you're interested in diving into that discussion, take a look at this StackOverflow question and the answers. 

If, however, all you care about is actually reversing a string with Python, here are a couple of ways to do it.

Let's say we have a string:

mystring = 'Hello my name is Daniel'

Method 1

This method uses reversed() and has the benefit of being more readable than Method 2.

print (''.join(reversed(mystring)))

Returns,

leinaD si eman ym olleH

Method 2

This method uses an extended slice, which essentially steps backwards through the string one character at a time. Not as obvious as Method 1, if you ask me.

print (mystring[::-1])

Returns,

leinaD si eman ym olleH

Uniqify a Python List

I came across this really cool blog post outlining various fast ways to remove duplicate values from a Python list. 

Here's the fastest order-preserving example:

def f12(seq):
    return list(dict.fromkeys(seq))

my_list = [1,2,2,2,3,4,5,6,6,6,6]

print (f12(my_list))

Returns,

[1, 2, 3, 4, 5, 6]

This really quick solution appears to have been identified by a chap called Raymond Hettinger, so credit where it's due!
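
If you don't need to preserve the original order, the usual shortcut is simply to cast the list to a set and back again (though note the ordering of the result isn't guaranteed):

print (list(set(my_list)))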

Write to a file

Here's how you write to a file with Python.

Let's say we have a string, myString, the contents of which we want to write to a file. 

myString = "I am going to write this string to a file with Python"

To write myString to a file, we first need to specify the file we want to write to, like so:

file = open('string.txt', 'w')

Then we use the file object's write() method to write the string to the file, closing the file when we're done so the content is flushed to disk:

file.write(myString)
file.close()

And that's it!
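
As an aside, a slightly safer pattern is to open the file in a with block, which closes the file automatically once the block ends:

with open('string.txt', 'w') as f:
    f.write(myString)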


Find most common words in a corpus

This little sample demonstrates several basic text processing steps with a corpus of text files stored in a local directory. 

  • First, we read the corpus of text files into a list
  • Second, we knock out unwanted stuff, like things that aren't actually words and words that only consist of a single character
  • Third, we use standard NLTK stopwords and a list of custom stopwords, to strip out noise from the corpus
  • Finally, we use NLTK to calculate the most common words in each file in the corpus

from __future__ import division
import glob
import os
import re
from nltk import FreqDist
from nltk.corpus import stopwords

# Bring in the default English NLTK stop words
stoplist = stopwords.words('english')

# Define additional stopwords in a string
additional_stopwords = """case law lawful judge judgment court mr justice would evidence mr order 
defendant act decision make two london appeal section lord one applicant mr. may could said also application whether 
made time first r miss give appellant november give fact within point sentence question upon matter 
leave part given must notice public state taken course cannot circumstances j that, offence set 
behalf however plaintiff see set say secretary regard - v claim right appeared second put e way material
view relation effect take might particular however, present court) october b reasons basis far 
referred trial found lord, land consider authority subject necessary considered 0171 see,s 
council think legal shall respect ground three case, crown without 2 relevant and, special business told clear
paragraph person account letter therefore jury th solicitor use years mrs mr provision discretion
matters respondent concerned cases defence reason issue well count argument facts gave proceedings 
position period needs approved used power us limited even either exercise counsel applicants submission
although counsel submitted st need appellants plaintiffs policy thomas making tribunal action entitled affadavit
december strand daniel transcript smith purpose refused offence offences general counts terms grounds conclusion number reasonable 
prosecution home hearing seems defendants educational clarke solicitors criminal following accept place come
already accepted required words local l;ater january provided stage report street september day sought greenwood
rather service accounts page hobhouse courts march third wilcock mind result months came learned appropriate date instructed
form division notes july went bernal official review principle consideration affidavit held lordship another dr different
notes quite royal possible instructed shorthand development amount has months wc respondents took clearly since find
satisfied members later fleet took interest parties name change information co sum ec done provisions party hd paid
"""

# Split the additional stopwords string into individual words and then add
# those words to the NLTK stopwords list
stoplist += additional_stopwords.split()

# Define the files that make up the corpus to be modelled

file_list = glob.glob(os.path.join('/Users/danielhoadley/PycharmProjects/topicvis', '*.txt'))

# Construct an empty list into which the content of each file will be stored as an item

corpus = []

# Read the files

for file_path in file_list:
    with open(file_path) as f_input:
        content = f_input.read()
        only_words = re.sub("[^a-zA-Z]", " ", content) # Remove anything that isn't a 'word'
        no_single = re.sub(r'(?:^| )\w(?:$| )', ' ', only_words).strip() # Remove any words consisting of a single character
        corpus.append(no_single)

# Remove stopwords

texts = [[word for word in document.lower().split() if word not in stoplist] for document in corpus]

# Get the most common words in each text

for text in texts:
    fdist = FreqDist(text)
    print (fdist.most_common(2))

Create a Gensim Corpus for text files in a local directory

This snippet creates a Gensim corpus from text files stored in a local directory:

import os, gensim

def iter_documents(top_directory):
    """Iterate over all documents, yielding a document (=list of utf8 tokens) at a time."""
    for root, dirs, files in os.walk(top_directory):
        for file in filter(lambda file: file.endswith('.txt'), files):
            document = open(os.path.join(root, file)).read() # read the entire document, as one big string
            yield gensim.utils.tokenize(document, lower=True) # or whatever tokenization suits you

class MyCorpus(object):
    def __init__(self, top_dir):
        self.top_dir = top_dir
        self.dictionary = gensim.corpora.Dictionary(iter_documents(top_dir))
        self.dictionary.filter_extremes(no_below=1, keep_n=30000) # check API docs for pruning params

    def __iter__(self):
        for tokens in iter_documents(self.top_dir):
            yield self.dictionary.doc2bow(tokens)

corpus = MyCorpus('/path/to/files') # create the corpus (the dictionary is built internally)
for vector in corpus: # convert each document to a bag-of-word vector
    print (vector)

Strip XML tags out of file

This is a quick and dirty example of using a regular expression to remove XML tags from an XML file.

Suppose we have the following XML, sample.xml:

<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

What we want to do is use Python to strip out the XML element tags, so that we're left with something like this:

Tove
Jani
Reminder
Don't forget me this weekend!

Here's how to do it:

import re

text = re.sub('<[^<]+>', "", open("sample.xml").read())
with open("output.txt", "w") as f:
    f.write(text)
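
For anything beyond quick and dirty work, it's usually more robust to parse the XML properly rather than regex it. Here's a rough sketch of the same job using the standard library's xml.etree.ElementTree, where itertext() walks the text of every element:

import xml.etree.ElementTree as ET

tree = ET.parse("sample.xml")
text = "\n".join(t.strip() for t in tree.getroot().itertext() if t.strip())

with open("output.txt", "w") as f:
    f.write(text)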

Using NLTK to remove stopwords from a text file

Text processing invariably requires that some words in the source corpus be removed before moving on to more complex tasks (such as keyword extraction, summarisation and topic modelling).

The sorts of words to be removed will typically include words that do not of themselves confer much semantic value (e.g. the, it, a, etc). The task in hand may also require additional, specialist words to be removed. This example uses NLTK to bring in a list of core English stopwords and then adds additional custom stopwords to the list. 

from nltk.corpus import stopwords

# Bring in the default English NLTK stop words
stoplist = stopwords.words('english')

# Define additional stopwords in a string
additional_stopwords = """case judge judgment court"""

# Split the additional stopwords string into individual words and then add
# those words to the NLTK stopwords list
stoplist += additional_stopwords.split()

# Open a file and read it into memory
file = open('sample.txt')
text = file.read()

# Apply the stoplist to the text
clean = [word for word in text.split() if word not in stoplist]

It's worth looking at a couple of discrete aspects of this code to see what's going on.

The stoplist object is storing the NLTK English stopwords as a list:

stoplist = stopwords.words('english')

print (stoplist)

>>> ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your'...]

Then, we're adding our additional stopwords as a single string object, additional_stopwords, and using split() to break that string down into individual tokens in a list object:

stoplist += additional_stopwords.split()

The above line of code updates the original stoplist object with the additional stopwords.

The text being passed in is a simple text file, which reads:

this is a case in which a judge sat down on a chair

When we pass the text through our list comprehension, our output is:

print (clean)

>>> ['sat', 'chair']


Read a file

Reading the contents of a file in Python is straightforward and there are a couple of nice methods that cater for different use cases.

OPEN THE FILE

Suppose we want to read a file called my_text.txt. First, we open the file:

f = open('my_text.txt', 'r')

We now have the file as an object, f.

READ THE ENTIRE FILE INTO A STRING

For most use cases, it's enough to simply read the entire contents of the file into a string. We can do this by using Python's read() method. 

content = f.read()
print (content)

READ ALL OF THE LINES IN THE FILE INTO A LIST

Sometimes, you're going to want to deal with the file you're working with at line level. Fortunately, Python's readlines() method is available. The readlines() method stores each line in the file as an item in a list.

content = f.readlines()

READ A SPECIFIC LINE IN THE FILE

There may be times when you want to read a specific line in the file, which is what the readline() method can be used for. Each call to readline() returns the next line in the file (its optional argument is a size limit in characters, not a line number).

To access the first line in the file:

content = f.readline()

To access the second line in the file, simply call readline() again:

content = f.readline()
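
Another option, which avoids juggling successive readline() calls altogether, is to loop over the file object itself, which yields one line per iteration:

with open('my_text.txt', 'r') as f:
    for line in f:
        print (line.strip())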

Regex to match upper case words

Here's a short demonstration of how to use a regular expression to identify UPPERCASE words in a bunch of text files. 

The goal in this particular snippet is to open and read all of the .rtf files in a given directory and identify only the UPPERCASE words appearing in each file.

import os
import re

directory = '/path/to/files'

regex = r"\b[A-Z][A-Z]+\b"

for filename in os.listdir(directory):
    if filename.endswith(".rtf"):
        with open(os.path.join(directory, filename), 'r') as f:
            transcript = f.read()
            matches = re.finditer(regex, transcript)
            for match in matches:
                print (match[0])

Calculating cosine similarity between documents

This script calculates the cosine similarity between several text documents. At scale, this method can be used to identify similar documents within a larger corpus.

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

# Bring in standard stopwords
stopWords = stopwords.words('english')

print ("\nCalculating document similarity scores...")

# Open and read a bunch of files 
f = open('/Users/Daniel/Documents/Development/Transcript-Data/Test/admin/0102HD41.txt')
doc1 = str(f.read())

f = open('/Users/Daniel/Documents/Development/Transcript-Data/Test/admin/0107HD42.txt')
doc2 = str(f.read())

f = open('/Users/Daniel/Documents/Development/Transcript-Data/Test/admin/0107HD40.txt')
doc3 = str(f.read())

# Create a string to use to test the similarity scoring

train_string = 'By these proceedings for judicial review the Claimant seeks to challenge the decision of the Defendant dated the 23rd of May 2014 refusing the Claimant’s application of the 3rd of January 2012 for naturalisation as a British citizen'

# Construct the training set as a list
train_set = [train_string, doc1, doc2, doc3]

# Set up the vectoriser, passing in the stop words
tfidf_vectorizer = TfidfVectorizer(stop_words=stopWords)

# Apply the vectoriser to the training set
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)

# Print the score
print ("\nSimilarity Score [*] ", cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train))
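
The output is a single row of four scores: the first is the similarity of train_string with itself (always 1.0), and the remaining three are its similarity to doc1, doc2 and doc3 respectively.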

While loop

While loops do something for as long as a predefined condition is true.

For example, in the snippet below, we have an integer, count, which has been assigned the value of 0.

The while loop block then says that, for as long as the value of count is less than 9, print the value of count. Each iteration of the loop increases the value of count by 1.

The condition that the value of count is less than 9 remains true for nine iterations of the loop, printing the values 0 through 8. At the end of the ninth iteration, count reaches 9, so the next check of the condition fails and the loop completes.

count = 0
while (count < 9):
    print ('The count is at:', count)
    count = count + 1

print ("Romeo done.")

BeautifulSoup: a very simple example

BeautifulSoup is an excellent Python package that makes web scraping comparatively straightforward.

Essentially, the fundamental sequence of steps is as follows:

  1. Define the url of the page you want to scrape
  2. Open the url
  3. Store the content of the page as an object we can do other stuff with.

For example,

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Set the target url
url = 'http://www.canlii.ca'

# Open the url
site = urlopen(url)

# Grab the page contents and store it in an object called soup
soup = BeautifulSoup(site, "lxml")

# Find all <table> elements in the page
table = soup.find_all("table")

# print the table elements
print (table)

Insert a dictionary into MongoDB

MongoDB and Python work so well together because Mongo's BSON document structure essentially mirrors a Python dictionary, where we have a range of key/value pairs. 

Here's a very simple example of how to insert a dictionary into MongoDB:

from pymongo import MongoClient

# Create connection to MongoDB
client = MongoClient('localhost', 27017)
db = client['name_of_database']
collection = db['name_of_collection']

# Build a basic dictionary
d = {'website': 'www.carrefax.com', 'author': 'Daniel Hoadley', 'colour': 'purple'}

# Insert the dictionary into Mongo
collection.insert_one(d)
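
If you want to check the document actually landed, you can read it straight back with find_one(), using any of the keys from the dictionary above:

print (collection.find_one({'author': 'Daniel Hoadley'}))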

Done.

Connect to MongoDB

This snippet demonstrates how to connect to a local instance of MongoDB.

# Import pymongo
from pymongo import MongoClient

# Set up the client, by default MongoDB runs on port 27017
client = MongoClient('localhost', 27017)

# Set the name of the MongoDB database you want to connect to
db = client['name_of_database']

# Set the name of the collection within the database you want to connect to
collection = db['name_of_collection']
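
MongoClient connects lazily, so a quick way to check the connection is actually working is to ask the server for its build info, which will raise an error if nothing is listening on the port:

print (client.server_info())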

 

Converting Dates in Python

Dates are one of those annoying things that can be expressed in many different formats. For example, the date at the time of writing is 28 June 2017. I can express today's date in any number of ways, including:

  • June 28, 2017 
  • 28/06/2017
  • 28/06/17
  • 28/6/17
  • 2017-06-28

When you're working with dates in your Python projects, chances are you'll eventually need to wrangle a date in one format into another, so here's an example I came across in my own code recently.

I was parsing RSS feeds published by the British and Irish Legal Information Institute. In the feeds, dates are expressed as, for example, '28 June 2017'. For my purposes, I needed to convert the date into YYYY-mm-dd format (e.g. 2017-06-28). Here's how I dealt with it:

I was capturing the date like so:

date_in_feed = '28 June 2017'

I then set up the converter using time.strptime() (from the standard library's time module), passing in the date I've captured as the first argument and its format as the second argument:

converter = time.strptime(date_in_feed, '%d %B %Y')

Finally, to get the date the way I want, I use time.strftime(), passing in the desired format as the first argument and the converter object created above as the second argument.

converted_date = time.strftime('%Y-%m-%d', converter)
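
Putting the whole thing together (note this relies on the standard library's time module, so it needs importing):

import time

date_in_feed = '28 June 2017'
converter = time.strptime(date_in_feed, '%d %B %Y')
converted_date = time.strftime('%Y-%m-%d', converter)

print (converted_date)

Returns,

2017-06-28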

Given the myriad of alternative date formats you may be dealing with, I think you're best off looking at the datetime documentation here, but I'll add more examples as and when I come across them.

Building a Scatterplot with Pandas and Seaborn

Pandas and Seaborn go together like lemon and lime. In the code below, we're using Pandas to construct a dataframe from a CSV file and Seaborn (which sits on top of matplotlib and makes it look a million times better) is handling the visualisation end of things.

The dataframe consists of three columns, passiveYear, activeYear and Vala, where:

activeYear = the year of a case that considered an earlier case

passiveYear = the year of a case which has itself been considered

Vala = the type of consideration the active case meted out against the passive case.

Code

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Parse CSV, bring in the columns and drop null values
df = pd.read_csv('hoad.csv', usecols=['Vala', 'passiveYear', 'activeYear']).dropna()

# Build a grid consisting of a chart for each Vala type

grid = sns.FacetGrid(df, col="Vala", hue="Vala", col_wrap=3, size=3)

# Draw a horizontal line to show the rough midway point along the y axis 
grid.map(plt.axhline, y=1907, ls=":", c=".5")

# Plot where x=active year and y=passiveyear
grid.map(plt.scatter, "activeYear", "passiveYear", marker="o", alpha=0.5)

# Adjust the tick positions and labels
grid.set(xticks=[1800,2015], yticks=[1800,2015],
         xlim=(1955, 2015), ylim=(1800, 2015))

# Adjust the arrangement of the plots
grid.fig.tight_layout(w_pad=1)

plt.show()

This code yields the following visualisation: