Rapid Keyword Extraction of Donoghue v Stevenson

Sometimes it would be really handy to be able to quickly and accurately extract keywords from a large corpus of documents. It is quite easy to foresee such a use-case arising in legal publishing, for example. 

RAKE (Rapid Keyword Extraction), is a Python natural language processing module that goes a long way in dealing with this use-case. 

I was interested in putting RAKE to the test and thought I'd pit the algorithm against what is perhaps to most well known piece of case law in the common law world: Donoghue v Stevenson (of snail and ginger beer fame). 

What follows is the basic "working out" of the code and the results of the first pass. For anyone interested in replicating this experiment or doing some keyword extraction of their own, see this excellent tutorial - you'll see that my own code follows it closely.

IMPORT THE RELEVANT LIBRARIES

import rake 
import operator

INITIALISE RAKE

rake_object = rake.Rake("smartstoplist.txt", 5, 5, 7)

This line of code does the following:

  • Creates a RAKE object that extracts keywords where (i) each word has at least 5 characters; (ii) each phrase must have at least 5 words; and (iii) each keyword must appear in the text at least 7 times
  • Hits the text file with a list of stop words to remove textual noise

GET THE TEXT

Now we open the text file (in this test, I've saved the judgment in Donoghue as a text file) and save it in a variable:

judgment = open("dono.txt","r") 
text = judgment.read()

RUN RAKE AND PRINT THE KEYWORDS

Now we're ready to run RAKE over the text to get the keywords:

keywords = rake_object.run(text) 
print (keywords)

THE OUTPUT

The following keywords (along with their scores) were returned:

[('give rise', 4.300000000000001), ('common law', 4.184313725490196), ('duty owed', 4.154061624649859), ('ordinary care', 4.115278543849972), ('reasonable care', 4.093482554312047), ('skivington lr 5', 4.050000000000001), ('lake & elliot', 4.0), ('pender 11 qb', 3.966666666666667), ('present case', 3.7993197278911564), ('defective', 1.7619047619047619), ('present', 1.7380952380952381), ('principles', 1.7333333333333334), ('dangerous', 1.6491228070175439), ('exercise', 1.588235294117647), ('cases', 1.5875), ('bottles', 1.5833333333333333), ('liability', 1.5789473684210527), ('relationship', 1.5555555555555556), ('court', 1.5365853658536586), ('supplying', 1.5), ('appears', 1.4761904761904763), ('principle', 1.4736842105263157), ('allowed', 1.4545454545454546), ('party', 1.4375), ('nature', 1.4210526315789473), ('warranty', 1.4166666666666667), ('goods', 1.4090909090909092), ('thing', 1.4090909090909092), ('articles', 1.4), ('condition', 1.4), ('appellant', 1.3953488372093024), ('injured', 1.3863636363636365), ('alleged', 1.375), ('bought', 1.3636363636363635), ('stated', 1.3636363636363635), ('examination', 1.3636363636363635), ('opportunity', 1.3636363636363635), ('appeal', 1.3333333333333333), ('support', 1.3333333333333333), ('defect', 1.3333333333333333), ('decided', 1.3333333333333333), ('relation', 1.3333333333333333), ('bottle', 1.3225806451612903), ('matter', 1.3125), ('authorities', 1.3125), ('injury', 1.3076923076923077), ('carelessness', 1.3076923076923077), ('judgment', 1.3055555555555556), ('proposition', 1.3043478260869565), ('recover', 1.3), ('referred', 1.3), ('circumstances', 1.2972972972972974), ('supplied', 1.2857142857142858), ('found', 1.2857142857142858), ('based', 1.2777777777777777), ('defendant', 1.2666666666666666), ('liable', 1.263157894736842), ('article', 1.26), ('manufactured', 1.25), ('lordships', 1.25), ('danger', 1.25), ('means', 1.25), ('poison', 1.25), ('inspection', 1.2307692307692308), ('purchaser', 1.2272727272727273), ('george', 1.2272727272727273), ('person', 1.2222222222222223), ('courts', 1.2222222222222223), ('house', 1.2105263157894737), ('plaintiff', 1.2096774193548387), ('chattel', 1.2), ('decision', 1.1935483870967742), ('entitled', 1.1818181818181819), ('authority', 1.1666666666666667), ('vendor', 1.1666666666666667), ('dicta', 1.1666666666666667), ('premises', 1.1538461538461537), ('repair', 1.1538461538461537), ('question', 1.1515151515151516), ('pursuer', 1.1428571428571428), ('manufacturer', 1.1384615384615384), ('facts', 1.1333333333333333), ('persons', 1.1333333333333333), ('subject', 1.125), ('class', 1.125), ('scotland', 1.125), ('evidence', 1.125), ('manufacturers', 1.125), ('defender', 1.125), ('contents', 1.1176470588235294), ('words', 1.1), ('longmeid', 1.1), ('holliday 6', 1.1), ('exist', 1.1), ('consequence', 1.1), ('negligence', 1.0985915492957747), ('contract', 1.0918367346938775), ('difficult', 1.0833333333333333), ('proved', 1.0833333333333333), ('respect', 1.0833333333333333), ('respondent', 1.08), ('consumer', 1.0789473684210527), ('proof', 1.0714285714285714), ('regard', 1.0714285714285714), ('manufacture', 1.0666666666666667), ('knowledge', 1.0666666666666667), ('england', 1.0588235294117647), ('langridge', 1.0555555555555556), ('action', 1.0476190476190477), ('opinion', 1.0357142857142858), ('lords', 1.0), ('ginger', 1.0), ('retailer', 1.0), ('result', 1.0), ('neglect', 1.0), ('division', 1.0), ('ground', 1.0), ('fraud', 1.0), ('judgments', 1.0), ('parke', 1.0), ('levy 2', 1.0), ('winterbottom', 1.0), ('wright 10', 1.0), ('stranger', 1.0), ('coach', 1.0), ('reason', 1.0), ('blacker', 1.0), ('breach', 1.0), ('skill', 1.0), ('parties', 1.0), ('brett', 1.0), ('heaven', 1.0), ('point', 1.0), ('treated', 1.0), ('property', 1.0), ('purpose', 1.0), ('thought', 1.0), ('existence', 1.0), ('pointed', 1.0), ('argument', 1.0), ('defendants', 1.0), ('hamilton', 1.0), ('contention', 1.0), ('mullen', 1.0), ('barr &', 1.0), ('defenders', 1.0), ('members', 1.0), ('remote', 1.0), ('bridge', 1.0)]

I was fairly chuffed with these results given it was the first attempt. The key seems to be getting the right balance of parameters when setting the object up. But, it's good to see terms like duty owed and reasonable care appearing at the top of the results. 

It definitely needs some fine tuning and probably an expansion of the stop list, but it's a good start.