
Visually clustering case law

I’ve been experimenting with a Python package called Yellowbrick, which provides a suite of visualisers designed to give you insight into a dataset when working on machine learning problems.

One of my side projects at the moment is looking at ways in which an unseen case law judgment can be processed with text classification to determine its high-level subject matter (e.g. Crime, Tort, Human Rights, Occupiers’ Liability, etc.).

I did a quick experiment with Yellowbrick to visualise a small portion of the dataset I’m going to use to train a classifier. And here it is:

t-SNE projection of 14 case law topic clusters using Yellowbrick

The chart above is a t-SNE (t-distributed stochastic neighbour embedding) projection. Each blob on the chart represents a judgment.

I was pretty pleased with this and several quick insights surfaced:

  • Crime may be a bit over-represented in my training data; I’ll need to cut it back a bit

  • Some of the labels in the data may need a bit of merging; for example, “false imprisonment” can probably be handled by the “crime” data (see the sketch after this list)

  • There are a couple of interesting sub-clusters within the crime data (I’m guessing one of the clusters will be evidence and the other sentencing)

  • Human Rights as a topic sits right in the middle of the field between the crime cluster and the clusters of non-criminal topics
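
For the rebalancing and merging mentioned above, a rough sketch of the sort of clean-up pass I have in mind is below. It assumes the corpus object produced by the loader later in this post, and the category names and cap are hypothetical:

MERGE = {'false_imprisonment': 'crime'}   # hypothetical label names
MAX_PER_CATEGORY = {'crime': 300}         # hypothetical cap

# Fold merged labels into their target category
merged_labels = [MERGE.get(label, label) for label in corpus.target]

# Keep at most MAX_PER_CATEGORY documents for any capped category
kept_docs, kept_labels, counts = [], [], {}
for text, label in zip(corpus.data, merged_labels):
    cap = MAX_PER_CATEGORY.get(label)
    if cap is not None and counts.get(label, 0) >= cap:
        continue
    counts[label] = counts.get(label, 0) + 1
    kept_docs.append(text)
    kept_labels.append(label)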

The code

Dependencies

from yellowbrick.text import TSNEVisualizer
from yellowbrick.style import set_palette
from sklearn.feature_extraction.text import TfidfVectorizer
import os

from sklearn.utils import Bunch
from tqdm import tqdm
import matplotlib.pyplot as plt

set_palette('paired')

Load the corpus

def load_corpus(path):
    """
    Loads and wrangles the passed in text corpus by path.
    """

    # Check if the data exists, otherwise download or raise
    if not os.path.exists(path):
        raise ValueError((
            "'{}' dataset has not been downloaded, "
            "use the yellowbrick.download module to fetch datasets"
        ).format(path))

    # Read the directories in the directory as the categories.
    categories = [
        cat for cat in os.listdir(path)
        if os.path.isdir(os.path.join(path, cat))
    ]

    files  = [] # holds the file names relative to the root
    data   = [] # holds the text read from the file
    target = [] # holds the string of the category

    # Load the data from the files in the corpus
    for cat in categories:
        for name in os.listdir(os.path.join(path, cat)):
            files.append(os.path.join(path, cat, name))
            target.append(cat)

            with open(os.path.join(path, cat, name), 'r') as f:
                data.append(f.read())


    # Return the data bunch for use similar to the newsgroups example
    return Bunch(
        categories=categories,
        files=files,
        data=data,
        target=target,
    )

corpus = load_corpus('/Users/danielhoadley/Desktop/common_law_subset')
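
A quick sanity check on what’s been loaded never hurts; something like this will show how many judgments there are per topic:

from collections import Counter

# How many judgments were loaded for each topic label?
print(Counter(corpus.target))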

Vectorise and transform the data

tfidf  = TfidfVectorizer(use_idf=True)
docs   = tfidf.fit_transform(corpus.data)
labels = corpus.target
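
It’s worth glancing at the shape of the resulting document-term matrix before visualising; each row is a judgment and each column a term:

# (number of judgments, number of terms in the vocabulary)
print(docs.shape)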

Generate the visualisation

tsne = TSNEVisualizer(size=(1080, 720), title="Case law clusters")
tsne.fit(docs, labels)
tsne.poof()
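
If you’d rather write the chart to disk than display it interactively, poof() takes an outpath argument (at least in the version of Yellowbrick I was using here), so something like this does the trick:

# Save the projection to a file rather than displaying it
tsne.poof(outpath='case_law_clusters.png')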




Using Scikit-Learn to classify your own text data (the short version)

Last month I posted a lengthy article on how to use Scikit-Learn to build a cross-validated classification model on your own text data. The purpose of that article was to provide an entry point for new Scikit-Learn users who wanted to move away from using the built-in datasets (like 20 Newsgroups) and focus on their own corpora.

I thought it might be useful to post a condensed version of the longer read for people who wanted to skip over the explanatory material and get started with the code.

As before, the objective of the code is as follows. We have a dataset consisting of multiple directories, each containing n text files. Each directory name acts as a descriptive category label for the files contained within (e.g. technology, finance, food). We're going to use this data to build a classifier capable of receiving new, unlabeled text data and assigning it to the best-fitting category.
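
To make that concrete, the expected layout is something like the following (the category and file names here are just placeholders):

labelled_data/
    technology/
        doc_001.txt
        doc_002.txt
    finance/
        doc_003.txt
    food/
        doc_004.txt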

The code

import sklearn
import numpy as np
from glob import glob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity
import joblib

Get paths to labelled data

rawFolderPaths = glob("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/*/")

print ('\nGathering labelled categories...\n')

categories = []

Extract the folder paths, reduce down to the label and append to the categories list

for i in rawFolderPaths:
    string1 = i.replace('/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/','')
    category = string1.strip('/')
    #print (category)
    categories.append(category)
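
The replace-and-strip approach works, but if you want to avoid hard-coding the path a second time, taking the last path component does the same job:

import os

# Equivalent: the category label is just the final directory name in each path
categories = [os.path.basename(os.path.normpath(i)) for i in rawFolderPaths]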

Load the data

print ('\nLoading the dataset...\n')
docs_to_train = sklearn.datasets.load_files(
    "/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML",
    description=None, categories=categories, load_content=True,
    encoding='utf-8', shuffle=True, random_state=42)
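
load_files returns a Bunch: the text lives in .data, the integer label for each document in .target, and the category names in .target_names, which you can check with something like:

# Category names and the label assigned to the first document
print(docs_to_train.target_names)
print(docs_to_train.target_names[docs_to_train.target[0]])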

Split the dataset into training and testing sets

print ('\nBuilding out hold-out test sample...\n')
X_train, X_test, y_train, y_test = train_test_split(docs_to_train.data, docs_to_train.target, test_size=0.4)

Transform the training data into tfidf vectors

print ('\nTransforming the training data...\n')
count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(raw_documents=X_train)

# use_idf=False here means these are normalised term-frequency vectors rather than full tf-idf
tfidf_transformer = TfidfTransformer(use_idf=False)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print (X_train_tfidf.shape)

Transform the test data into tfidf vectors

print ('\nTransforming the test data...\n')
# Reuse the vectoriser and tf-idf transformer fitted on the training data so
# the test vectors share the same vocabulary and weighting
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
print (X_test_tfidf.shape)

print (X_test_tfidf)
print (y_train.shape)

docs_test = X_test

Construct the classifier pipeline using an SGDClassifier

print ('\nApplying the classifier...\n')
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer(use_idf=True)),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                      alpha=1e-3, random_state=42, verbose=1)),
])

Fit the model to the training data

text_clf.fit(X_train, y_train)

Run the test data through the model

predicted = text_clf.predict(docs_test)

Calculate mean accuracy of predictions

print (np.mean(predicted == y_test))

Generate labelled performance metrics

print(metrics.classification_report(y_test, predicted,
    target_names=docs_to_train.target_names))
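
Finally, since the point of all this is to classify new, unlabeled text, the fitted pipeline can be persisted with joblib and reused later; a minimal sketch (the file name and example text are placeholders):

# Save the fitted pipeline so it can be reused without retraining
joblib.dump(text_clf, 'text_classifier.pkl')

# Later: load it and classify a brand new, unseen document
loaded_clf = joblib.load('text_classifier.pkl')
new_docs = ['Text of a previously unseen document goes here.']
predicted_label = loaded_clf.predict(new_docs)[0]
print(docs_to_train.target_names[predicted_label])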