Text Classification

Visually clustering case law

I’ve been experimenting with a Python package called Yellowbrick, which provides a suite of visualisers designed to give you insight into a dataset when working on machine learning problems.

One of my side projects at the moment is looking at ways in which an unseen case law judgment can be processed to determine its high-level subject matter (e.g. Crime, Tort, Human Rights, Occupiers’ Liability etc.) with text classification.

I did a quick experiment with Yellowbrick to visualise a small portion of the dataset I’m going to use to train a classifier. And here it is:

t-SNE projection of 14 case law topic clusters using Yellowbrick

The chart above is called a t-SNE (t-distributed stochastic neighbour embedding) projection. Each blob on the chart represents a judgment.

I was pretty pleased with this and several quick insights surfaced:

  • Crime may be a bit over-represented in my training data; I’ll need to cut it back a bit

  • Some of the labels in the data may need a bit of merging, for example “false imprisonment” can probably be handled by the “crime” data (a rough sketch of both fixes follows the code below)

  • There are a couple of interesting sub-clusters within the crime data (I’m guessing one of the clusters will be evidence and the other sentencing)

  • Human Rights as a topic sits right in the middle of the field between the crime cluster and the clusters of non-criminal topics

The code

Dependencies

from yellowbrick.text import TSNEVisualizer
from yellowbrick.style import set_palette
from sklearn.feature_extraction.text import TfidfVectorizer
import os

from sklearn.utils import Bunch  # lived at sklearn.datasets.base in older scikit-learn releases
from tqdm import tqdm
import matplotlib.pyplot as plt

set_palette('paired')

Load the corpus

def load_corpus(path):
    """
    Loads and wrangles the passed in text corpus by path.
    """

    # Check if the data exists, otherwise download or raise
    if not os.path.exists(path):
        raise ValueError((
            "'{}' dataset has not been downloaded, "
            "use the yellowbrick.download module to fetch datasets"
        ).format(path))

    # Read the directories in the directory as the categories.
    categories = [
        cat for cat in os.listdir(path)
        if os.path.isdir(os.path.join(path, cat))
    ]

    files  = [] # holds the file names relative to the root
    data   = [] # holds the text read from the file
    target = [] # holds the string of the category

    # Load the data from the files in the corpus
    for cat in categories:
        for name in os.listdir(os.path.join(path, cat)):
            files.append(os.path.join(path, cat, name))
            target.append(cat)

            with open(os.path.join(path, cat, name), 'r') as f:
                data.append(f.read())


    # Return the data bunch for use similar to the newsgroups example
    return Bunch(
        categories=categories,
        files=files,
        data=data,
        target=target,
    )


corpus = load_corpus('/Users/danielhoadley/Desktop/common_law_subset')

Vectorise and transform the data

tfidf  = TfidfVectorizer(use_idf=True)
docs   = tfidf.fit_transform(corpus.data)
labels = corpus.target

Generate the visualisation

tsne = TSNEVisualizer(size=(1080, 720), title="Case law clusters")
tsne.fit(docs, labels)
tsne.poof()
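
Two of the insights above suggest changes to the training data: folding the “false imprisonment” label into “crime” and then cutting the crime class back. Here is a rough sketch of how that might look; the label names and the cap of 200 documents are illustrative rather than the actual values in my corpus.

import random

# Hypothetical clean-up of the corpus loaded above: merge one label into
# another, then downsample the over-represented "crime" class
merged = ['crime' if t == 'false_imprisonment' else t for t in corpus.target]

crime_idx = [i for i, t in enumerate(merged) if t == 'crime']
other_idx = [i for i, t in enumerate(merged) if t != 'crime']

random.seed(42)
keep = set(random.sample(crime_idx, min(len(crime_idx), 200))) | set(other_idx)

data   = [d for i, d in enumerate(corpus.data) if i in keep]
labels = [t for i, t in enumerate(merged) if i in keep]

docs = tfidf.fit_transform(data)  # re-vectorise the rebalanced corpus

Re-running the TSNEVisualizer over docs and labels should then give a less crime-dominated picture.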




Using Scikit-Learn to classify your own text data (the short version)

Last month I posted a lengthy article on how to use Scikit-Learn to build a cross-validated classification model on your own text data. The purpose of that article was to provide an entry point for new Scikit-Learn users who wanted to move away from using the built-in datasets (like twentynewsgroups) and focus on their own corpora.

I thought it might be useful to post a condensed version of the longer read for people who wanted to skip over the explanatory material and get started with the code.

As before, the objective of the code is as follows. We have a dataset consisting of multiple directories, each containing n text files. Each directory name acts as a descriptive category label for the files contained within (e.g. technology, finance, food). We're going to use this data to build a classifier capable of receiving new, unlabelled text data and assigning it to the best-fitting category.

The code

import sklearn
import numpy as np
from glob import glob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.externals import joblib

Get paths to labelled data

rawFolderPaths = glob("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/*/")

print ('\nGathering labelled categories...\n')

categories = []

Extract the folder paths, reduce down to the label and append to the categories list

for i in rawFolderPaths:
    string1 = i.replace('/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/','')
    category = string1.strip('/')
    #print (category)
    categories.append(category)

Load the data

print ('\nLoading the dataset...\n')
docs_to_train = sklearn.datasets.load_files(
    "/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML",
    description=None, categories=categories, load_content=True,
    encoding='utf-8', shuffle=True, random_state=42)

Split the dataset into training and testing sets

print ('\nBuilding out hold-out test sample...\n')
X_train, X_test, y_train, y_test = train_test_split(docs_to_train.data, docs_to_train.target, test_size=0.4)

Transform the training data into tfidf vectors

print ('\nTransforming the training data...\n')
count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(raw_documents=X_train)

tfidf_transformer = TfidfTransformer(use_idf=False)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print (X_train_tfidf.shape)

Transform the test data into tfidf vectors

print ('\nTransforming the test data...\n')
# Re-use the vectoriser and transformer fitted on the training data so the
# test vectors share the training vocabulary and feature space
X_test_counts = count_vect.transform(raw_documents=X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
print (X_test_tfidf.shape)

print (X_test_tfidf)
print (y_train.shape)

docs_test = X_test

Construct the classifier pipeline using a SGDClassifier algorithm

print ('\nApplying the classifier...\n')
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer(use_idf=True)),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                      alpha=1e-3, random_state=42, verbose=1)),
])

Fit the model to the training data

text_clf.fit(X_train, y_train)

Run the test data into the model

predicted = text_clf.predict(docs_test)

Calculate mean accuracy of predictions

print (np.mean(predicted == y_test))

Generate labelled performance metrics

print(metrics.classification_report(y_test, predicted,
    target_names=docs_to_train.target_names))
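
Incidentally, joblib is imported at the top of the script but never used. If you want to persist the trained pipeline rather than refitting it on every run, here's a minimal sketch; the filename and the example sentence are placeholders (and on newer scikit-learn installs you'd use import joblib directly rather than sklearn.externals).

# Save the fitted pipeline to disk, then reload it and classify new text
joblib.dump(text_clf, 'case_law_classifier.pkl')

clf = joblib.load('case_law_classifier.pkl')
predicted_label = clf.predict(['The appellant appeals against conviction ...'])
print (docs_to_train.target_names[predicted_label[0]])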

Using scikit-Learn on your own text data

Scikit-learn’s Working with Text Data provides a superb starting point for learning how to harness the power and ease of the sklearn framework for the construction of really powerful and accurate predictive models over text data. The only problem is that scikit-learn’s extensive documentation (and, be in no doubt, the documentation is phenomenal) doesn’t help much if you want to apply a cross-validated model to your own text data.

At some point, you’re going to want to move away from experimenting with one of the built-in datasets (e.g. twentynewsgroups) and start doing data science on textual material you understand and care about. 

The purpose of this tutorial is to demonstrate the basic scaffold you need to build to apply the power of scikit-learn to your own text data. I’d recommend methodically working your way through the Working with Text Data tutorial before diving in here, but if you really want to get cracking, read on.

If you can't be bothered reading on and just want to see the code, it's in a repo on GitHub, here.

Objectives

Before we start, let’s be clear about what we’re trying to do. We have a great big collection of text documents (ideally as plain text from the off). Our documents are, to use the twentynewsgroups example, all news articles. The news articles have been grouped together, in directories, by their subject matter. We might have one subdirectory consisting of technology articles, called Technology. We might have another subdirectory consisting of articles about tennis, called Tennis.

Our project directory might look like this (assume each subdirectory has 100 text documents inside):

news_articles \
    art
    business
    culture
    design
    food
    technology
    tennis
    war

The aim of the game is to use this data to train a classifier that is capable of analysing a new, unlabelled article and determining which bucket to put it in (this is an article about food, this is an article about business, etc). 

What our code is going to do

We’re going to write some code, using scikit-learn, that does the following:

  • Loads our dataset of news articles and categorises those articles according to the name of the folder they live in (e.g. art, food, tennis)
  • Splits the dataset into two chunks: a chunk we’re going to use to train our classifier and another chunk that we’re going to use to test how good the classifier is
  • Converts the training data into a form the classifier can work with
  • Converts the test data into a form the classifier can work with
  • Builds a classifier 
  • Applies that classifier to our training data
  • Fires the test data into our trained classifier
  • Tells us how well the classifier did at predicting the right label (art, food, tennis etc) of each document in the test dataset

1. Get the environment ready

The first job is to bring in everything we need from scikit-learn:

import sklearn
import numpy as np
from glob import glob
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline 

That’s the stage set in terms of bringing in our dependencies.

2. Set our categories

The next job is to state the names of the categories (our folders of grouped news articles) in a list. These need to exactly match the names of the subdirectories acting as the categorical buckets in your project directory.

categories = ['art', 'business', 'culture', 'design', 'food', 'technology', 'tennis', 'war']

This approach of manually setting the folder names works well if you only have a few categories or you’re just using a small sample of a larger set of categories. However, if you’ve got lots of category folders, manually entering them as list items is going to be a bore and will make your code very, very ugly (I'll write a separate blog post on a better way of dealing with this, or look at the repo on GitHub, which incorporates the solution to this problem).
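
In the meantime, here’s a minimal sketch of one way to build the categories list from the subdirectory names themselves rather than typing them out (the project path below is a placeholder):

import os

# Treat every subdirectory of the project folder as a category label
project_dir = '/path/to/the/project/folder/'
categories = sorted(
    d for d in os.listdir(project_dir)
    if os.path.isdir(os.path.join(project_dir, d))
)
print(categories)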

3. Load the data

We’re now ready to load our data in:

docs_to_train = sklearn.datasets.load_files("/path/to/the/project/folder/",
    description=None, categories=categories,
    load_content=True, encoding='utf-8', shuffle=True, random_state=42)

All we’re doing here is saying that our dataset, docs_to_train, consists of the files contained within the subdirectories of the path passed to the .load_files function, and that the categories are those set out in our categories list (see above). Forget about the other stuff in there for now.
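
Before moving on, it’s worth a quick sanity check on what load_files has actually returned. The object is a scikit-learn Bunch that exposes the documents, the numeric targets and the category names:

print(len(docs_to_train.data))      # how many documents were loaded
print(docs_to_train.target_names)   # the category labels
print(docs_to_train.target[:10])    # numeric label indices for the first ten documents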

4. Split the dataset we’ve just loaded into a training set and a test set

This is where the real work begins. We’re going to use the entire dataset, docs_to_train, to both train and test our classifier. For this reason, we’ve got to split the dataset into two chunks: one chunk for training and another chunk (that the classifier won’t get to look at in training) for testing. We’re going to “hold out” 40% of the dataset for testing:

X_train, X_test, y_train, y_test = train_test_split(docs_to_train.data,
    docs_to_train.target, test_size=0.4)

It’s really important to understand what this line of code is doing. 

First, we’re creating four new objects, X_train, X_test, y_train and y_test. The X objects are going to hold our data, the content of the text files. We’ve got one X object, X_train, and that will hold the text file data we’ll use to train the classifier. We have another X object, X_test, and that will hold the text file data we’ll use to test the classifier. The Xs are the data.

Then we have the ys. The y objects hold the category names (art, culture, war etc). y_train will hold the category names that correspond to the text data in X_train. y_test will hold the category names that correspond to the text data in X_test. The y values are the targets. 

Finally, we’re using test_size=0.4 to say that out of all the data in docs_to_train we want 40% to be held out for the test data in X_test and y_test.
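
A quick way to check the split behaved as expected (the counts here are purely illustrative):

# With, say, 1,000 documents and test_size=0.4 this should print roughly 600 and 400
print(len(X_train), len(X_test))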

5. Transform the training data into a form the classifier can work with

Our classifier uses mathematics to determine whether Document X belongs in bucket A, B, or C. The classifier therefore expects numeric data rather than text data. This means we’ve got to take our text training data, stored in X_train, and transform it into a form our classifier can work with. 

count_vect = CountVectorizer(stop_words='english')

X_train_counts = count_vect.fit_transform(raw_documents=X_train)

These two lines are doing a lot of heavy lifting and I would strongly urge you to go back to the Working with Text Data tutorial to fully understand what’s going on here.

The first thing we’re doing is setting up a vectoriser, CountVectorizer(). This is a function that will count the number of times each word in the dataset occurs and project that count into a vector. 

Then we apply that vectoriser to the training data stored in X_train and store the resulting occurrence vectors in X_train_counts.

Once that’s done we move on to the clever transformation bit. We’re going to take the occurrence counts, stored in X_train_counts, and transform them into a term frequency inverse document frequency value. 

tfidf_transformer = TfidfTransformer(use_idf=True)

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

Why are we doing this? Well, if you think about it, the documents in your training set will naturally vary in length; some are going to be long, others are going to be short. Longer documents have more words in them and that’s going to result in higher word counts across the board, which will skew your results. What we really want is a sense of the count of each word proportionate to the number of words in the document. Tf-idf (term frequency-inverse document frequency) achieves this. 
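
To make that a little more concrete, here’s a toy example with made-up counts: the same three words appear in a short document and in a document ten times longer, and after the tf-idf transformation both rows end up on the same scale.

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[2, 1, 1],       # short document
                   [20, 10, 10]])   # a document ten times longer
print(TfidfTransformer(use_idf=True).fit_transform(counts).toarray())
# Both rows come out identical after normalisation because the relative
# frequencies of the words are the same in each document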

6. Transform the test data into a form the classifier can work with

Since we’ve gone to the trouble of splitting the dataset into a training set and a test set, we also need to transform our test data so that it lives in the same feature space as the training data. Rather than fitting a fresh vectoriser on X_test (which would learn a different vocabulary and produce incompatible vectors), we re-use the vectoriser and transformer we just fitted on X_train and simply call transform: 

# Re-use the vectoriser and transformer fitted on the training data
X_test_counts = count_vect.transform(raw_documents=X_test)

X_test_tfidf = tfidf_transformer.transform(X_test_counts)

7. Scikit-learn gives us a far better way to deal with these transformations: pipelines!

It was worth reading about the transformation process, because if you’re working with text data and trying to do science with it you really do need to at least see why and how that text is transformed into a numerical form a predictive classifier can deal with. 

However, scikit-learn actually gives us a far more efficient way (in terms of lines of code) to deal with the transformations: a pipeline. The pipeline in this example has three phases. The first creates the vectoriser, the machine used to turn our text into numbers, as a count of occurrences. The second transforms that crude vectorisation into a frequency-based representation of the data, the term frequency-inverse document frequency. Finally, and most excitingly, the third phase sets up the classifier, the machine that’s going to train the model. 

Here’s the pipeline code:

text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                          random_state=42, verbose=1)),
])

For now, don’t worry about the parameters set out in the classifier, just appreciate the structure and content of the pipeline. (For what it’s worth, SGDClassifier with loss='hinge' and an l2 penalty is effectively a linear support vector machine trained with stochastic gradient descent.) 

8. Deploy the pipeline and train the model

Now it’s time to train our model by applying the pipeline we’ve just built to our training data. All we’re doing here is taking our training data (X_train) and the corresponding training labels (y_train) and passing them into the fit function that comes built into the pipeline. 

text_clf.fit(X_train, y_train)

Depending on how big your dataset is, this could take a few minutes or a bit longer. 

As a sidenote, you might have noticed that I set the verbose parameter in the classifier to 1. This is purely so I can see that the classifier is running and that the script isn’t hanging because I’m chewing through memory. 

9. Test the model we’ve just trained using the test data

We’ve now trained our model on the training data. It’s time to see how well trained the model really is by letting it loose on our test data. What we’re going to do is take the test data, X_test, let the model evaluate it based on what it learned from being fed the training data (X_train) and the training labels (y_train) and see what categories the model predicts the test data belongs to. 

predicted = text_clf.predict(X_test)

We can measure the model’s accuracy by taking the mean of the classifier’s predictive accuracy like so:

print (np.mean(predicted == y_test))

Better yet, we can use scikit-learn’s built-in metrics library to give us some detailed performance statistics by category (i.e. how well did the classifier do at predicting that an article about “design” is an article about “design”).

print(metrics.classification_report(y_test, predicted, 
    target_names=docs_to_train.target_names))

The metrics will give you precision, recall and F1 scores for each category, along with overall averages, each between 0 and 1. 

The closer that average score gets to 1, the better the model will perform. Mind you, if your model is averaging a score of 1 on the nose, something has gone wrong!   
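
If you want to see where the classifier is going wrong, rather than just how often, the confusion matrix is a useful companion to the report (it uses the metrics module already imported above):

# Rows are the true categories, columns the predicted ones; large
# off-diagonal values show which pairs of categories get confused
print(metrics.confusion_matrix(y_test, predicted))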

A first run with Hierarchical Dirichlet Process

I've been experimenting with Latent Dirichlet Allocation in R for a while, but was looking for a topic model algorithm that did not require the number of topics (k) to be defined a priori, before the algorithm is applied to the text data I wanted to work with. 

As a relative newcomer to topic modelling, I hadn't even heard of David Blei or Chong Wang, both of whom I now know to be pioneers in modern topic modelling. Quite by chance, I stumbled on Wang and Blei's implementation of the Hierarchical Dirichlet Process in C++, a topic model where the data determine the number of topics. 

Getting HDP to work required quite a bit of wrangling and I couldn't find any walkthroughs suitable for novices like me, so I thought it would be worth noting up how I managed to get it to work with a small sample set of text data. 

What follows is far from a perfect (or even good) representation of how to apply HDP to text data, but the steps that follow did work for me. Here goes...


Sample Data

It goes without saying that the very first step is to assemble a corpus of text data against which we'll apply the HDP algorithm. In my case, as is usual, I used ten English judgments (all of which are recent decisions from the Criminal Division of the Court of Appeal) in .txt format. Save these into a folder.


Getting the Sample Data ready for HDP

Before we even go near Wang & Blei's algorithm, we need to prepare the sample data in a particular way. 

The algorithm requires the data to be in LDA-C format, which looks like this:

[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document.
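
To make the format concrete, here is a small sketch of how one document's word counts map onto a single LDA-C line; the vocabulary indices and counts are made up.

# Made-up vocabulary indices and counts for one document
vocab = {'appeal': 0, 'conviction': 1, 'sentence': 2}
doc_counts = {'appeal': 4, 'conviction': 2, 'sentence': 1}

terms = [f"{vocab[w]}:{n}" for w, n in doc_counts.items()]
print(f"{len(doc_counts)} " + " ".join(terms))  # prints: 3 0:4 1:2 2:1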

This presents the first problem, because our data appear as words in .txt format files. Fortunately, there's an excellent Python program called text2ldac that comes to our rescue. Text2ldac takes the data in .txt format and outputs the files we need, in the form in which we need them.

Clone text2ldac from the git repo here

Once you've pulled down text2ldac, you're ready to take your text files and process them. To do this, go to the command line and run the following command (make adjustments to the example that follows to suit your own directories and filenames):

$ python text2ldac.py --stopwords stopwords.txt /Users/danielhoadley/Documents/Topic_Model/text2ldac/input

All that's happening here is we're running text2ldac.py, using the --stopwords flag to pass in our stopwords (which, in my case, are in a file named stopwords.txt) and then passing in the directory that contains our .txt files.

This will output three files: a .dat file (e.g. input.dat), which is the all-important LDA-C formatted input for the HDP algorithm; a .vocab file, which contains all of the words in the corpus (one word per line); and a .dmap file, which lists the input .txt documents. 


Time to run HDP

Now that we have our data in the format required by the HDP algorithm, we're ready to apply the algorithm. 

For convenience, I recommend copying the three files generated by text2ldac into the folder you're going to run HDP from, but you can leave them wherever you like. 

Go to the folder containing the HDP program files and run the following command (again, adjust to your own folder and filenames):

$ ./hdp --algorithm train --data /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat --directory train_dir

Let's unpack this a bit: 

1. ./hdp invokes the HDP program

2. The --algorithm flag sets the algorithm to be applied, namely train

3. The path that follows the algorithm flag points to the .dat file produced by text2ldac

4. --directory train_dir is telling HDP to place the output files in a directory called train_dir

You'll know you've successfully executed the program if the prompt begins printing something that looks like this:

Program starts with following parameters:
algorithm:= train
data_path:= /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat
directory:= trainer
max_iter= 1000
save_lag= 100
init_topics = 0
random_seed = 1488746763
gamma_a = 1.00
gamma_b = 1.00
alpha_a = 1.00
alpha_b = 1.00
eta = 0.50
#restricted_scans = 5
split-merge = no
sampling hyperparam = no

reading data from /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat

number of docs: 9
number of terms : 5795
number of total words : 35865

starting with 7 topics 

iter = 00000, #topics = 0008, #tables = 0076, gamma = 1.00000, alpha = 1.00000, likelihood = -305223.54210
iter = 00001, #topics = 0008, #tables = 0079, gamma = 1.00000, alpha = 1.00000, likelihood = -301582.68017
iter = 00002, #topics = 0008, #tables = 0079, gamma = 1.00000, alpha = 1.00000, likelihood = -300273.98808

I'm not going to go into all of these parameters here, but the main one to note is max_iter, which sets the number of times the algorithm walks over the data. 

Note also that the algorithm has decided by itself how many topics it's going to work with (in the above example, 7).

The algorithm will produce a bunch of .dat files. The one we're really interested in should have a name like mode-topics.dat and not like this 00300-topics.dat (which was produced on the 300th iteration of the algorithm's walk). 


Printing the topics determined by HDP

Wang & Blei very helpfully provided a R script, print.topics.r, which you can use to turn the results of the algorithm into a human-readable form. This is helpful because the output generated by the algorithm will look like this:

00001 00001 00007 00007 00007
00011 00000 00000 00000 00000

The key thing at this stage is to remember two key files you'll need as input for the R script: the mode-topics.dat file (or similar name) generated by HDP and the .vocab file generated by text2ldac. 

Go back to the command line and navigate to the folder that contains print.topics.r. First, you'll need to make the R script executable, so run:

$ sudo chmod +x print.topics.R

Then run

$ ./print.topics.r /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/trainer/mode-topics.dat /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.vocab topics.dat 4

1. ./print.topics.r runs the R script

2. The first argument is the path to the mode-topics.dat file produced by the HDP algorithm

3. The second argument is the path to the .vocab file produced by text2ldac

4. The third argument is the name of the file you want to output the human-readable result to, e.g. topics.dat

5. Finally, the fourth argument, which isn't mandatory, is the number of terms per topic you wish to output - the default is 5.


The output

If everything has worked as it ought to have done, you'll be able to see each topic and its top terms by opening the topics.dat file (I viewed mine in RStudio).

The output is by no means perfect the first time around. Better results will probably depend on hitting the source data with a raft of stop words and tweaking HDP's many parameters, but it's a good start.

Algorithmically Topic Modelling Judgments

Like many others that work in the information/publishing sector, I have developed a keen interest in learning how to make use of machine learning and text mining technology to enhance the information I work on (in my working context, the information is case law). 

Over the Christmas break I started to experiment with a statistical programming language called R, which has a decent suite of text mining functionality available right out of the box.  

I wanted to put R to use to tackle a simple and practical question: is it possible to accurately classify a judgment algorithmically with relative ease?

In order to test R against this particular use-case, I constructed a simple experiment.

Build sample corpus of data

To run the experiment I needed a small batch of sample judgments. I selected nine recent judgments from BAILII: three from the Criminal Division of the Court of Appeal; three from the Family Court; and three from the Commercial Court. 

Success Factors

For R to be successful in the experiment, it would need to algorithmically classify the nine cases to the correct topic, i.e. the three criminal cases should be grouped together, as should the commercial cases, etc. 

Method

I’ll write up more detailed notes on the method and code used in this experiment, but essentially the following steps would be taken:

  1. Load the nine judgments into R as text files.
  2. Pre-process the text files to remove unwanted material (like punctuation, numbers and standard stop words).
  3. Analyse the most frequently occurring terms in the corpus of judgments and remove additional stop words that would be common across the entire dataset.
  4. Apply a topic modelling algorithm (the Latent Dirichlet Allocation (LDA) model) to the corpus of judgments to algorithmically allocate each of the judgments to one of three topics (criminal, family or commercial).
  5. Match up the judgments to a topic.
  6. Produce a matrix of the key terms governing allocation to a topic.
  7. Produce a matrix detailing the respective probability of each case's allocation to a topic.

Results

The LDA algorithm did a decent job of allocating judgments to one of the three topics. First off, we can take a look at the key terms the model has used to allocate each case to a topic. The first column in the table below, for example, sets out the terms viewed as most relevant when allocating a judgment to the family topic (the terms are stemmed, which is why they appear as "claus", "evid" and so on). 

  Family    Commercial  Criminal
1 children  claus       court
2 order     claim       evid
3 court     polici      appel
4 evid      vote        appeal
5 child     period      case
6 made      manag       convict

Now let's look at the actual allocation of judgments to topics:

                  V1
a.txt            Fam
adamantine.txt  Comm
arc.txt         Comm
canary.txt      Comm
f.txt            Fam
n.txt            Fam
r_v_amjad.txt   Crim
r_v_burke.txt   Crim
r_v_garland.txt Crim

Each of the nine judgments has been allocated to the correct topic. We can also have a deeper look at the probability for each allocation.

                Family  Commercial  Criminal
a.txt             0.66        0.13      0.21
adamantine.txt    0.05        0.92      0.04
arc.txt           0.06        0.90      0.04
canary.txt        0.05        0.91      0.04
f.txt             0.82        0.03      0.15
n.txt             0.86        0.03      0.11
r_v_amjad.txt     0.11        0.08      0.82
r_v_burke.txt     0.14        0.10      0.76
r_v_garland.txt   0.12        0.02      0.85

Conclusion

This experiment shows that it is possible to use a language like R to accurately fit a topic model onto the text of judgments. However, there are two obvious limitations associated with this particular implementation of text classification.

First, the number of topics needs to be defined a priori. That's absolutely fine where you have an idea of the number of topics in advance of the modelling (as in the case of this experiment), but if you don't know how many topics there are in the corpus, you'll probably have to run the model more than once and experiment with the number of topics. 

The second issue is one of scalability. This experiment used a small corpus (only nine judgments). The larger and more varied the corpus, the more heavy lifting is required to stage the data well. 

However, regardless of these limitations, the fantastic thing about this modelling approach is the ease with which it can be deployed.