Topic models

A (Brief) Excursion into Topic Modelling with Mallet

For the past three months or so, I've been experimenting with a range of topic models across several technologies, including R, Python, C++ and Java. 

I've recently been spending time with MALLET (a Java-based suite of NLP tools) and I'm really impressed with how easy this implementation is to get working. 

There is a truly excellent walkthrough courtesy of the Programming Historian right here, which makes everything perfectly clear if you're coming to this without much experience. 

As is usual, I tested MALLET with a reasonably large corpus of judgments from the Criminal Division of the Court of Appeal, which I had organised as .txt files in a directory on my machine.

The following steps provide a basic outline of how I got everything going:

Download MALLET

Speaks for itself. You can download MALLET here. Unzip the .tar file to a directory of your choosing.

Import the data

The first thing we need to do is import the data into MALLET. To do this, at the command line go to the directory in which you unpacked the MALLET .tar file and then run the following command:

bin/mallet import-dir --input path/to/your/data --output topic-input.mallet \
  --keep-sequence --remove-stopwords

This runs MALLET, points it at the directory holding your data, creates the input file you'll use in the next step (topic-input.mallet) and removes common stopwords (like a, of, the, for, etc.).

Build the topic model

The steps above shouldn't have taken you much more than 5-10 minutes. This bit is the fun part - building the topic model. 

At the command line, run:

bin/mallet train-topics --input topic-input.mallet --num-topics 50 \
  --output-state topic-state.gz --output-doc-topics doc-topics.txt \
  --output-topic-keys topic_keys.txt

This passes in the input file generated in the step above, sets the number of topics to generate to 50 and then specifies a range of outputs. 

The most interesting outputs generated are:

  • topic_keys.txt, which sets out the topics and the key terms within those topics
  • doc-topics.txt, which sets out the main topic allocations for each document in the dataset (a quick way of inspecting this file is sketched below).
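If you want to poke around doc-topics.txt programmatically rather than eyeballing it in a text editor, a few lines of Python will do it. Treat this as a rough sketch only: the exact column layout of doc-topics.txt varies between MALLET versions (older builds emit topic/proportion pairs, newer builds emit one proportion per topic in order), so check what yours looks like first.

# Rough sketch for peeking at MALLET's doc-topics.txt.
# Assumes the newer layout: doc index, doc name, then one proportion per topic.
import csv

with open("doc-topics.txt") as f:
    for row in csv.reader(f, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip the optional header line
        doc_name = row[1]
        proportions = [float(v) for v in row[2:] if v.strip()]
        top_topic = max(range(len(proportions)), key=proportions.__getitem__)
        print(f"{doc_name}: topic {top_topic} ({proportions[top_topic]:.2f})")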

A first run with Hierarchical Dirichlet Process

I've been experimenting with Latent Dirichlet Allocation for a while in R, but was looking for a topic model algorithm that did not require the number of topics (k) to be defined a priori, before applying the algorithm to the text data I wanted to work with. 

As a relative newcomer to topic modelling, I hadn't even heard of David Blei or Chong Wang, both of whom I now know to be pioneers in modern topic modelling. Quite by chance, I stumbled on Wang and Blei's implementation of the Hierarchical Dirichlet Process in C++, a topic model where the data determine the number of topics. 

Getting HDP to work required quite a bit of wrangling and I couldn't find any walkthroughs suitable for novices like me, so I thought it would be worth noting up how I managed to get it to work with a small sample set of text data. 

What follows is far from a perfect (or even good) representation of how to apply HDP to text data, but the steps that follow did work for me. Here goes...


Sample Data

It goes without saying that the very first step is to assemble a corpus of text data against which we'll apply the HDP algorithm. In my case, as is usual, I used ten English judgments (all of which are recent decisions from the Criminal Division of the Court of Appeal) in .txt format. Save these into a folder.


Getting the Sample Data ready for HDP

Before we even go near Wang & Blei's algorithm, we need to prepare the sample data in a particular way. 

The algorithm requires the data to be in LDA-C format, which looks like this:

[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]
where [M] is the number of unique terms in the document, each [term_i] is an integer index into the vocabulary, and the [count] associated with each term is how many times that term appeared in the document.
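To make that concrete, here's a toy illustration (my own, not part of any of the tools used here) of how a single document becomes one LDA-C line. The vocabulary indices are made up purely for the example:

# Toy illustration of the LDA-C format for a single document.
# The vocabulary and its indices here are invented for the example.
from collections import Counter

vocab = {"court": 0, "dismissed": 1, "appeal": 2}
tokens = ["court", "dismissed", "appeal", "court"]   # stopwords already stripped

counts = Counter(vocab[t] for t in tokens)
line = f"{len(counts)} " + " ".join(f"{term}:{n}" for term, n in sorted(counts.items()))
print(line)   # prints: 3 0:2 1:1 2:1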

This presents the first problem, because our data appear as words in .txt format files. Fortunately, there's an excellent Python program called text2ldac that comes to our rescue. Text2ldac takes the data in .txt format and outputs the files we need, in the form in which we need them.

Clone text2ldac from the git repo here

Once you've pulled down text2ldac, you're ready to take your text files and process them. To do this, go to the command line and run the following command (making adjustments to the example that follows to suit your own directories and filenames):

$ python text2ldac.py --stopwords stopwords.txt /Users/danielhoadley/Documents/Topic_Model/text2ldac/input

All that's happening here is we're running text2ldac.py, using the --stopwords flag to pass in our stopwords (which, in my case, are in a file named stopwords.txt) and then passing in the directory that contains our .txt files.

This will output three files: a .dat file (e.g. input.dat), which is the all-important LDA-C formatted input for the HDP algorithm; a .vocab file, which contains all of the words in the corpus (one word per line); and a .dmap file, which lists the input .txt documents. 
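If you want to reassure yourself that the conversion worked, a quick sanity check is to confirm the three files agree with one another. This is my own check, not part of text2ldac, and the file names are just the ones from my run:

# Quick sanity check on text2ldac's output (not part of text2ldac itself).
with open("input.vocab") as f:
    vocab = [line.strip() for line in f]
with open("input.dmap") as f:
    docs = [line.strip() for line in f]
with open("input.dat") as f:
    dat_lines = f.readlines()

# One LDA-C line per document, and no term index outside the vocabulary.
assert len(dat_lines) == len(docs)
max_index = max(int(pair.split(":")[0]) for line in dat_lines for pair in line.split()[1:])
assert max_index < len(vocab)
print(f"{len(docs)} documents, {len(vocab)} vocabulary terms")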


Time to run HDP

Now that we have our data in the format required by the HDP algorithm, we're ready to apply the algorithm. 

For convenience, I recommend copying the three files generated by text2ldac into the folder you're going to run HDP from, but you can leave them wherever you like. 

Go to the folder containing the HDP program files and run the following command (again, adjusting to your own folder and filenames):

$ ./hdp --algorithm train --data /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat --directory train_dir

Let's unpack this a bit: 

1. ./hdp invokes the HDP program

2. The --algorithm flag sets the algorithm to be applied, namely train

3. The --data flag points to the .dat file produced by text2ldac

4. --directory train_dir is telling HDP to place the output files in a directory called train_dir

You'll know you've successfully executed the program if the prompt begins printing something that looks like this:

Program starts with following parameters:
algorithm:= train
data_path:= /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat
directory:= trainer
max_iter= 1000
save_lag= 100
init_topics = 0
random_seed = 1488746763
gamma_a = 1.00
gamma_b = 1.00
alpha_a = 1.00
alpha_b = 1.00
eta = 0.50
#restricted_scans = 5
split-merge = no
sampling hyperparam = no

reading data from /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat

number of docs: 9
number of terms : 5795
number of total words : 35865

starting with 7 topics 

iter = 00000, #topics = 0008, #tables = 0076, gamma = 1.00000, alpha = 1.00000, likelihood = -305223.54210
iter = 00001, #topics = 0008, #tables = 0079, gamma = 1.00000, alpha = 1.00000, likelihood = -301582.68017
iter = 00002, #topics = 0008, #tables = 0079, gamma = 1.00000, alpha = 1.00000, likelihood = -300273.98808

I'm not going to go into all of these parameters here, but the main one to note is max_iter, which sets the number of times the algorithm walks over the input data. 

Note also that the algorithm decides for itself how many topics to work with: in the example above it started with 7 and had already moved to 8 within the first few iterations.

The algorithm will produce a bunch of .dat files. The one we're really interested in should have a name like mode-topics.dat, rather than something like 00300-topics.dat (which is the snapshot saved on the 300th iteration of the algorithm's walk). 
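As a small convenience, here's a snippet (my own, not part of HDP) that picks out the final mode-topics.dat file from the output directory and lists the per-iteration snapshots alongside it:

# List HDP's output files and single out the final topics file.
from pathlib import Path

out_dir = Path("train_dir")   # whatever you passed to --directory
mode_file = out_dir / "mode-topics.dat"
snapshots = sorted(p.name for p in out_dir.glob("*-topics.dat") if not p.name.startswith("mode"))

print("final topics file:", mode_file if mode_file.exists() else "not written yet")
print("iteration snapshots:", snapshots)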


Printing the topics determined by HDP

Wang & Blei very helpfully provided an R script, print.topics.R, which you can use to turn the results of the algorithm into a human-readable form. This is helpful because the output generated by the algorithm will look like this:

00001 00001 00007 00007 00007
00011 00000 00000 00000 00000

The key thing at this stage is to keep track of the two files you'll need as input for the R script: the mode-topics.dat file (or similar name) generated by HDP and the .vocab file generated by text2ldac. 

Go back to the command line and navigate to the folder that contains print.topics.R. First, you'll need to make the R script executable, so run:

$ sudo chmod +x print.topics.R

Then run

$ ./print.topics.R /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/trainer/mode-topics.dat /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.vocab topics.dat 4

1. ./print.topics.R runs the R script

2. The first argument is the path to the mode-topics.dat file produced by the HDP algorithm

3. The second argument is the path to the .vocab file produced by text2ldac

4. The third argument is the name of the file you want to output the human-readable result to, e.g. topics.dat

5. Finally, the fourth argument, which isn't mandatory, is the number of terms per topic you wish to output - the default is 5.
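For what it's worth, if you'd rather stay out of R, the same result can be approximated in a few lines of Python. This is only a sketch and rests on an assumption about the file layout, namely that mode-topics.dat is a whitespace-separated matrix with one row per topic and one count column per vocabulary term, which is what my output looked like:

# Rough Python stand-in for print.topics.R: print the top terms of each topic.
# Assumes mode-topics.dat has one row per topic and one count per vocab term.
n_terms = 4

with open("input.vocab") as f:
    vocab = [line.strip() for line in f]

with open("mode-topics.dat") as f:
    for topic_id, line in enumerate(f):
        counts = [int(c) for c in line.split()]
        top = sorted(range(len(counts)), key=counts.__getitem__, reverse=True)[:n_terms]
        print(f"Topic {topic_id}:", " ".join(vocab[i] for i in top))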


The output

If everything has worked as it ought to have done, you'll see each topic and its top terms when you open the topics.dat file in RStudio.

The output is by no means perfect the first time around. Better results will probably depend on hitting the source data with a raft of stop words and tweaking HDP's many parameters, but it's a good start.