A (Brief) Excursion into Topic Modelling with Mallet

For the past three months or so, I've been experimenting with a range of topic models across a range of technologies, including R, Python, C++ and Java. 

I've recently been spending time with MALLET (a Java-based suite of NLP tools) and I'm really impressed with how easy this implementation is to get working. 

There is a truly excellent walkthrough courtesy of the Programming Historian right here, which makes everything perfectly clear if you're coming towards this without much experience. 

As is usual, I tested MALLET with a reasonably large corpus of judgments from the Criminal Division of the Court of Appeal, which I had organised as .txt files in a directory on my machine.

The following steps provide a basic outline of how I got everything going:

Download MALLET

Speaks for itself. You can download MALLET here. Unzip the .tar file to a directory of your choosing.

Import the data

The first thing we need to do is import the data into MALLET. To do this, go to the directory in which you unpacked the MALLET .tar file at the command line and then run the following command:

bin/mallet import-dir --input path/to/the/your/data --output topic-input.mallet \
  --keep-sequence --remove-stopwords

This runs MALLET, points it to the directory holding your data, creates the input file you'll use in the next step (topic-input.mallet) and removes uninteresting words (like a, of, the, for, etc)

Build the topic model

The steps above shouldn't have taken you much more than 5-10 minutes. This bit is the fun part - building the topic model. 

At the command line, run:

bin/mallet train-topics --input topic-input.mallet --num-topics 50 --output-state topic-state.gz --output-doc-topics doc-topics.txt --output-topic-keys topic_keys.txt

This passes in the input file generated in the step above, sets the number of topics to generate at 50 and then specifies a range of outputs. 

The most interesting outputs generated are:

  • topic-keys.txt, which sets out the topics and the key terms within those topics
  • doc-topics, which sets out the main topic allocations for each document in the dataset.