I've been experimenting with Latent Dirichlet Allocation in R for a while, but I was looking for a topic model algorithm that did not require the number of topics (k) to be defined a priori, before the algorithm is applied to the text data I wanted to work with.
As a relative newcomer to topic modelling, I hadn't even heard of David Blei or Chong Wang, both of whom I now know to be pioneers in modern topic modelling. Quite by chance, I stumbled on Wang and Blei's C++ implementation of the Hierarchical Dirichlet Process (HDP), a topic model in which the data determine the number of topics.
Getting HDP to work required quite a bit of wrangling and I couldn't find any walkthroughs suitable for novices like me, so I thought it would be worth writing up how I managed to get it to work with a small sample set of text data.
What follows is far from a perfect (or even good) representation of how to apply HDP to text data, but the steps that follow did work for me. Here goes...
It goes without saying that the very first step is to assemble a corpus of text data to which we'll apply the HDP algorithm. In my case, as usual, I used ten English judgments (all of which are recent decisions from the Criminal Division of the Court of Appeal) in .txt format. Save these into a folder.
Getting the Sample Data ready for HDP
Before we even go near Wang & Blei's algorithm, we need to prepare the sample data in a particular way.
The algorithm requires the data to be in LDA-C format, which looks like this:
[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]
where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document.
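Note that each [term] is not the word itself but an integer index into the vocabulary file (zero-based, so the first line of the .vocab file is term 0). To make that concrete, here's a toy illustration of my own (not taken from the real data): a document reading "appeal appeal court sentence", with a .vocab file containing appeal, court and sentence on its three lines, would be encoded as:

3 0:2 1:1 2:1

That is: 3 unique terms; term 0 ("appeal") appears twice, and terms 1 ("court") and 2 ("sentence") appear once each.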
This presents the first problem, because our data appear as words in .txt format files. Fortunately, there's an excellent Python program called text2ldac that comes to our rescue. Text2ldac takes the data in .txt format and outputs the files we need, in the form in which we need them.
Clone text2ldac from the git repo here.
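If you're new to git, cloning it is a one-liner at the command line (substitute the actual repository URL from the link above):

$ git clone <text2ldac-repo-url>
$ cd text2ldac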
Once you've pulled down text2ldac, you're ready to take your text files and process them. To do this, go to the command line and run the following command (making adjustments to the example that follows to suit your own directories and filenames):
$ python text2ldac.py --stopwords stopwords.txt /Users/danielhoadley/Documents/Topic_Model/text2ldac/input
All that's happening here is we're running text2ldac.py, using the --stopwords flag to pass in our stopwords (which, in my case, are in a file named stopwords.txt) and then passing in the directory that contains our .txt files.
This will output three files: a .dat file (e.g. input.dat), which is the all-important LDA-C formatted input for the HDP algorithm; a .vocab file, which contains all of the unique words in the corpus (one word per line); and a .dmap file, which lists the input .txt documents.
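Before moving on, it's worth a quick sanity check that the three files line up with one another. Here's a minimal Python sketch (it assumes the files are named input.dat, input.vocab and input.dmap and sit in the current directory; adjust to suit):

# sanity_check.py - quick look at the files text2ldac produced
# Assumes input.dat / input.vocab / input.dmap are in the current directory.

with open('input.vocab') as f:
    vocab = [line.strip() for line in f]

with open('input.dmap') as f:
    docs = [line.strip() for line in f]

print(f'{len(vocab)} terms in vocabulary, {len(docs)} documents')

# Each line of the .dat file is one document in LDA-C format:
# [M] [term]:[count] ... where [term] indexes into the .vocab file.
with open('input.dat') as f:
    for doc_name, line in zip(docs, f):
        fields = line.split()
        m = int(fields[0])                               # number of unique terms
        pairs = [p.split(':') for p in fields[1:]]
        assert m == len(pairs), 'unique term count mismatch'
        total = sum(int(count) for _, count in pairs)
        print(f'{doc_name}: {m} unique terms, {total} words in total')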
Time to run HDP
Now that we have our data in the format required by the HDP algorithm, we're ready to apply the algorithm.
For convenience, I recommend copying the three files generated by text2ldac into the folder you're going to run HDP from, but you can leave them wherever you like.
Go to the folder containing the HDP program files and run the following command (again, adjusting to your own folder and filenames):
$ ./hdp --algorithm train --data /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat --directory train_dir
Let's unpack this a bit:
1. ./hdp invokes the HDP program
2. The --algorithm flag sets the algorithm to be applied, namely train
3. The path that follows the --data flag points to the .dat file produced by text2ldac
4. --directory train_dir tells HDP to place the output files in a directory called train_dir
You'll know you've successfully executed the program if the prompt begins printing something that looks like this:
Program starts with following parameters:
algorithm:= train
data_path:= /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat
directory:= trainer
max_iter= 1000
save_lag= 100
init_topics = 0
random_seed = 1488746763
gamma_a = 1.00
gamma_b = 1.00
alpha_a = 1.00
alpha_b = 1.00
eta = 0.50
#restricted_scans = 5
split-merge = no
sampling hyperparam = no

reading data from /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.dat
number of docs: 9
number of terms : 5795
number of total words : 35865

starting with 7 topics

iter = 00000, #topics = 0008, #tables = 0076, gamma = 1.00000, alpha = 1.00000, likelihood = -305223.54210
iter = 00001, #topics = 0008, #tables = 0079, gamma = 1.00000, alpha = 1.00000, likelihood = -301582.68017
iter = 00002, #topics = 0008, #tables = 0079, gamma = 1.00000, alpha = 1.00000, likelihood = -300273.98808
I'm not going to go into all of these parameters here, but the main one to note is max_iter, which sets the number of times the algorithm walks over the data.
Note also that the algorithm decides for itself how many topics it's going to work with (in the above example, it starts with 7).
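The parameter names in the printout appear to double as command-line flags, so something along these lines should cap the sampler at 500 iterations (treat the flag name as an assumption and check the program's usage message to confirm):

$ ./hdp --algorithm train --data input.dat --directory train_dir --max_iter 500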
The algorithm will produce a bunch of .dat files. The one we're really interested in should have a name like mode-topics.dat, rather than one like 00300-topics.dat (the latter being a snapshot saved on the 300th iteration of the algorithm's walk).
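Listing the output directory makes the pattern clear; with save_lag at its default of 100 you'd expect a snapshot every hundred iterations alongside the mode files (the listing below is indicative rather than exhaustive):

$ ls train_dir
00100-topics.dat  00200-topics.dat  00300-topics.dat  ...  mode-topics.dat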
Printing the topics determined by HDP
Wang & Blei very helpfully provided an R script, print.topics.r, which you can use to turn the results of the algorithm into a human-readable form. This is helpful because the raw output generated by the algorithm looks like this:
00001 00001 00007 00007 00007 00011 00000 00000 00000 00000
The key thing at this stage is to remember two key files you'll need as input for the R script: the mode-topics.dat file (or similar name) generated by HDP and the .vocab file generated by text2ldac.
Go back to the command line and navigate to the folder that contains print.topics.r. First, you'll need to make the R script executable, so run:
$ sudo chmod +x print.topics.r

Then run the script, passing in the arguments it needs (again, adjusting paths to your own setup):
$ ./print.topics.r /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/trainer/mode-topics.dat /Users/danielhoadley/Documents/Topic_Model/hdp/hdp/Second_run/input.vocab topics.dat 4
1. ./print.topics.r runs the R script
2. The first argument is the path to the mode-topics.dat file produced by the HDP algorithm
3. The second argument is the path to the .vocab file produced by text2ldac
4. The third argument is the name of the output file, in this case topics.dat
5. Finally, the fourth argument, which isn't mandatory, is the number of terms per topic you wish to output - the default is 5.
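As an aside, if you'd rather stay in Python for this last step, the same job can be done with a short script. The sketch below rests on my reading of the file format - namely, that each row of mode-topics.dat is a topic and each whitespace-separated column is a count against the corresponding line of the .vocab file - so treat it as illustrative rather than a drop-in replacement:

# print_topics.py - a rough Python stand-in for print.topics.r
# Usage: python print_topics.py mode-topics.dat input.vocab [n_terms]
import sys

topics_file, vocab_file = sys.argv[1], sys.argv[2]
n_terms = int(sys.argv[3]) if len(sys.argv) > 3 else 5  # default of 5 mirrors the R script

# one vocabulary term per line; line t is term t
with open(vocab_file) as f:
    vocab = [line.strip() for line in f]

# one topic per row; column t is (I'm assuming) the count for vocab term t
with open(topics_file) as f:
    for i, line in enumerate(f):
        counts = [int(c) for c in line.split()]
        # indices of the n_terms most frequent terms for this topic
        top = sorted(range(len(counts)), key=lambda t: counts[t], reverse=True)[:n_terms]
        print(f'Topic {i:02d}: ' + ' '.join(vocab[t] for t in top))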
If everything has worked as it ought to have done, you'll see a neat table of topics and their most prominent terms when you open the topics.dat file in RStudio.
The output is by no means perfect the first time around. Better results will probably depend on hitting the source data with a raft of stop words and tweaking HDP's many parameters, but it's a good start.