Find most common words in a corpus

This little sample demonstrates several basic text processing steps with a corpus of text files stored in a local directory. 

  • First, we read the corpus of text files into a list
  • Second, we knock out unwanted stuff, like things that aren't actually words and words that only consist of a single character
  • Third, we use standard NLTK stopwords and a list of custom stopwords, to strip out noise from the corpus
  • Finally, we use NLTK to calculate the most common words in each file in the corpus
from __future__ import division
import glob
import os
import re
from nltk import FreqDist
from nltk.corpus import stopwords

# Bring in the default English NLTK stop words
stoplist = stopwords.words('english')

# Define additional stopwords in a string
additional_stopwords = """case law lawful judge judgment court mr justice would evidence mr order 
defendant act decision make two london appeal section lord one applicant mr. may could said also application whether 
made time first r miss give appellant november give fact within point sentence question upon matter 
leave part given must notice public state taken course cannot circumstances j that, offence set 
behalf however plaintiff see set say secretary regard - v claim right appeared second put e way material
view relation effect take might particular however, present court) october b reasons basis far 
referred trial found lord, land consider authority subject necessary considered 0171 see,s 
council think legal shall respect ground three case, crown without 2 relevant and, special business told clear
paragraph person account letter therefore jury th solicitor use years mrs mr provision discretion
matters respondent concerned cases defence reason issue well count argument facts gave proceedings 
position period needs approved used power us limited even either exercise counsel applicants submission
although counsel submitted st need appellants plaintiffs policy thomas making tribunal action entitled affadavit
december strand daniel transcript smith purpose refused offence offences general counts terms grounds conclusion number reasonable 
prosecution home hearing seems defendants educational clarke solicitors criminal following accept place come
already accepted required words local l;ater january provided stage report street september day sought greenwood
rather service accounts page hobhouse courts march third wilcock mind result months came learned appropriate date instructed
form division notes july went bernal official review principle consideration affidavit held lordship another dr different
notes quite royal possible instructed shorthand development amount has months wc respondents took clearly since find
satisfied members later fleet took interest parties name change information co sum ec done provisions party hd paid
"""

# Split the additional stopwords string into individual words and then add
# those words to the NLTK stopwords list
stoplist += additional_stopwords.split()

# Define the files that make up the corpus to be modelled

file_list = glob.glob(os.path.join('/Users/danielhoadley/PycharmProjects/topicvis', '*.txt'))

# Construct an empty list into which the content of each file will be stored as an item

corpus = []

# Read the files

for file_path in file_list:
    with open(file_path) as f_input:
        content = f_input.read()
        only_words = re.sub("[^a-zA-Z]", " ", content) # Remove anything that isn't a 'word'
        no_single = re.sub(r'\b[a-zA-Z]\b', ' ', only_words).strip() # Remove any words consisting of a single character
        corpus.append(no_single) # Store the cleaned text as one item in the corpus

# Remove stopwords

texts = [[word for word in document.lower().split() if word not in stoplist] for document in corpus]

# Get the most common words in each text

for text in texts:
    fdist = FreqDist(text)
    print(fdist.most_common(2))
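
To try the same steps without a directory of files, here is a minimal sketch that runs the pipeline on two in-memory strings. It swaps NLTK's FreqDist for collections.Counter from the standard library, and the sample sentences, the tiny stop list and the function name most_common_words are all made up for illustration.

```python
import re
from collections import Counter


def most_common_words(document, stoplist, n=2):
    """Clean one document and return its n most frequent words."""
    # Replace anything that isn't a letter with a space
    only_words = re.sub("[^a-zA-Z]", " ", document)
    # Drop words that consist of a single character
    no_single = re.sub(r"\b[a-zA-Z]\b", " ", only_words)
    # Lowercase, split on whitespace and filter out the stopwords
    words = [w for w in no_single.lower().split() if w not in stoplist]
    return Counter(words).most_common(n)


# Hypothetical stand-ins for the file contents and the stop list
documents = [
    "The court held that the contract was void. The contract was never signed.",
    "A tenant's lease expired; the lease was not renewed by the landlord.",
]
sample_stoplist = {"the", "that", "was", "by", "not", "never"}

for document in documents:
    print(most_common_words(document, sample_stoplist))
```

Using a set for the stop list keeps each membership test cheap, which starts to matter once the list grows as long as the custom one above.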

[StackOverflow] how can I generate bigrams for words using NLTK python library?

A question popped up on Stack Overflow today asking how to use the NLTK library to tokenise text into bigrams. The question was as follows:

Suppose I want to generate bigrams for the word "single". Then the output should be a list ['si','in','ng','gl','le'].

I am new to language processing in python. Can someone guide me?

Tokenising text into n-grams using NLTK is pretty well documented and a whole raft of similar questions can be found on Stack Overflow. However, I think the question was marked as a duplicate a tad too hastily.

Virtually all of the answers to n-gram related questions are directed at tokenising a string consisting of multiple words, e.g.:

myString = "This is a string with nine words in it"

The string in the question consisted of only one word. The question was really about producing bigrams from the characters that make up a single word, which is a bit different. 

Here's one (not necessarily elegant) answer to the question:

import nltk

myString = 'single'

# Insert a space between each character in myString

spaced = ''
for ch in myString:
    spaced = spaced + ch + ' '

# Generate bigrams out of the new spaced string

tokenized = spaced.split(" ")
myList = list(nltk.bigrams(tokenized))

# Join the items in each tuple in myList together and put them in a new list

Bigrams = []

for i in myList:
    Bigrams.append((''.join([w + ' ' for w in i])).strip())

print(Bigrams)

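This will output the following. Note that each bigram keeps a space between its two characters, and the final 'e' is left over from the trailing space added to the spaced string: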

['s i', 'i n', 'n g', 'g l', 'l e', 'e']
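
If the goal is the exact list from the question, ['si', 'in', 'ng', 'gl', 'le'], a shorter route is to pair each character with the one that follows it. The sketch below uses plain Python rather than NLTK, and the function name char_bigrams is made up for illustration.

```python
def char_bigrams(word):
    """Return the overlapping two-character substrings of word."""
    # zip pairs each character with its successor and stops at the shorter input
    return [first + second for first, second in zip(word, word[1:])]


print(char_bigrams('single'))
```

A string is itself a sequence of characters, so passing it straight to nltk.bigrams should yield the same pairs as tuples, without the spacing step.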

Tokenize text file into sentences with Python

I recently needed to split a document into sentences in a way that handled most, if not all, of the annoying edge cases. After a frustrating period trying to get a snippet I found on Stack Overflow to work, I finally figured it out:

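import codecs
import nltk.data

# Read the document as UTF-8 text
doc = codecs.open('path/to/text/file/text.txt', 'r', 'utf-8')
content = doc.read()
doc.close()

# Load the pre-trained Punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

print('\n-----\n'.join(tokenizer.tokenize(content)))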
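
To see the kind of edge case that makes a trained tokenizer worthwhile, here is a small sketch of the naive alternative: splitting after every full stop that is followed by whitespace. The sample sentence and the function name naive_sentences are invented for illustration, and this is not how the NLTK tokenizer works.

```python
import re


def naive_sentences(text):
    """Split text after every full stop that is followed by whitespace."""
    return re.split(r'(?<=\.)\s+', text)


sample = "Mr. Smith appeared for the appellant. The appeal was dismissed."

# The abbreviation "Mr." is wrongly treated as the end of a sentence
for sentence in naive_sentences(sample):
    print(sentence)
```

The sample contains two sentences, but the naive split returns three pieces because it cannot tell an abbreviation from a sentence boundary.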

Overriding SSL verification in Python Launcher

I was running into a frustrating issue when trying to download a bunch of NLTK test corpora in Python Launcher: the launcher kept saying it could not verify my SSL certificate, which meant that I was unable to download the materials I wanted.

It turns out that the way around this problem is as follows:

1. Create an unverified SSL context

>>> import ssl
>>> ssl._create_default_https_context = ssl._create_unverified_context

2. Then run the download

>>> import nltk
>>> nltk.download()

3. Download the packages in Launcher
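
The same override can be used from a script. The sketch below wraps it in a context manager so that certificate verification is only switched off for the duration of the download and is restored afterwards. The name unverified_https is made up for illustration, and the two ssl attributes it touches are the private ones used in step 1, so they may change between Python versions.

```python
import ssl
from contextlib import contextmanager


@contextmanager
def unverified_https():
    """Temporarily make HTTPS requests skip certificate verification."""
    original = ssl._create_default_https_context
    ssl._create_default_https_context = ssl._create_unverified_context
    try:
        yield
    finally:
        # Put the default (verifying) context factory back
        ssl._create_default_https_context = original


# Usage (assumes NLTK is installed):
# import nltk
# with unverified_https():
#     nltk.download('stopwords')
```

Restoring the original context in the finally clause means any later HTTPS request in the same session is verified as usual, even if the download raises an error.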