Text processing almost always requires that some words in the source corpus be removed before moving on to more complex tasks (such as keyword extraction, summarisation and topic modelling).
The words to be removed will typically include those that carry little semantic value of their own (e.g. the, it, a). The task at hand may also require additional, specialist words to be removed. This example uses NLTK to bring in a list of core English stopwords and then adds custom stopwords to that list.
from nltk.corpus import stopwords

# Bring in the default English NLTK stop words
stoplist = stopwords.words('english')

# Define additional stopwords in a string
additional_stopwords = """case judge judgment court"""

# Split the additional stopwords string on whitespace and add
# those words to the NLTK stopwords list
stoplist += additional_stopwords.split()

# Open a file and read it into memory
with open('sample.txt') as file:
    text = file.read()

# Apply the stoplist to the text
clean = [word for word in text.split() if word not in stoplist]
It's worth looking at a couple of discrete aspects of this code to see what's going on.
The stoplist object is storing the NLTK English stopwords as a list:
stoplist = stopwords.words('english')

print(stoplist)
>>> ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', ...]
Then, we define our additional stopwords in a single string object, additional_stopwords, and use split() to break that string down into a list of individual tokens:
stoplist += additional_stopwords.split()
The above line of code updates the original stoplist object with the additional stopwords.
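If NLTK isn't to hand, the same list-extension pattern can be sketched with a hand-built stoplist; the three words below are illustrative stand-ins for NLTK's much longer English list:

```python
# A hand-built stand-in for the NLTK English stopword list
stoplist = ['the', 'it', 'a']

# Additional domain-specific stopwords, as in the example above
additional_stopwords = """case judge judgment court"""

# split() breaks the string on whitespace; += extends the list in place
stoplist += additional_stopwords.split()

print(stoplist)
# ['the', 'it', 'a', 'case', 'judge', 'judgment', 'court']
```

Note that += on a list mutates it in place (it is equivalent to stoplist.extend(...)), so the original stoplist object now contains both the core and the custom stopwords.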
The text being passed in is a simple text file, which reads:
this is a case in which a judge sat down on a chair
When we pass the text through our list comprehension, our output is:
print(clean)
>>> ['sat', 'chair']
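The whole pipeline can be reproduced without sample.txt or the NLTK download by inlining the sentence and a cut-down stoplist; the short stopword list here is a stand-in for NLTK's full English list:

```python
# Stand-in stoplist: a few NLTK English stopwords plus our custom additions
stoplist = ['this', 'is', 'a', 'in', 'which', 'down', 'on']
stoplist += "case judge judgment court".split()

text = "this is a case in which a judge sat down on a chair"

# The same list comprehension as above: keep only words not in the stoplist
clean = [word for word in text.split() if word not in stoplist]

print(clean)
# ['sat', 'chair']
```

One caveat worth bearing in mind: this membership test is case-sensitive and does no punctuation stripping, so "This" or "case," would slip past the stoplist unless the text is lowercased and tokenised first.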