A question popped up on Stack Overflow today asking how to use the NLTK library to tokenise text into bigrams. The question was as follows:
Suppose I want to generate bigrams for the word 'single'. Then the output should be the list ['si','in','ng','gl','le'].
I am new to language processing in Python. Can someone guide me?
Tokenising text into n-grams using NLTK is pretty well documented, and a whole raft of similar questions can be found on Stack Overflow. However, I think the question was marked as a duplicate a tad too hastily.
Virtually all of the answers to n-gram related questions are aimed at tokenising a string consisting of multiple words, e.g.:
myString = "This is a string with nine words in it"
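For context, here's a minimal sketch of what those multi-word answers typically do: split the string on whitespace and pair each word with its successor. (Plain Python is used here for illustration; `nltk.bigrams(words)` yields the same tuples.)

```python
myString = "This is a string with nine words in it"

# The usual word-level case: tokenise on whitespace, then pair neighbours.
words = myString.split()
word_bigrams = list(zip(words, words[1:]))

print(word_bigrams[:3])  # [('This', 'is'), ('is', 'a'), ('a', 'string')]
```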
The string in the question consisted of only one word. The question was really about producing bigrams from the characters that make up a single word, which is a bit different.
Here's one (not necessarily elegant) answer to the question:
import nltk

myString = 'single'

# Insert a space in between each character in myString
spaced = ''
for ch in myString:
    spaced = spaced + ch + ' '

# Generate bigrams out of the new spaced string
# (split() with no arguments also discards the trailing empty token)
tokenized = spaced.split()
myList = list(nltk.bigrams(tokenized))

# Join the items in each tuple in myList together and put them in a new list
Bigrams = []
for i in myList:
    Bigrams.append(''.join(i))

print(Bigrams)

This will output:

['si', 'in', 'ng', 'gl', 'le']
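Since a Python string is itself a sequence of characters, the space-insertion workaround can arguably be skipped entirely; a plain-Python sketch that produces the question's exact output by pairing each character with its successor:

```python
def char_bigrams(word):
    # zip the word against itself shifted by one character,
    # then glue each pair back together into a two-letter string
    return [a + b for a, b in zip(word, word[1:])]

print(char_bigrams('single'))  # ['si', 'in', 'ng', 'gl', 'le']
```

Because nltk.bigrams accepts any sequence, passing the string straight in (nltk.bigrams('single')) gives the same pairs as character tuples.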