Tokenize text file into sentences with Python

I recently needed to split a document into sentences in a way that handled most, if not all, of the annoying edge cases. After a frustrating period trying to get a snippet I found on Stack Overflow to work, I finally figured it out:

import codecs
import nltk.data

# Read the file as UTF-8
doc = codecs.open('path/to/text/file/text.txt', 'r', 'utf-8')
content = doc.read()
doc.close()

# Load NLTK's pre-trained Punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

print('\n-----\n'.join(tokenizer.tokenize(content)))
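To see why a trained tokenizer like Punkt is worth the setup, here's a quick illustration of the edge cases that trip up naive splitting. This is a standard-library sketch, not part of the snippet above, and the abbreviation list is hard-coded purely for demonstration (Punkt learns these boundaries from data instead):

```python
import re

text = "Dr. Smith arrived at 5 p.m. yesterday. He was late."

# Naive approach: split after every full stop. Breaks on 'Dr.' and 'p.m.'
naive = re.split(r'(?<=\.)\s+', text)

# Hand-rolled fix: only split before a capital letter, and skip known abbreviations
ABBREVIATIONS = {'Dr.', 'Mr.', 'Mrs.', 'a.m.', 'p.m.'}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r'\.\s+(?=[A-Z])', text):
        candidate = text[start:match.end()].strip()
        if candidate.split()[-1] in ABBREVIATIONS:
            continue  # the full stop belongs to an abbreviation, not a sentence end
        sentences.append(candidate)
        start = match.end()
    sentences.append(text[start:].strip())
    return sentences
```

The naive split produces four fragments here; the abbreviation-aware version gets the two sentences right. Every new abbreviation you meet means another entry in the list, which is exactly the maintenance burden Punkt removes.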

Text Summarisation with Gensim

A quick rundown of summarising texts with Gensim in Python 3.

Ensure the gensim module is installed. Here's the code to summarise a single text file:

from gensim.summarization import summarize
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

f = open("telegraph.txt", "r")
text = f.read()
f.close()

# Summarise to roughly 100 words, then to half the original length
print(summarize(text, word_count=100))
print(summarize(text, ratio=0.5))
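Under the hood, gensim's summarize is extractive: it ranks whole sentences (using a TextRank-style graph algorithm) and keeps the best ones rather than generating new text. The basic idea can be sketched with a much cruder frequency-based scorer. To be clear, this is just an illustration of extractive scoring, not gensim's actual algorithm:

```python
from collections import Counter

def frequency_summary(sentences, n=1):
    """Keep the n sentences whose words are most frequent across the whole
    text; a crude stand-in for gensim's graph-based TextRank ranking."""
    def tokens(s):
        return [w.lower().strip('.,;') for w in s.split()]

    freq = Counter(w for s in sentences for w in tokens(s))

    def score(s):
        # Average word frequency, so long sentences aren't favoured unfairly
        return sum(freq[w] for w in tokens(s)) / len(tokens(s))

    chosen = set(sorted(sentences, key=score, reverse=True)[:n])
    return [s for s in sentences if s in chosen]  # preserve original order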
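Under the hood, gensim's summarize is extractive: it ranks whole sentences (using a TextRank-style graph algorithm) and keeps the best ones rather than generating new text. The basic idea can be sketched with a much cruder frequency-based scorer. To be clear, this is just an illustration of extractive scoring, not gensim's actual algorithm:

```python
from collections import Counter

def frequency_summary(sentences, n=1):
    """Keep the n sentences whose words are most frequent across the whole
    text; a crude stand-in for gensim's graph-based TextRank ranking."""
    def tokens(s):
        return [w.lower().strip('.,;') for w in s.split()]

    freq = Counter(w for s in sentences for w in tokens(s))

    def score(s):
        # Average word frequency, so long sentences aren't favoured unfairly
        return sum(freq[w] for w in tokens(s)) / len(tokens(s))

    chosen = set(sorted(sentences, key=score, reverse=True)[:n])
    return [s for s in sentences if s in chosen]  # preserve original order
```

Sentences sharing vocabulary with the rest of the document score highest, which is why the summariser below homes in on the paragraph that restates the whole judgment.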

Test output for Donoghue v Stevenson with the summary constrained to 100 words (I personally think the summariser has done an excellent job - it's calculated that the final paragraph of Lord Atkin's speech provides the best summary of the judgment!):

My Lords, if your Lordships accept the  view that this pleading discloses a relevant cause of action you will be affirming the proposition that by Scots and English law alike a manufacturer of products, which he sells in such a form as to show that he intends them to reach the ultimate consumer in the form in which they left him with no reasonable possibility of intermediate examination, and with the knowledge that the absence of reasonable care in the preparation or putting up of the products will result in an injury to the consumer's life or property, owes a duty to the consumer to take that reasonable care.


I chained this summary into RAKE to run a quick keyword extraction over the summary. The RAKE parameters were as follows:

rake_object = rake.Rake("smartstoplist.txt", 5, 3, 4)

The output was a spot-on extraction:

[('reasonable care', 4.0), ('consumer', 1.3333333333333333), ('products', 1.0)]
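For context, the Rake constructor arguments here (assuming the widely used aneesha/RAKE implementation, which this signature matches) are the stoplist file, a minimum word length of 5 characters, a maximum phrase length of 3 words, and a minimum keyword frequency of 4. The scoring itself is simple enough to sketch from the original Rose et al. (2010) paper: candidate phrases are runs of non-stopwords, each word scores degree/frequency, and a phrase scores the sum of its words. The stoplist below is a tiny hard-coded stand-in for smartstoplist.txt, and the sketch omits the length/frequency filters the real constructor applies:

```python
import re
from collections import defaultdict

# Tiny illustrative stoplist; a real run would load smartstoplist.txt
STOPWORDS = {'a', 'an', 'the', 'of', 'to', 'that', 'no', 'take', 'will', 'in'}

def rake_scores(text):
    """Minimal RAKE scoring: phrases are runs of non-stopwords between
    stopwords/punctuation; word score = degree / frequency; a phrase
    scores the sum of its member words' scores."""
    phrases = []
    for fragment in re.split(r'[.,;:!?]', text.lower()):
        current = []
        for word in re.findall(r'[a-z]+', fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # words it co-occurs with, incl. itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {' '.join(p): sum(word_score[w] for w in p) for p in phrases}
```

Words that keep appearing inside longer phrases pick up high degree scores, which is how a two-word phrase like "reasonable care" ends up ranked above single words such as "consumer".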