
Convert text files to lower case in Python

When mining text it often helps to convert the entire text to lower case as part of your pre-processing stage. Here's an example of a Python script that does just that for a directory containing one or more text files:

import os

directory = '/path/to/the/directory'

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        # Build the full path so the script works from any working directory
        path = os.path.join(directory, filename)
        with open(path, 'r') as f:
            text = f.read()

        # Write the lower-cased text back to the same file
        with open(path, 'w') as out:
            out.write(text.lower())
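The same job can be written more compactly with pathlib (a sketch; `lowercase_txt_files` is just a helper name of my own, and the directory is passed in as an argument):

```python
from pathlib import Path

def lowercase_txt_files(directory):
    # Rewrite every .txt file in `directory` in lower case
    for path in Path(directory).glob("*.txt"):
        path.write_text(path.read_text().lower())
```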

Overriding SSL verification in Python Launcher

I was running into a frustrating issue when trying to download a bunch of nltk test corpora in Python Launcher: the launcher kept saying it couldn't verify my SSL certificate, which meant I was unable to download the materials I wanted.

It turns out that the way around this problem is as follows:

1. Create an unverified SSL context

>>> import ssl
>>> ssl._create_default_https_context = ssl._create_unverified_context

2. Then run the download

>>> import nltk
>>> nltk.download()

3. Download the packages in Launcher
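The workaround in step 1 can be sanity-checked locally before running the download: the patched context factory should hand back contexts with certificate verification switched off (a sketch for checking only, not required for the actual download):

```python
import ssl

# Swap the default HTTPS context factory for one that skips certificate
# verification (only sensible for downloads you trust, like the NLTK corpora)
ssl._create_default_https_context = ssl._create_unverified_context

# Contexts produced by the patched factory have verification disabled
ctx = ssl._create_default_https_context()
print(ctx.check_hostname)                # False
print(ctx.verify_mode == ssl.CERT_NONE)  # True
```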
Text Summarisation with Gensim

A quick rundown of summarising texts with Gensim in Python 3.

Ensure the gensim module is installed (note that the summarization module was removed in Gensim 4.0, so you'll need a 3.x release). Here's the code to summarise a single text file:

from gensim.summarization import summarize
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

with open("telegraph.txt", "r") as f:
    text = f.read()

print(summarize(text))
print(summarize(text, word_count=100))
print(summarize(text, ratio=0.5))
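Gensim's summarize implements a variant of TextRank. Purely to illustrate the extractive idea, here's a much cruder frequency-scored sketch of my own (this is a toy, not Gensim's algorithm): each sentence is scored by the total frequency of its words, and the top-scoring sentences are kept in their original order.

```python
import re
from collections import Counter

def naive_summarize(text, n_sentences=2):
    # Crude sentence split on ., ! and ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    # Score each sentence by the total corpus frequency of its words
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
        reverse=True,
    )

    # Keep the top sentences, restored to their original order
    keep = sorted(scored[:n_sentences])
    return " ".join(sentences[i] for i in keep)
```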

Test output for Donoghue v Stevenson, with the summary constrained to 100 words (I personally think the summariser has done an excellent job: it has worked out that the final paragraph of Lord Atkin's speech provides the best summary of the judgment!):

My Lords, if your Lordships accept the  view that this pleading discloses a relevant cause of action you will be affirming the proposition that by Scots and English law alike a manufacturer of products, which he sells in such a form as to show that he intends them to reach the ultimate consumer in the form in which they left him with no reasonable possibility of intermediate examination, and with the knowledge that the absence of reasonable care in the preparation or putting up of the products will result in an injury to the consumer's life or property, owes a duty to the consumer to take that reasonable care.

Postscript

I chained this summary into RAKE to run a quick keyword extraction over it. The RAKE parameters were as follows (in the implementation I was using: stop list file, minimum characters per word, maximum words per phrase and minimum keyword frequency):

rake_object = rake.Rake("smartstoplist.txt", 5, 3, 4)

The output was a spot on extraction:

[('reasonable care', 4.0), ('consumer', 1.3333333333333333), ('products', 1.0)]
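RAKE's scoring is simple enough to sketch: candidate phrases are the runs of words between stopwords, each word scores degree divided by frequency, and a phrase scores the sum of its word scores. Here's a toy version of my own (with a made-up stopword list rather than the SMART list the post uses, and no punctuation-boundary handling):

```python
import re
from collections import defaultdict

# A toy stopword list; the post uses the SMART list (smartstoplist.txt)
STOPWORDS = {"the", "of", "a", "an", "to", "that", "in", "and", "is", "with"}

def rake_scores(text):
    # Candidate phrases are runs of words between stopwords
    words = re.findall(r"[a-zA-Z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    # Each word scores degree / frequency, where degree counts the words
    # it co-occurs with inside candidate phrases (itself included)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}

    # A phrase scores the sum of its word scores
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
```

Multi-word phrases naturally outscore single words, which is why 'reasonable care' tops the output above.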

Newspaper on Python

Newspaper is an excellent Python module for extracting and parsing newspaper articles. I took the module for a very quick test drive today and wanted to document my initial findings, primarily as an aide-mémoire.

Import Newspaper

Assuming Newspaper is installed (in my case, Newspaper3k on Python 3; the package imports as newspaper), start off by importing the module:

import newspaper

Set the target paper

In my test, I wanted to look at articles published in the Law section of the Guardian. The first step was to build the newspaper object, like so:

law = newspaper.build('https://www.theguardian.com/law')

To check the target was working, I called law.size(), which reported 438 articles.

Extract an article

For test purposes, I just wanted to pull down a recent article, using the following code (this technically fetches the second most recent article rather than the first but, somewhat confusingly, the result appears to be the most recent piece anyway):

first_article = law.articles[1] 
first_article.download()

The first line stores the article in a variable called first_article. The second line downloads the article stored in that variable.

Printing the result with print(first_article.html) just spews the entire HTML out to the console, which isn't very helpful. But the brilliant thing about Newspaper is that it allows us to parse the article and then run some simple natural language processing against it.

Parse the article

Now that we've downloaded the article, we're in a position to parse it:

first_article.parse()

This in turn enables us to target specific sections of the article, like the body text, title or author. Here's how to scrape body text:

print(first_article.text)

This will print only the body text to the console. 

Write the body text to a file

The body text isn't much use to us sitting in the console output, so let's write the output of first_article.text to a file. A context manager ensures the file is closed properly:

with open('article.txt', 'w') as f:
    f.write(first_article.text)

Done!
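If you're saving several articles, writing them all to article.txt overwrites the previous one each time. A small helper can derive the filename from the article's title instead (save_article_text is a hypothetical helper of my own, not part of the Newspaper API):

```python
import re

def save_article_text(article, directory="."):
    # Turn the title into a filesystem-friendly slug, e.g.
    # "Legal aid cuts criticised!" -> "legal-aid-cuts-criticised"
    slug = re.sub(r"[^a-z0-9]+", "-", article.title.lower()).strip("-")
    path = f"{directory}/{slug}.txt"
    with open(path, "w") as f:
        f.write(article.text)
    return path
```

Calling save_article_text(first_article) after parse() writes the body text to a file named after the headline.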

Basic Text Mining in R

I've decided to spend some time exploring text mining using the R programming language.

For anyone interested in playing around with text mining in R, this is a great tutorial to work through. Obviously, you'll need R itself, which can be downloaded from the R Project website.

Additional Resources (added as and when I come across them)

  1. A Survival Guide to Data Science with R, Graham Williams (http://togaware.com/onepager/)
  2. A Gentle Introduction to Topic Modelling using R, Eight2Late (https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/)