Newspaper is an excellent Python module for extracting and parsing newspaper articles. I took the module for a very quick test drive today and wanted to document my initial findings, primarily as an aide-mémoire.
Assuming Newspaper is installed as a Python module (in my case I'm using Newspaper3k on Python 3), start off by importing the module:
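A minimal sketch of that import, assuming Newspaper3k was installed with pip install newspaper3k (the package is imported under the name newspaper). The try/except wrapper and the newspaper_available flag are my own additions, not part of the library; they just let the script fail gently if the module is missing:

```python
# Newspaper3k is imported under the name "newspaper".
try:
    import newspaper
    newspaper_available = True
except ImportError:
    # Hypothetical fallback flag so later code can check before using the module.
    newspaper = None
    newspaper_available = False
    print("Newspaper not found; try: pip install newspaper3k")
```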
Set the target paper
In my test, I wanted to look at articles published in the Law section of the Guardian. The first step was to build the newspaper object, like so:
law = newspaper.build('https://www.theguardian.com/law')
To check the target was working, I called law.size(), which reported 438 articles.
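Put together, the build-and-count step might look like the sketch below. build_and_count and its builder parameter are hypothetical names of mine (the parameter only exists so the function can be tried without hitting the network); newspaper.build() and size() are the calls described above. As far as I can tell, Newspaper remembers articles it has already seen, so a repeat run can report a smaller number unless memoize_articles=False is passed to build().

```python
def build_and_count(url, builder=None):
    """Build a source for `url` and return (source, article_count)."""
    if builder is None:
        # Imported lazily so the function can be defined without Newspaper installed.
        import newspaper
        # newspaper.build(url, memoize_articles=False) would disable the article cache.
        builder = newspaper.build
    source = builder(url)
    return source, source.size()


# Usage (needs network access):
# law, count = build_and_count('https://www.theguardian.com/law')
# print(count)
```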
Extract an article
For test purposes, I just wanted to extract a recent article, using the following code (this technically pulls down the second article in the list rather than the first but, somewhat confusingly, the result appears to be the most recent piece anyway!):
first_article = law.articles[1]
first_article.download()
The first line stores the first article in a variable called first_article. The second line downloads the article stored in that variable.
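Because it isn't obvious which index maps to which piece, listing the first few URLs before downloading can help. This is only a sketch: preview_urls and download_article are hypothetical helper names, and they assume the source object exposes an articles list whose items have a url attribute and a download() method, as Newspaper's do.

```python
def preview_urls(source, limit=5):
    """Return the URLs of the first `limit` articles, in list order."""
    return [article.url for article in source.articles[:limit]]


def download_article(source, index=1):
    """Download the article at `index` and return it."""
    articles = source.articles
    if not 0 <= index < len(articles):
        raise IndexError(f"source has only {len(articles)} articles")
    article = articles[index]
    article.download()  # fetches the raw HTML into article.html
    return article


# Usage, continuing from the build step:
# print(preview_urls(law))
# first_article = download_article(law, index=1)
```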
Printing the result with print(first_article.html) just spews the entire HTML out to the console, which isn't very helpful. But the brilliant thing about Newspaper is that it allows us to parse the article and then run some simple natural language processing against it.
Parse the article
Now that we've downloaded the article, we're in a position to parse it:
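The parse step itself is a single method call. A minimal sketch, where parse_article is a hypothetical wrapper of mine and the argument is assumed to be an already-downloaded Newspaper article:

```python
def parse_article(article):
    """Parse a downloaded article in place and return it for convenience."""
    # parse() works on the HTML fetched by download() and fills in
    # fields such as the title, authors and body text.
    article.parse()
    return article


# Usage, continuing from the download step:
# first_article = parse_article(first_article)
```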
This in turn enables us to target specific sections of the article, like the body text, title or author. Here's how to scrape body text:
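A sketch of pulling out those sections once the article has been parsed. article_fields is a hypothetical helper; text, title and authors are the attribute names Newspaper3k uses, though how well they are filled in depends on how cleanly the page parses.

```python
def article_fields(article):
    """Collect the main parsed fields of an article into a dict."""
    return {
        "title": article.title,
        "authors": article.authors,  # a list of author names
        "text": article.text,        # the body text only
    }


# Usage, once first_article has been downloaded and parsed:
# print(article_fields(first_article)["text"])
# or, more directly:
# print(first_article.text)
```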
This will print only the body text to the console.
Write the body text to a file
The body text isn't that helpful to us sitting there in the console output, so let's write the output of first_article.text to a file:
First off, import sys
f = open('article.txt', 'w')
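One way to finish writing the text out, as a sketch: a with block closes the file automatically once the write is done. save_text is a hypothetical helper name, and the utf-8 encoding is my assumption rather than anything Newspaper requires.

```python
def save_text(text, path="article.txt"):
    """Write `text` to `path`, overwriting any existing file, and return the path."""
    # The with block closes the file even if the write raises an error.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path


# Usage, once first_article has been parsed:
# save_text(first_article.text)
```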