Scraping news websites and looking for specific words and phrases

This afternoon, my colleague and Transparency Project member, Paul Magrath, told me he was interested in finding out whether there's a way of systematically watching for a set of pre-defined "trigger words" of interest to the Transparency Project in online articles published by a selection of news organisations with a nasty habit of misreporting family court proceedings. 

I thought "that's a perfect job for Python" and sat down to write a basic proof of concept for Paul to take a look at. 

The code, which is here, iterates through an RSS feed on the Daily Mail's online site, reads each article by requesting the link for each item in the feed, and checks it against a list of pre-defined triggers (currently devised around an article about Myleene Klass, of all people). The output is written to a CSV file for review. 
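In outline, the proof of concept looks something like the sketch below. The feed URL, the trigger list, and the output filename are all illustrative, and the standard-library XML parser stands in here for a dedicated RSS library:

```python
import csv
import requests
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

# Illustrative trigger phrases - the real list would be devised by the Transparency Project
TRIGGERS = ["family court", "custody", "reporting restrictions"]

def find_triggers(text, triggers):
    """Return the trigger phrases that appear in the text (case-insensitive)."""
    lowered = text.lower()
    return [t for t in triggers if t.lower() in lowered]

def scan_feed(feed_url, out_csv="matches.csv"):
    """Walk each <item> in an RSS feed and log articles containing trigger phrases."""
    feed = ET.fromstring(requests.get(feed_url).content)
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url", "triggers"])
        for item in feed.iter("item"):
            title, link = item.findtext("title"), item.findtext("link")
            # Request the full article behind the feed item and strip the markup
            text = BeautifulSoup(requests.get(link).text, "html.parser").get_text()
            hits = find_triggers(text, TRIGGERS)
            if hits:
                writer.writerow([title, link, "; ".join(hits)])

if __name__ == "__main__":
    scan_feed("https://www.dailymail.co.uk/news/index.rss")  # illustrative feed URL
```

Matching on plain substrings keeps things simple for a proof of concept; anything smarter (word boundaries, stemming) can be layered on later.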

Here's the GitHub repo.

Statute Scraper

The following Python script can be used to scrape the full text of Public General Acts from legislation.gov.uk.


The script takes a list of URLs to individual pieces of legislation from a text file and processes each URL one by one. The text file needs to look something like this, with each target URL on a new line:
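For example, urls.txt might contain entries like these (illustrative legislation.gov.uk URLs):

```
http://www.legislation.gov.uk/ukpga/1998/42/contents
http://www.legislation.gov.uk/ukpga/2010/15/contents
```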


The Python script is simple enough. 

  • First, it opens urls.txt and reads the target URLs line by line
  • For each target URL, the title of the legislation is captured (this is used to name the output files)
  • Each URL is visited in turn and the contents of the relevant part of the HTML markup are scraped
  • The scraped material is written to a text file and a prettified HTML file.

# Environment

import io
import time
import requests
from bs4 import BeautifulSoup

# Get the text file with the URLs to be scraped and scrape the target section in each page

print("\n\nScraping URLs in urls.txt...\n\n")

with open('urls.txt') as inf:
    # Get each URL on each line in urls.txt
    urls = (line.strip() for line in inf)
    for url in urls:
        site = requests.get(url)
        soup = BeautifulSoup(site.text, "lxml")
        # Scrape the name of the legislation in each target URL for use when naming the output files
        legName = soup.find("h1", {"class": "pageTitle"})
        actTitle = legName.text
        print('Scraping ' + actTitle + ' ...\n')
        # Scrape stuff in <div id="viewLegContents"></div>
        for item in soup.find_all("div", {"id": "viewLegContents"}):
            # Write what we've scraped, with UTF-8 encoding, to a new text file - one file per URL
            with io.open(actTitle + '.txt', 'w', encoding='utf-8') as f:
                f.write(item.get_text())
            # Write a prettified copy of the scraped markup to an HTML file - one file per URL
            with io.open(actTitle + '.html', 'w', encoding='utf-8') as g:
                g.write(item.prettify())
        # Pause between requests to go easy on the server
        time.sleep(2)

print("\n\nDone! Files created.\n")

Respect the source of the data you're scraping

The people behind the site have done everyone a big favour in making their information so accessible. The least we can do is to be respectful of their servers when performing scraping tasks like this. If you plan on running this, I'd strongly urge you to break your urls.txt input into small chunks and, if you go on to reuse the data, remember to acknowledge the source of that data.
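One low-tech way to do that chunking is the standard `split` utility, which divides urls.txt into smaller files you can feed to the script one batch at a time (the chunk size and prefix here are just examples):

```shell
# Break urls.txt into files of 20 URLs each, named chunk_aa, chunk_ab, ...
split -l 20 urls.txt chunk_
```

You can then point the script at each chunk in turn, leaving a decent gap between runs.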