
Legislation.gov.uk Statute Scraper

The following Python script can be used to scrape the full text of Public General Acts from legislation.gov.uk.

Prerequisites

The script reads a list of URLs to individual pieces of legislation from a text file and processes each URL in turn. The text file should look something like this, with each target URL on its own line:

urls.txt

http://www.legislation.gov.uk/ukpga/2016/4 
http://www.legislation.gov.uk/ukpga/2016/3 
http://www.legislation.gov.uk/ukpga/2016/2 
http://www.legislation.gov.uk/ukpga/2016/1 
http://www.legislation.gov.uk/ukpga/2017/1 
http://www.legislation.gov.uk/ukpga/2017/2

The Python script is simple enough. 

  • First, it opens urls.txt and reads the target URLs line by line
  • For each target URL, the title of the legislation is captured (this is used to name the output files)
  • The contents of the relevant part of the HTML markup are scraped from each page in turn
  • The scraped material is written to a text file and a prettified HTML file.

Scraper.py

# Environment

import time

import requests
from bs4 import BeautifulSoup

# Get the text file with the URLs to be scraped and scrape the target section in each page

print("\n\nScraping URLs in urls.txt...\n\n")

with open('urls.txt') as inf:

    # Get each URL on each line in urls.txt
    urls = (line.strip() for line in inf)
    for url in urls:
        site = requests.get(url)
        soup = BeautifulSoup(site.text, "lxml")

        # Scrape the name of the legislation in each target URL for use when saving the output to a file
        for legName in soup.find_all("h1", {"class": "pageTitle"}):
            actTitle = legName.text
            print('Scraping ' + actTitle + ' ...\n')

        # Scrape the contents of <div id="viewLegContents"></div>
        for item in soup.find_all("div", {"id": "viewLegContents"}):

            # Write what we've scraped, with UTF-8 encoding, as text to a new text file - one file per URL
            with open(actTitle + '.txt', 'w', encoding='utf-8') as g:
                g.write(item.text)

            # Write what we've scraped to a prettified HTML file - one file per URL
            with open(actTitle + '.html', 'w', encoding='utf-8') as g:
                g.write(item.prettify())

        # Pause between requests to go easy on the server
        time.sleep(2)

print("\n\nDone! Files created.\n")
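A couple of things to note. The output files are named after the Act's title exactly as it appears on the page, so a title containing a character your filesystem won't accept in a filename would make the write fail; sanitising actTitle before using it as a filename is a sensible extra step. The time.sleep(2) pause between requests is there to keep the load on the server light, which brings us to the next point.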

Respect the source of the data you're scraping

The people behind legislation.gov.uk have done everyone a big favour in making their information so accessible. The least we can do is be respectful of their servers when performing scraping tasks like this. If you plan on running this script, I'd strongly urge you to break your urls.txt input into small chunks and, if you go on to reuse the data, remember to acknowledge its source.
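
To make chunking easier, here's a minimal sketch of one way to split a large urls.txt into smaller batch files before running the scraper over each batch in turn. The chunk size of 10 and the chunk_1.txt, chunk_2.txt naming are illustrative assumptions, not part of the script above.

# Split urls.txt into smaller batch files
CHUNK_SIZE = 10  # assumed batch size - adjust to taste

with open('urls.txt') as inf:
    urls = [line.strip() for line in inf if line.strip()]

for i in range(0, len(urls), CHUNK_SIZE):
    # Writes chunk_1.txt, chunk_2.txt, ... (hypothetical file names)
    with open('chunk_{}.txt'.format(i // CHUNK_SIZE + 1), 'w') as out:
        out.write('\n'.join(urls[i:i + CHUNK_SIZE]) + '\n')

Point the scraper at one chunk at a time (or rename each chunk to urls.txt) and leave a decent gap between runs.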