XML

Convert XML to JSON

I recently needed to converse a large collection of fairly complex XML files into JSON format to make them ready for leading into an Elasticsearch cluster.

After a bit of searching around, I found a really nice Python library called xmltodict, which worked perfectly.

You can download xmltodict using pip:

$ pip3 install xmltodict

xmltodict worked well with all the XML files I threw at it, so I'm not going to talk about the form and structure of the XML I was working with. The code below merely assumes that you have directory consisting of XML files and that you'd like to convert them to JSON format and write them to a new JSON file.

import xmltodict, json
import os

#path to the folder holding the XML
directory = '/path/to/folder'

#iterate over the XML files in the folder
for filename in os.listdir(directory):
    if filename.endswith(".xml"):
        f = open(filename)

        XML_content = f.read()

        #parse the content of each file using xmltodict
        x = xmltodict.parse(XML_content)
        j = json.dumps(x)

        print (filename)

        filename = filename.replace('.xml', '')
        output_file = open(filename + '.json', 'w')
        output_file.write(j)

 

 

Strip XML tags out of file

This is a quick and dirty example of using a regular expression to remove XML tags from an an XML file. 

Suppose we have the following XML, sample.xml:

<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

What we want to do is use Python to strip out the XML element tags, so that we're left with something like this:

Tove
Jani
Reminder
Don't forget me this weekend!

Here's how to do it:

import re

text = re.sub('<[^<]+>', "", open("sample.xml").read())
with open("output.txt", "w") as f:
    f.write(text)

Converting XML to CSV

XMLutils is a neat little Python package for converting XML to various file formats like CSV and JSON. The particularly useful thing about is that it can be executed from the command line, which makes it quick and easy to start using. 

Installation

I installed XMLutils at the command line using:

sudo easy_install XMLutils

Using XMLutils at the command line

I had a sample XML file that I wanted to convert to CSV format. There was a lot in the XML file that I didn't really want going into the output CSV file, so I executed the following command:

$ xml2csv --input "/Users/danielhoadley/Library/Mobile Documents/com~apple~CloudDocs/Documents/Development/Stuff/Dockets/10519[DK]Davies_v_Davies_(msb)(sub-bpw).xml" --output "test.csv" --tag "CaseInfo" --ignore "CaseMain" "AllNCit" "AllECLI" "TempIxCardNo" "FullReportName" "AltName" "CaseJoint_IxCardNo_TempIxCardNo_FullReportName_CaseName" "LegalTopics" "Reportability"
  • xml2csv invokes the converter
  • Declare the input XML file with --input followed by the path to the file
  • Declare the output CSV file with --output followed by the path to output file
  • Declare the XML node that represents a record in the input file (in my case, the node was CaseInfo)

Running a command like this will be sufficient to do a straight conversion to CSV:

$ xml2csv --input "/Users/danielhoadley/Library/Mobile Documents/com~apple~CloudDocs/Documents/Development/Stuff/Dockets/10519[DK]Davies_v_Davies_(msb)(sub-bpw).xml" --output "test.csv" --tag "CaseInfo"

However, as I've said, there was quite a lot in the input file that I wanted to ignore. Ignoring tags is pretty straightforward: simply declare the tags you want to ignore after that --ignore flag, e.g:

--ignore "CaseMain" "AllNCit" "AllECLI" "TempIxCardNo" "FullReportName" "AltName" "CaseJoint_IxCardNo_TempIxCardNo_FullReportName_CaseName" "LegalTopics" "Reportability"

Note

Remember to enclose the names of tags in quotes!