Load JSON files into Elasticsearch

The following example provides a simple guide for loading JSON files into Elasticsearch using the official elasticsearch API in Python.

Import dependencies

import requests, json, os
from elasticsearch import Elasticsearch

Set the path to the directory containing the JSON files to be loaded

directory = '/path/to/files/'

Connect to the Elasticsearch server

By default, the Elasticsearch instance will listen on port 9200

res = requests.get('http://localhost:9200')
print (res.content)
es = Elasticsearch([{'host': 'localhost', 'port': '9200'}])

Create an index value object

Because I want Elasticsearch to use a bog-standard integer at the unique _id for each document being loaded, I'm setting this up now outside the for loop I'm going to use to interate over the JSON files for loading into Elasticsearch

i = 1

Iterate over each JSON file and load it into Elasticsearch

for filename in os.listdir(directory):
    if filename.endswith(".json"):
        f = open(filename)
        docket_content = f.read()
        # Send the data into es
        es.index(index='myindex', ignore=400, doc_type='docket', 
        id=i, body=json.loads(docket_content))
        i = i + 1

There are a few things worth pointing out here:

  • index= is the name of the index we're creating, this can be anything you like
  • ignore=400 is flagging that I want to loader to ignore instances in which Elasticsearch is complaining about the format of any of the fields in the source JSON data (date fields, I get the feeling, are a commom offender here). I just want to Elasticsearch to receive the data as is without second guessing it.
  • doc_type is just a label we're assigning to each document being loaded
  • id=i is the unique index value being assigned to each document as it's loaded. You can leave this out as Elasticsearch will apply its own id.

Convert XML to JSON

I recently needed to converse a large collection of fairly complex XML files into JSON format to make them ready for leading into an Elasticsearch cluster.

After a bit of searching around, I found a really nice Python library called xmltodict, which worked perfectly.

You can download xmltodict using pip:

$ pip3 install xmltodict

xmltodict worked well with all the XML files I threw at it, so I'm not going to talk about the form and structure of the XML I was working with. The code below merely assumes that you have directory consisting of XML files and that you'd like to convert them to JSON format and write them to a new JSON file.

import xmltodict, json
import os

#path to the folder holding the XML
directory = '/path/to/folder'

#iterate over the XML files in the folder
for filename in os.listdir(directory):
    if filename.endswith(".xml"):
        f = open(filename)

        XML_content = f.read()

        #parse the content of each file using xmltodict
        x = xmltodict.parse(XML_content)
        j = json.dumps(x)

        print (filename)

        filename = filename.replace('.xml', '')
        output_file = open(filename + '.json', 'w')



Convert JSON to CSV with plain Javascript

I've been exploring the excellent API provided by the Canadian case law database, CanLII, and needed to quickly convert the JSON I was puling back to CSV. 

The following code, which has been tailored to suit the structure of the data coming back from the API, got the job done:

// Include dependencies

var json2csv = require('json2csv');
var fs = require('fs');

// Set up columns in the CSV
var fields = ['databaseId', 'caseId.en', 'title', 'citation'];

// Give it the data

var cases = [
        "databaseId": "csc-scc",
        "caseId": {
        "en": "2016scc56"
        "title": "Canada (Attorney General) v. Fairmont Hotels Inc.",
        "citation": "2016 SCC 56 (CanLII)"
        "databaseId": "csc-scc",
        "caseId": {
        "en": "2016scc55"
        "title": "Jean Coutu Group (PJC) Inc. v. Canada (Attorney General)",
        "citation": "2016 SCC 55 (CanLII)"

var csv = json2csv({ data: cases, fields: fields });

fs.writeFile('cases.csv', csv, function(err) {
    if (err) throw err;
        console.log('Case list converted and saved as CSV!!');

Prettifying JSON

Useful resources for pretty-printing and uglifying JSON at the command line can be found here

A particularly useful method is to use Ruby's ppjson.

To use ppjson:

1. Install ppjson at the command line:

gem install ppjson

2. To pretty print the JSON and write the prettified version back to the file (as opposed to having it merely pour into the terminal console), run:

ppjson -fi abc123.json