MongoDB

Insert a dictionary into MongoDB

MongoDB and Python work well together because MongoDB's BSON document structure closely mirrors a Python dictionary: both are collections of key/value pairs.

Here's a very simple example of how to insert a dictionary into MongoDB:

from pymongo import MongoClient

# Create connection to MongoDB
client = MongoClient('localhost', 27017)
db = client['name_of_database']
collection = db['name_of_collection']

# Build a basic dictionary
d = {'website': 'www.carrefax.com', 'author': 'Daniel Hoadley', 'colour': 'purple'}

# Insert the dictionary into Mongo
collection.insert_one(d)

Done.
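One detail worth knowing is that PyMongo adds an '_id' key to the dictionary you hand to insert_one. The following is a minimal sketch, not part of the original snippet, of a helper that inserts a copy so the caller's dictionary stays untouched. The names insert_dict and demo are hypothetical, and demo assumes a local mongod plus the same placeholder database and collection names used above.

```python
def insert_dict(collection, d):
    """Insert a copy of d and return the _id MongoDB assigned to it."""
    # insert_one() adds an '_id' key to the dict it receives,
    # so work on a shallow copy to leave the caller's dict unchanged.
    doc = dict(d)
    result = collection.insert_one(doc)
    return result.inserted_id


def demo():
    """Run against a local mongod; database and collection names are placeholders."""
    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    collection = client['name_of_database']['name_of_collection']
    d = {'website': 'www.carrefax.com', 'author': 'Daniel Hoadley', 'colour': 'purple'}
    print(insert_dict(collection, d))  # the ObjectId generated for the new document
    print('_id' in d)                  # the original dict was not modified
    client.close()
```

Call demo() once MongoDB is running to try it against a real collection.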

Connect to MongoDB

This snippet demonstrates how to connect to a local instance of MongoDB.

# Import the MongoClient class from pymongo
from pymongo import MongoClient

# Set up the client; by default MongoDB runs on port 27017
client = MongoClient('localhost', 27017)

# Set the name of the MongoDB database you want to connect to
db = client['name_of_database']

# Set the name of the collection within the database you want to connect to
collection = db['name_of_collection']

Create inverted index of sentences in text files

The purpose of this script is to identify instances in which any given sentence in any given document within a corpus also appears in other documents within the corpus.

For example, suppose we have a corpus of three text documents (document_x, document_y and document_z). Each document consists of one or many sentences.

Document_x consists of multiple sentences, one of which reads:

This sentence is about apples.

Another sentence in document_x reads:

However, unlike the other quoted sentence, this sentence is about oranges.

It just so happens that document_y contains the following sentence, amongst others:

This sentence is about apples.

We could construct an index where the sentence is the key and the list of files in which it appears is the value, for example:

Sentence | Files
This sentence is about apples. | document_x, document_y
However, unlike the other quoted sentence, this sentence is about oranges. | document_x
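
Before bringing files and MongoDB into it, the idea can be sketched entirely in memory. This is an illustrative sketch only: the regex-based split_sentences is a naive stand-in for a proper sentence tokenizer, build_sentence_index is a hypothetical name, and the filler sentence in document_y and the contents of document_z are invented for the example.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def build_sentence_index(documents):
    """Map each sentence to the sorted list of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for sentence in split_sentences(text):
            # A set means a sentence repeated within one document is counted once
            index[sentence].add(name)
    return {sentence: sorted(names) for sentence, names in index.items()}


corpus = {
    'document_x': 'This sentence is about apples. However, unlike the other '
                  'quoted sentence, this sentence is about oranges.',
    'document_y': 'Some other opening sentence. This sentence is about apples.',
    'document_z': 'Nothing in common here.',
}

index = build_sentence_index(corpus)
print(index['This sentence is about apples.'])  # ['document_x', 'document_y']
```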

The following code example performs this indexing task. For an added twist, the resulting dictionaries are written to a MongoDB database on my local machine.

## (c) 2017. Daniel Hoadley
## Tokenizes text files into sentences and writes the output to MongoDB as key(sentence)/value(source filename) pairs.
## Start mongodb instance first with ./mongod

import json
import os

from nltk.tokenize import sent_tokenize
from pymongo import MongoClient

## Run clean.py before executing this script.

## Connect to MongoDB instance and create new database/collection

client = MongoClient('localhost', 27017)
db = client['test-database']
collection = db['sentences']

# Create empty dictionary object

d = {}

# Read the source files

directory = '/Users/danielhoadley/Documents/Development/Python/regex'

for filename in os.listdir(directory):

    if filename.endswith('.cln'):

        # Build the full path so the script works from any working directory
        with open(os.path.join(directory, filename), 'r', encoding='utf-8') as source:
            content = source.read()
        name = filename

        # Tokenise the source file into sentences

        sents = sent_tokenize(content)
        print(content)
        print(sents)

        # Deduplicate the list of sentences to remove instances where a sentence appears multiple times in the same file
        deduped_sents = list(set(sents))

        # Clean the list because mongo doesn't like full points in the key
        # (strips leading/trailing full points and whitespace; colons become full points)

        clean_sents = [each.strip('.') for each in deduped_sents]
        fresh_sents = [each.strip() for each in clean_sents]
        cleaned = [word.replace(':', '.') for word in fresh_sents]

        # Populate the dictionary with the sentences as keys and the filenames as values

        for i in cleaned:
            d.setdefault(i, []).append(name)

# Remove keys that are 50 characters or fewer in length
# (iterate over a copy of the keys, because the dictionary is modified in the loop)

for k in list(d.keys()):
    if len(k) <= 50:
        del d[k]

# Iterate over the dictionary and write each key/value pair to MongoDB as a document

for key, value in d.items():
    collection.insert_one({'sentence': key, 'files': value})

# Dump the output to the console so I can eyeball it

print(json.dumps(d, sort_keys=True, indent=4)) # output the dictionary as prettified json

print('\nSentences extracted and written to MongoDB!\n')
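
The script above writes one document per insert call. As an alternative sketch, the dictionary could first be turned into a list of documents and written in a single batch with insert_many. The helper names index_to_documents and write_index are hypothetical, and the min_length default of 51 is an assumption chosen to match the script's rule of dropping keys of 50 characters or fewer.

```python
def index_to_documents(index, min_length=51):
    """Turn {sentence: [files]} into a list of MongoDB-ready documents.

    Sentences shorter than min_length characters are skipped.
    """
    return [
        {'sentence': sentence, 'files': list(files)}
        for sentence, files in index.items()
        if len(sentence) >= min_length
    ]


def write_index(collection, index):
    """Write the whole index in one batch and return the number of documents inserted."""
    docs = index_to_documents(index)
    if not docs:
        # insert_many() rejects an empty list, so return early
        return 0
    result = collection.insert_many(docs)
    return len(result.inserted_ids)
```

With the variables from the script, write_index(collection, d) would replace the insert loop.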

MongoDB and Node.js example

The following code provides a brief reference example of reading data from a MongoDB database using Node.js. 

Prerequisites

The following is already in place:

  • I've created a database in MongoDB called beetleJuice
  • Within the beetleJuice database, I've created a collection called bugs
  • The mongodb driver package has been installed in the project directory (npm install mongodb)
  • I have an instance of MongoDB running on port 27017

// Bring in the MongoDB dependency

const { MongoClient } = require('mongodb');

// Connect to the database

async function main() {

    const client = new MongoClient('mongodb://localhost:27017');

    try {

        await client.connect();

        // assign the bugs collection in the beetleJuice database to col

        const col = client.db('beetleJuice').collection('bugs');

        // use the findOne method to search for a document where assignee is set to Daniel Hoadley

        const doc = await col.findOne({ assignee: 'Daniel Hoadley' });

        // Print the resulting document to the console

        console.log('Here is my doc: %j', doc);

    } finally {

        // Close the connection to the database

        await client.close();
    }
}

// Report any connection or query error instead of failing silently

main().catch(console.error);
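
For comparison with the Python snippets earlier on this page, the same lookup can be sketched with PyMongo's find_one. This is an illustrative sketch only: find_bug_by_assignee and demo are hypothetical names, and demo assumes the same beetleJuice database and bugs collection described in the prerequisites.

```python
def find_bug_by_assignee(collection, assignee):
    """Return the first document whose 'assignee' field matches, or None if there is none."""
    return collection.find_one({'assignee': assignee})


def demo():
    """Run against a local mongod that holds the beetleJuice database."""
    from pymongo import MongoClient

    client = MongoClient('mongodb://localhost:27017')
    bugs = client['beetleJuice']['bugs']
    print(find_bug_by_assignee(bugs, 'Daniel Hoadley'))
    client.close()
```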