Using Scikit-Learn to classify your own text data (the short version)

Last month I posted a lengthy article on how to use Scikit-Learn to build a cross-validated classification model on your own text data. The purpose of that article was to provide an entry point for new Scikit-Learn users who wanted to move away from using the built-in datasets (like twentynewsgroups) and focus on their own corpora.

I thought it might be useful to post a condensed version of the longer read for people who wanted to skip over the explanatory material and get started with the code.

As before, the objective of the code is as follows. We have a dataset consisting of multiple directories, each containing n text files. Each directory name acts as a descriptive category label for the files contained within (e.g. technology, finance, food). We're going to use this data to build a classifier capable of recieving new, unlabeled text data and assigning it to the best fitting category.

The code

import sklearn
import numpy as np
from glob import glob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.externals import joblib

Get paths to labelled data

rawFolderPaths = glob("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/*/")

print ('\nGathering labelled categories...\n')

categories = []

Extract the folder paths, reduce down to the label and append to the categories list

for i in rawFolderPaths:
    string1 = i.replace('/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML/','')
    category = string1.strip('/')
    #print (category)

Load the data

print ('\nLoading the dataset...\n')
docs_to_train = sklearn.datasets.load_files("/Users/danielhoadley/PycharmProjects/trainer/!labelled_data_reportXML",description=None, categories=categories, load_content=True, encoding='utf-8', shuffle=True, random_state=42)

Split the dataset into training and testing sets

print ('\nBuilding out hold-out test sample...\n')
X_train, X_test, y_train, y_test = train_test_split(,, test_size=0.4)

Transform the training data into tfidf vectors

print ('\nTransforming the training data...\n')
count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(raw_documents=X_train)

tfidf_transformer = TfidfTransformer(use_idf=False)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print (X_train_tfidf.shape)

Transform the test data into tfidf vectors

print ('\nTransforming the test data...\n')
count_vect = CountVectorizer(stop_words='english')
X_test_counts = count_vect.fit_transform(raw_documents=X_test)

tfidf_transformer = TfidfTransformer(use_idf=False)
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
print (X_test_tfidf.shape)

print (X_test_tfidf)
print (y_train.shape)

docs_test = X_test

Construct the classifier pipeline using a SGDClassifier algorithm

print ('\nApplying the classifier...\n')
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer(use_idf=True)),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                      alpha=1e-3, random_state=42, verbose=1)),

Fit the model to the training data, y_train)

Run the test data into the model

predicted = text_clf.predict(docs_test)

Calculate mean accuracy of predictions

print (np.mean(predicted == y_test))

Generate labelled performance metrics

print(metrics.classification_report(y_test, predicted,