Pandas

Delete (drop) a column from a Pandas dataframe

Here's how to delete a column from a Pandas dataframe (assume we already have a dataframe, df):

df.drop('name of column', axis=1)

The axis=1 argument simply signals that we want to delete a column as opposed to a row.

To delete multiple columns in one go, just pass a list in, like this:

df.drop(['name of column', 'name of another column'], axis=1)
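
One thing to bear in mind: drop() returns a new dataframe rather than changing df itself, so to keep the result we need to assign it back (or pass inplace=True):

# drop() returns a new dataframe, so assign the result back to keep it
df = df.drop('name of column', axis=1)

# ...or modify df in place instead
df.drop('name of column', axis=1, inplace=True)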

Find rows containing specific values in a Pandas Dataframe

Suppose we have a Pandas dataframe, df, with a column called Name.

The data in the column looks like this:

Name
----
Statute of Westminster, The First (1275)
1275 c. 5
Statute of Marlborough 1267 [Waste]
1267 c. 23
The Statute of Marlborough 1267 [Distress]
1267 c. 1

If we want to exclude all rows that contain something like 1275 c. 5 from the dataframe, we can use Pandas' str.contains() method in combination with a regular expression, like so:

df = df[df['Name'].str.contains(r'\d{4}\s+[a-zA-Z]') == False]

This is basically saying that df is now equal to all of the rows in df where a match on our regular expression returns False.

Setting == True would have the opposite effect: retaining rows that match the expression and dropping those that do not.
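
Here's a small self-contained sketch of the same idea, using a few of the rows from the sample above:

import pandas as pd

# Tiny dataframe mimicking the statute data above
df = pd.DataFrame({'Name': ['Statute of Westminster, The First (1275)',
                            '1275 c. 5',
                            'Statute of Marlborough 1267 [Waste]',
                            '1267 c. 23']})

# Keep only the rows where the regular expression does not match
df = df[df['Name'].str.contains(r'\d{4}\s+[a-zA-Z]') == False]

print(df)  # the '1275 c. 5' and '1267 c. 23' rows are gone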

Write contents of list to column in Pandas dataframe

Let's say we have the following list:

myList = ['Wolf', 'Fox', 'Shark']

To write the contents of myList to a column in a Pandas dataframe, we'd do the following:

import pandas as pd # Bring in Pandas

myList = ['Wolf', 'Fox', 'Shark'] 

# Create empty dataframe with a column named 'Animals'

df = pd.DataFrame(columns=['Animals'])

# Write myList to the Animals column

df['Animals'] = myList
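
As an aside, the same result can be had in a single step by building the dataframe straight from the list. Note, too, that when assigning a list to a column of a dataframe that already has rows, the list must be the same length as the dataframe's index, otherwise Pandas raises a ValueError.

# Equivalent one-liner: build the dataframe directly from the list
df = pd.DataFrame({'Animals': myList})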

Building a Scatterplot with Pandas and Seaborn

Pandas and Seaborn go together like lemon and lime. In the code below, we're using Pandas to construct a dataframe from a CSV file and Seaborn (which sits on top of matplotlib and makes it look a million times better) is handling the visualisation end of things.

The dataframe consists of three columns, passiveYear, activeYear and Vala, where:

activeYear = the year of a case that considered an earlier case

passiveYear = the year of a case which has itself been considered

Vala = the type of consideration the active case meted out against the passive case.

Code

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Parse CSV, bring in the columns and drop null values
df = pd.read_csv('hoad.csv', usecols=['Vala', 'passiveYear', 'activeYear']).dropna()

# Build a grid consisting of a chart for each Vala type

grid = sns.FacetGrid(df, col="Vala", hue="Vala", col_wrap=3, height=3)

# Draw a horizontal line to show the rough midway point along the y axis 
grid.map(plt.axhline, y=1907, ls=":", c=".5")

# Plot where x=active year and y=passiveyear
grid.map(plt.scatter, "activeYear", "passiveYear", marker="o", alpha=0.5)

# Adjust the tick positions and labels
grid.set(xticks=[1800,2015], yticks=[1800,2015],
         xlim=(1955, 2015), ylim=(1800, 2015))

# Adjust the arrangement of the plots
grid.fig.tight_layout(w_pad=1)

plt.show()

This code yields the following visualisation:

Load a CSV as dataframe with Pandas

In days gone by, when it came to wrangling with tabular data, my first port of call would have been to load the data in Excel and slog it out for as long as it took. Now, I use Pandas to wrangle tabular data.

Having used Pandas for a while now, I've come to appreciate that dealing with larger quantities of data in Excel is made difficult by two things in particular. First, memory is an issue - try using Find and Replace to remove a few thousand commas and you'll notice that your computer begins to run out of puff. Second, Excel formulas aren't a patch on the Python language, the full power of which can be brought to bear on a Pandas dataframe.
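
To make the comma example concrete, here's what that find-and-replace looks like on a dataframe. The 'Description' column is just a made-up stand-in for whichever column holds the offending text:

# Strip every comma from a (hypothetical) 'Description' column in one pass
df['Description'] = df['Description'].str.replace(',', '', regex=False)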

Load a CSV as a dataframe

df = pd.read_csv('name_of_file.csv')

Load specific columns in the CSV as a dataframe

df = pd.read_csv('name_of_file.csv', usecols=['ColumnName', 'AnotherColumnName', 'Etcetera'])

Drop NaN Values from the Dataframe

df = pd.read_csv('name_of_file.csv', usecols=['ColumnName', 'AnotherColumnName', 'Etcetera']).dropna()

Load a CSV as a dataframe and specify a column as the index 

mydf = pd.read_csv('name_of_file.csv', index_col='ColumnName')
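
Once a column has been set as the index, rows can be pulled out by label with .loc. The value below is just a placeholder for whatever actually appears in ColumnName:

# Look up a row by its index label ('some_value' is a placeholder)
row = mydf.loc['some_value']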