Data Wrangling

Delete (drop) a column from a Pandas dataframe

Here's how to delete a column from a Pandas dataframe (assume we already have a dataframe, df):

df.drop('name of column', axis=1)

The axis=1 argument simply signals that we want to delete a column as opposed to a row. Note that drop returns a new dataframe rather than modifying df in place, so assign the result back (or pass inplace=True) if you want to keep the change.

To delete multiple columns in one go, just pass a list in, like this:

df.drop(['name of column', 'name of another column'], axis=1)
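
Putting that together, here's a minimal sketch; the dataframe and its column names are just placeholders for illustration:

import pandas as pd

# A small example dataframe with hypothetical columns
df = pd.DataFrame({'name': ['Alice', 'Bob'],
                   'age': [34, 29],
                   'city': ['Leeds', 'York'],
                   'notes': ['x', 'y']})

# Drop a single column; assign the result back because drop returns a copy
df = df.drop('notes', axis=1)

# Drop several columns in one go by passing a list
df = df.drop(['age', 'city'], axis=1)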

Converting Dates in Python

Dates are one of those annoying things that can be expressed in many different formats. For example, the date at the time of writing is 28 June 2017. I can express today's date in any number of ways, including:

  • June 28, 2017 
  • 28/06/2017
  • 28/06/17
  • 28/6/17
  • 2017-06-28

When you're working with dates in your Python projects, chances are you'll eventually need to wrangle a date from one format into another, so here's an example I came across in my own code recently.

I was parsing RSS feeds published by the British and Irish Legal Information Institute. In the feeds, dates are expressed as, for example, '28 June 2017'. For my purposes, I needed to convert the date into YYYY-mm-dd format (e.g. 2017-06-28). Here's how I dealt with it:

I was capturing the date like so:

date_in_feed = '28 June 2017'

I then set up the converter, which uses strptime, passing in the date I've captured as the first argument and its format as the second argument:

converter = time.strptime(date_in_feed, '%d %B %Y')

Finally, to get the date in the format I want, I use strftime, passing in the desired format as the first argument and the converter from above as the second argument.

converted_date = time.strftime('%Y-%m-%d', converter)
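
Here's the whole conversion as one runnable snippet; it assumes nothing beyond the standard library's time module:

import time

date_in_feed = '28 June 2017'

# Parse the feed's date string into a time struct
converter = time.strptime(date_in_feed, '%d %B %Y')

# Re-format the time struct as YYYY-mm-dd
converted_date = time.strftime('%Y-%m-%d', converter)

print(converted_date)  # 2017-06-28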

Given the myriad of alternative date formats you may be dealing with, I think you're best off looking at the datetime documentation, but I'll add more examples as and when I come across them.
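
The same conversion can also be done with the datetime module mentioned above; here's a rough sketch, assuming the same feed format:

from datetime import datetime

date_in_feed = '28 June 2017'

# Parse into a datetime object, then format it back out as YYYY-mm-dd
converted_date = datetime.strptime(date_in_feed, '%d %B %Y').strftime('%Y-%m-%d')

print(converted_date)  # 2017-06-28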

Load a CSV as dataframe with Pandas

In days gone by, when it came to wrangling with tabular data, my first port of call would have been to load the data in Excel and slog it out for as long as it took. Now, I use Pandas to wrangle tabular data.

Having used Pandas for a while now, I've come to appreciate that dealing with larger quantities of data in Excel is made difficult by two things in particular. First, memory is an issue - try using Find and Replace to remove a few thousand commas and you'll notice that your computer begins to run out of puff. Second, Excel formulas aren't a patch on the Python language, the full power of which can be brought to bear on a Pandas dataframe.

Load a CSV as a dataframe

df = pd.read_csv('name_of_file.csv')
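
Each of the read_csv examples below assumes Pandas has been imported and that 'name_of_file.csv' is a placeholder for your own file; a minimal sketch:

import pandas as pd

# Read the whole CSV into a dataframe (the filename is a placeholder)
df = pd.read_csv('name_of_file.csv')

# A quick look at the first few rows
print(df.head())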

Load specific columns in the CSV as a dataframe

df = pd.read_csv('name_of_file.csv', usecols=['ColumnName', 'AnotherColumnName', 'Etcetera'])

Drop NaN Values from the Dataframe

df = pd.read_csv('name_of_file.csv', usecols=['ColumnName', 'AnotherColumnName', 'Etcetera']).dropna()
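
By default, dropna() removes any row containing at least one NaN. Here's a short sketch showing how to restrict the check to a particular column instead; the column names are placeholders:

import pandas as pd

df = pd.read_csv('name_of_file.csv', usecols=['ColumnName', 'AnotherColumnName'])

# Only drop rows where 'ColumnName' itself is NaN
df = df.dropna(subset=['ColumnName'])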

Load a CSV as a dataframe and specify a column as the index 

mydf = pd.read_csv('name_of_file.csv', index_col='ColumnName')
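
Once a column is set as the index, rows can be looked up by its values. A quick sketch, again with placeholder names:

import pandas as pd

# Use 'ColumnName' as the dataframe's index rather than the default integer index
mydf = pd.read_csv('name_of_file.csv', index_col='ColumnName')

# Look up a row by its index value (the value here is hypothetical)
row = mydf.loc['some index value']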