This notebook demonstrates some basic techniques for the Python library pandas. Although I first posted it a while ago, I try to keep it updated so that it works with the latest version of pandas. Below you can see the version of pandas it was last tested with.

Table of Contents

Versioning

import pandas as pd
print(pd.__version__)
1.3.4

Importing Data

DataFrames from CSV Files

In pandas, DataFrames are the primary structure for dealing with data. They provide indexed rows and columns of a dataset, much like a spreadsheet. There are many ways to get data into a DataFrame; perhaps the most common is importing a CSV file.

shakespeare_path = 'C:/Users/Julius/Google Drive/JupyterNotebooks/data/shakespeare.csv'
df = pd.read_csv(shakespeare_path)
df
Name Year Category
0 Titus Andronicus 1592 Tragedy
1 The Comedy of Errors 1594 Comedy
2 Richard II 1595 History
3 Romeo and Juliet 1595 Tragedy
4 A Midsummer Night’s Dream 1595 Comedy
5 King John 1596 History
6 Julius Caesar 1599 Tragedy
7 Othello 1604 Tragedy
8 Macbeth 1606 Tragedy

Although importing from a CSV file is perhaps the most common way of getting data into a DataFrame, there are many alternatives.
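For example, pandas can read JSON with the same one-line pattern. A minimal sketch, using a small made-up JSON string:

```python
import pandas as pd
from io import StringIO

# A small made-up JSON document in "records" orientation
json_data = '[{"Name": "Othello", "Year": 1604}, {"Name": "Macbeth", "Year": 1606}]'
df_json = pd.read_json(StringIO(json_data))
print(df_json)
```

The same pattern works for read_excel, read_html, read_parquet, and the other read_* functions.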

read_csv accepts a lot of options. One I particularly like is the ability to limit the number of rows read, which lets you experiment with smaller DataFrames to make debugging easier.

df = pd.read_csv(shakespeare_path, nrows=2)
df
Name Year Category
0 Titus Andronicus 1592 Tragedy
1 The Comedy of Errors 1594 Comedy

DataFrames from Strings

Another way you can create a DataFrame is by using StringIO, which allows you to read a string as if it were a CSV file. It even allows for missing data.

from io import StringIO
csv_data = '''A,B,C,D
1.0, 2.0, 3.0, 4.0
5.0, 6.0,, 8.0
0.0, 11.0, 12.0,'''
df = pd.read_csv(StringIO(csv_data))
df
A B C D
0 1.0 2.0 3.0 4.0
1 5.0 6.0 NaN 8.0
2 0.0 11.0 12.0 NaN

DataFrames from .data Files

Datasets from the UCI Machine Learning Repository often use the .data extension. These can also be read using pd.read_csv.

splice_path = 'C:/Users/Julius/Google Drive/JupyterNotebooks/data/splice.data'
df = pd.read_csv(splice_path,
    names=["Class", "Instance", "Sequence"])
df.head()
Class Instance Sequence
0 EI ATRINS-DONOR-521 CCAGCTGCATCACAGGAGGCCAGCGAGCAGG...
1 EI ATRINS-DONOR-905 AGACCCGCCGGGAGGCGGAGGACCTGCAGGG...
2 EI BABAPOE-DONOR-30 GAGGTGAAGGACGTCCTTCCCCAGGAGCCGG...
3 EI BABAPOE-DONOR-867 GGGCTGCGTTGCTGGTCACATTCCTGGCAGGT...
4 EI BABAPOE-DONOR-2817 GCTCAGCCCCCAGGTCACCCAGGAACTGACGTG...

Note that you can also read files directly from the Internet without downloading them first.

df = pd.read_csv('https://web.stanford.edu/~hastie/ElemStatLearn/datasets/SAheart.data')
df.head()
row.names sbp tobacco ldl adiposity famhist typea obesity alcohol age chd
0 1 160 12.00 5.73 23.11 Present 49 25.30 97.20 52 1
1 2 144 0.01 4.41 28.61 Absent 55 28.87 2.06 63 1
2 3 118 0.08 3.48 32.28 Present 52 29.14 3.81 46 0
3 4 170 7.50 6.41 38.03 Present 51 31.99 24.26 58 1
4 5 134 13.60 3.50 27.78 Present 60 25.99 57.34 49 1

DataFrames from Lists

You can also create a DataFrame from a list. For a single list, you can put it directly in a DataFrame.

names = ['Titus Andronicus', 'The Comedy of Errors', 'Richard II',
       'Romeo and Juliet', 'A Midsummer Night’s Dream', 'King John',
       'Julius Caesar', 'Othello', 'Macbeth']
pd.DataFrame(names, columns=['Name'])
Name
0 Titus Andronicus
1 The Comedy of Errors
2 Richard II
3 Romeo and Juliet
4 A Midsummer Night’s Dream
5 King John
6 Julius Caesar
7 Othello
8 Macbeth

But if you’re combining multiple lists, you need to zip them first.

years = [1592, 1594, 1595, 1595, 1595, 1596, 1599, 1604, 1606]
categories = ['Tragedy', 'Comedy', 'History', 'Tragedy', 'Comedy', 'History',
       'Tragedy', 'Tragedy', 'Tragedy']
df = pd.DataFrame(list(zip(names, years, categories)), columns=['Name', 'Year', 'Category'])
df
Name Year Category
0 Titus Andronicus 1592 Tragedy
1 The Comedy of Errors 1594 Comedy
2 Richard II 1595 History
3 Romeo and Juliet 1595 Tragedy
4 A Midsummer Night’s Dream 1595 Comedy
5 King John 1596 History
6 Julius Caesar 1599 Tragedy
7 Othello 1604 Tragedy
8 Macbeth 1606 Tragedy

DataFrames from Dicts

A more common way is to build the DataFrame from a dictionary of lists:

my_dict = {'Name' : names, 'Year' : years, 'Category' : categories}
my_dict
{'Name': ['Titus Andronicus',
  'The Comedy of Errors',
  'Richard II',
  'Romeo and Juliet',
  'A Midsummer Night’s Dream',
  'King John',
  'Julius Caesar',
  'Othello',
  'Macbeth'],
 'Year': [1592, 1594, 1595, 1595, 1595, 1596, 1599, 1604, 1606],
 'Category': ['Tragedy',
  'Comedy',
  'History',
  'Tragedy',
  'Comedy',
  'History',
  'Tragedy',
  'Tragedy',
  'Tragedy']}
df = pd.DataFrame(my_dict)
df
Name Year Category
0 Titus Andronicus 1592 Tragedy
1 The Comedy of Errors 1594 Comedy
2 Richard II 1595 History
3 Romeo and Juliet 1595 Tragedy
4 A Midsummer Night’s Dream 1595 Comedy
5 King John 1596 History
6 Julius Caesar 1599 Tragedy
7 Othello 1604 Tragedy
8 Macbeth 1606 Tragedy

Saving a DataFrame

There are many ways to save a DataFrame, including as a pickle, CSV, JSON, or HDF5 file. They all follow a similar syntax.

df.to_csv('shakespeare.csv')
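For example, a pickle round-trip looks almost identical; a quick sketch (the filename is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Othello', 'Macbeth'], 'Year': [1604, 1606]})
df.to_pickle('shakespeare.pkl')            # binary format; preserves dtypes and index
df_restored = pd.read_pickle('shakespeare.pkl')
print(df_restored.equals(df))              # True, and no extra Unnamed: 0 column
```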

There’s something I find a little weird about saving and loading DataFrames: if you save one as above and then load it, you get an extra Unnamed: 0 column.

df = pd.read_csv('shakespeare.csv')
df.head()
Unnamed: 0 Name Year Category
0 0 Titus Andronicus 1592 Tragedy
1 1 The Comedy of Errors 1594 Comedy
2 2 Richard II 1595 History
3 3 Romeo and Juliet 1595 Tragedy
4 4 A Midsummer Night’s Dream 1595 Comedy

There are two ways to avoid this. I recommend the first if your index is just the default 0, 1, 2, … as it is above. In that case, save without the index:

df.to_csv('shakespeare.csv', index=False)
df = pd.read_csv('shakespeare.csv')
df.head()
Name Year Category
0 Titus Andronicus 1592 Tragedy
1 The Comedy of Errors 1594 Comedy
2 Richard II 1595 History
3 Romeo and Juliet 1595 Tragedy
4 A Midsummer Night’s Dream 1595 Comedy

The second way is better when you do have index labels you want to keep, but it works here too. Save your DataFrame as you did at first:

df.to_csv('shakespeare.csv')

Then when you open it, pass index_col=0 like so:

df = pd.read_csv('shakespeare.csv', index_col=0)
df.head()
Name Year Category
0 Titus Andronicus 1592 Tragedy
1 The Comedy of Errors 1594 Comedy
2 Richard II 1595 History
3 Romeo and Juliet 1595 Tragedy
4 A Midsummer Night’s Dream 1595 Comedy

Exploring Data

Displaying Parts of the DataFrame

OK, now that we have the data in a DataFrame, let’s explore it. To get a quick preview of the data, you can use head().

df = pd.read_csv(shakespeare_path)
df.head(3)
Name Year Category
0 Titus Andronicus 1592 Tragedy
1 The Comedy of Errors 1594 Comedy
2 Richard II 1595 History

To see a preview of the end, use tail(). If you want to see some rows at random, use sample():

df.sample(5)
Name Year Category
2 Richard II 1595 History
3 Romeo and Juliet 1595 Tragedy
6 Julius Caesar 1599 Tragedy
7 Othello 1604 Tragedy
0 Titus Andronicus 1592 Tragedy

Or you can slice the DataFrame:

df[5:]
Name Year Category
5 King John 1596 History
6 Julius Caesar 1599 Tragedy
7 Othello 1604 Tragedy
8 Macbeth 1606 Tragedy
df[0:5:2]
Name Year Category
0 Titus Andronicus 1592 Tragedy
2 Richard II 1595 History
4 A Midsummer Night’s Dream 1595 Comedy

You can also select by column. A single column of a Pandas DataFrame is known as a Pandas Series.

type(df['Name'])
pandas.core.series.Series

You can also make selections by inequalities, such as finding all the points where a column has a value greater than a specific amount.

df[df['Year'] > 1600]
Name Year Category
7 Othello 1604 Tragedy
8 Macbeth 1606 Tragedy
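You can combine conditions with & and |; the parentheses around each condition are required because of operator precedence. A small sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Richard II', 'Othello', 'Macbeth'],
                   'Year': [1595, 1604, 1606],
                   'Category': ['History', 'Tragedy', 'Tragedy']})
# Tragedies written after 1600; note the parentheses around each condition
late_tragedies = df[(df['Year'] > 1600) & (df['Category'] == 'Tragedy')]
print(late_tragedies['Name'].tolist())  # ['Othello', 'Macbeth']
```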

You can also select a specific cell based on its label.

df.loc[1, 'Name']
'The Comedy of Errors'

Built-in Descriptors

There are a few built-in descriptors that are good to know.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      9 non-null      object
 1   Year      9 non-null      int64 
 2   Category  9 non-null      object
dtypes: int64(1), object(2)
memory usage: 344.0+ bytes
df.shape
(9, 3)

For quantitative data, there are a few more worth knowing.

Quantitative Data

You can also find the data type of each column:

df.dtypes
Name        object
Year         int64
Category    object
dtype: object

You can do even more with quantitative data. For example, you can find the mean of each quantitative column (i.e. the int and float columns):

df['Year'].mean()
1597.3333333333333

Or you can use describe to find even more.

df.describe()
Year
count 9.000000
mean 1597.333333
std 4.743416
min 1592.000000
25% 1595.000000
50% 1595.000000
75% 1599.000000
max 1606.000000
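By default, describe summarizes only the numeric columns. Passing include='all' covers the text columns too, reporting counts of unique values and the most frequent value; a quick sketch with a trimmed-down version of the data:

```python
import pandas as pd

# A trimmed-down version of the plays data
df = pd.DataFrame({'Name': ['Othello', 'Macbeth'],
                   'Year': [1604, 1606],
                   'Category': ['Tragedy', 'Tragedy']})
summary = df.describe(include='all')
print(summary.loc['unique'])  # Name has 2 unique values, Category has 1
```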

Sorting

Sorting is very simple with DataFrames.

df.sort_values('Name', ascending=True)
Name Year Category
4 A Midsummer Night’s Dream 1595 Comedy
6 Julius Caesar 1599 Tragedy
5 King John 1596 History
8 Macbeth 1606 Tragedy
7 Othello 1604 Tragedy
2 Richard II 1595 History
3 Romeo and Juliet 1595 Tragedy
1 The Comedy of Errors 1594 Comedy
0 Titus Andronicus 1592 Tragedy

Searching for Text Within a DataFrame

df['Name'].str.contains('Julius')
0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
8    False
Name: Name, dtype: bool
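That boolean Series can be used directly as a filter:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Julius Caesar', 'Othello', 'Macbeth']})
matches = df[df['Name'].str.contains('Julius')]
print(matches['Name'].tolist())  # ['Julius Caesar']
```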

If you want to find both upper- and lowercase matches, you can pass (?i) to the regex parser to tell it to ignore case.

df['Name'].str.contains('(?i)julius')
0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
8    False
Name: Name, dtype: bool

Splitting by Strings

If you want to take the first element of each split, you have to reference str again.

df['Name'].str.split(' ').str[0]
0      Titus
1        The
2    Richard
3      Romeo
4          A
5       King
6     Julius
7    Othello
8    Macbeth
Name: Name, dtype: object

Grouping and Counting by Label

df['Category'].groupby(df['Category']).count()
Category
Comedy     2
History    2
Tragedy    5
Name: Category, dtype: int64
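The same counts can be obtained more directly with value_counts:

```python
import pandas as pd

df = pd.DataFrame({'Category': ['Tragedy', 'Comedy', 'History', 'Tragedy', 'Comedy',
                                'History', 'Tragedy', 'Tragedy', 'Tragedy']})
counts = df['Category'].value_counts()
print(counts['Tragedy'], counts['Comedy'], counts['History'])  # 5 2 2
```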

.loc vs .iloc

.loc searches for labels by name (label-based indexing); .iloc gets rows by position (positional indexing).

For a DataFrame whose index matches the row numbers (i.e. it starts at 0 and doesn’t skip any values), there is no difference between .loc and .iloc.

df.loc[1]
Name        The Comedy of Errors
Year                        1594
Category                  Comedy
Name: 1, dtype: object
df.iloc[1]
Name        The Comedy of Errors
Year                        1594
Category                  Comedy
Name: 1, dtype: object

But once the index and row numbers become different, you have to be careful about which one to use.

new_df = df.drop(1)
new_df
Name Year Category
0 Titus Andronicus 1592 Tragedy
2 Richard II 1595 History
3 Romeo and Juliet 1595 Tragedy
4 A Midsummer Night’s Dream 1595 Comedy
5 King John 1596 History
6 Julius Caesar 1599 Tragedy
7 Othello 1604 Tragedy
8 Macbeth 1606 Tragedy
new_df.iloc[1]
Name        Richard II
Year              1595
Category       History
Name: 2, dtype: object

There is still a row at position 1, but there is no longer a label 1. So if we try .loc[1], we’ll get an error.

try:
    new_df.loc[1]
except KeyError:
    print("There is nothing with index 1")
There is nothing with index 1

If you want to update your indices you can do so with reset_index.

new_df.reset_index(drop=True)
Name Year Category
0 Titus Andronicus 1592 Tragedy
1 Richard II 1595 History
2 Romeo and Juliet 1595 Tragedy
3 A Midsummer Night’s Dream 1595 Comedy
4 King John 1596 History
5 Julius Caesar 1599 Tragedy
6 Othello 1604 Tragedy
7 Macbeth 1606 Tragedy

When you use .iloc, you can treat the result just like a NumPy matrix. This means trailing colons are always optional (just like in NumPy).
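For example, .iloc accepts a row slice and a column slice at once, just like slicing a NumPy matrix:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Titus Andronicus', 'The Comedy of Errors', 'Richard II'],
                   'Year': [1592, 1594, 1595],
                   'Category': ['Tragedy', 'Comedy', 'History']})
print(df.iloc[0:2, 0:2])  # first two rows, first two columns
print(df.iloc[0:2])       # trailing colon omitted: same rows, all columns
```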

Cleaning Data

Bad values can come in many forms, including missing values, NaN, NA, ?, etc. Let’s go over how to find them.

df = pd.read_csv(StringIO(csv_data))
df
A B C D
0 1.0 2.0 3.0 4.0
1 5.0 6.0 NaN 8.0
2 0.0 11.0 12.0 NaN

Now we can clean that dataset. One of the first things to look for is empty values:

df.isnull()
A B C D
0 False False False False
1 False False True False
2 False False False True

This works for tiny datasets, but for larger ones we won’t be able to see every value, so we can sum up the True values instead.

df.isnull().sum()
A    0
B    0
C    1
D    1
dtype: int64

Adding

To add a new column, simply assign to it, making sure the values match the number of rows.

df['E'] = df['A'] * df['B'] + 1
df
A B C D E
0 1.0 2.0 3.0 4.0 3.0
1 5.0 6.0 NaN 8.0 31.0
2 0.0 11.0 12.0 NaN 1.0

You can also add columns using conditional logic:

import numpy as np
df['F'] = np.where(df['B'] > df['E'], 1, 0)
df
A B C D E F
0 1.0 2.0 3.0 4.0 3.0 0
1 5.0 6.0 NaN 8.0 31.0 0
2 0.0 11.0 12.0 NaN 1.0 1

Dropping

There are many ways to drop missing data. The basic dropna drops every row that has a missing value.

df.dropna()
A B C D E F
0 1.0 2.0 3.0 4.0 3.0 0

Note that this doesn’t drop the data from the original DataFrame; it just returns a new one without the dropped values.
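To make the change stick, reassign the result (or pass inplace=True); a quick sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 5.0], 'C': [3.0, np.nan]})
df.dropna()              # returns a new DataFrame; df itself is unchanged
print(len(df))           # 2
df = df.dropna()         # reassign to keep the result
print(len(df))           # 1
```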

Alternatively, you could drop all the columns that have a missing value:

df.dropna(axis=1)
A B E F
0 1.0 2.0 3.0 0
1 5.0 6.0 31.0 0
2 0.0 11.0 1.0 1

But you may want more precision than that. One option is to drop specific rows or columns by name. To drop columns, use axis=1; to drop rows, use axis=0 or leave it out, since 0 is the default.

df.drop([0,2])
A B C D E F
1 5.0 6.0 NaN 8.0 31.0 0
df.drop(['D', 'C'], axis=1)
A B E F
0 1.0 2.0 3.0 0
1 5.0 6.0 31.0 0
2 0.0 11.0 1.0 1

Selecting specific columns this way is common when preparing a dataset for machine learning. For example, you might separate the features and labels like so:

features = df.drop('A', axis=1)
labels = df['A']

You can also be selective about which rows to drop. For instance, you can drop only the rows where all columns are NaN.

df.dropna(how='all')
A B C D E F
0 1.0 2.0 3.0 4.0 3.0 0
1 5.0 6.0 NaN 8.0 31.0 0
2 0.0 11.0 12.0 NaN 1.0 1

Or drop all rows that have fewer than 4 non-NaN values:

df.dropna(thresh=4)
A B C D E F
0 1.0 2.0 3.0 4.0 3.0 0
1 5.0 6.0 NaN 8.0 31.0 0
2 0.0 11.0 12.0 NaN 1.0 1

Or just drop rows where NaN appears in a specific column:

df.dropna(subset=['D'])
A B C D E F
0 1.0 2.0 3.0 4.0 3.0 0
1 5.0 6.0 NaN 8.0 31.0 0

Finding Missing Values

You will need to clean up missing values before doing any machine learning on the data. Scikit-learn will give you an error if you try to run an algorithm on a dataset with missing values.

pandas has two commands to find missing values in a DataFrame, isna and isnull. They are 100% identical; the documentation literally defines isnull = isna.

df.isnull()
A B C D E F
0 False False False False False False
1 False False True False False False
2 False False False True False False
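You can verify that they agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'C': [3.0, np.nan, 12.0], 'D': [4.0, 8.0, np.nan]})
print(df.isna().equals(df.isnull()))  # True
```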

Note that pandas only treats certain values as missing. It’s possible that whoever created the dataset used some other value to denote missing data that pandas doesn’t recognize. Below, many of the values could have been intended to denote missing data, but pandas only sees the None value as missing.

a = [False, 0, 'na', 'NaN']
b = ['None', None, 'nan', 'NA']
df_missing = pd.DataFrame(list(zip(a,b)), columns=(['a','b']))
df_missing
a b
0 False None
1 0 None
2 na nan
3 NaN NA
df_missing.isnull()
a b
0 False False
1 False True
2 False False
3 False False

But when you import from a CSV file, only genuinely empty fields show up as missing.

bad_data = '''A,B,C,D
None, none, 3.0, 4.0
Missing, 6.0,, 'None'
0.0, NaN, False,'''

df2 = pd.read_csv(StringIO(bad_data))
df2
A B C D
0 None none 3.0 4.0
1 Missing 6.0 NaN 'None'
2 0.0 NaN False NaN
df2.isnull()
A B C D
0 False False False False
1 False False True False
2 False False False True

Imputing

Sometimes deleting data isn’t the right approach. In these cases, you can impute the missing values instead, filling them in based on the rest of the data. scikit-learn has a good tool for that.

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')

It uses the same fit and transform approach as other scikit-learn tools.

df
A B C D E F
0 1.0 2.0 3.0 4.0 3.0 0
1 5.0 6.0 NaN 8.0 31.0 0
2 0.0 11.0 12.0 NaN 1.0 1
df.isnull()
A B C D E F
0 False False False False False False
1 False False True False False False
2 False False False True False False
imp.fit(df)
imp.transform(df)
array([[ 1. ,  2. ,  3. ,  4. ,  3. ,  0. ],
       [ 5. ,  6. ,  7.5,  8. , 31. ,  0. ],
       [ 0. , 11. , 12. ,  6. ,  1. ,  1. ]])
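If you’d rather stay in pandas, fillna with the column means achieves the same mean imputation; a minimal sketch with a couple of hypothetical columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'C': [3.0, np.nan, 12.0], 'D': [4.0, 8.0, np.nan]})
filled = df.fillna(df.mean())  # NaN in C becomes 7.5, NaN in D becomes 6.0
print(filled)
```

Unlike SimpleImputer, this keeps the result as a DataFrame with its column names intact.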

Data Types

Data are usually split into three types: continuous, ordinal, and categorical. With continuous data, the values themselves have specific numeric meaning. For example, the price of a good is a continuous value. Common statistical analyses are usually appropriate for continuous values; for example, the average of two continuous values is a meaningful description of the data.

For ordinal data, the values are ordered, but the differences between them don’t have a fixed meaning. T-shirt size is a good example: a small is less than a medium, but not by a specific amount. If you converted “small” into a specific measurement (e.g. inches), it would become a continuous value.

The last type is categorical, where the differences aren’t associated with any ranking or order. For example, shirt style is a categorical variable. We might have “polo” and “T-shirt”, but neither is greater than the other. We could (and will) assign numbers to these categories for machine learning, but we need to remember that those numbers aren’t associated with values the way they would be in continuous or ordinal data.

df = pd.DataFrame([
        ['green', 'M', 20, 'polo'],
        ['red', 'L', 15, 'T-shirt'],
        ['red', 'S', 15, 'polo']])
df.columns=['color', 'size', 'price', 'type']
df
color size price type
0 green M 20 polo
1 red L 15 T-shirt
2 red S 15 polo

Because size can be ranked, it is ordinal. We can map it to numbers to make it easier for machine learning.

size_mapping = {'S':1, 'M': 2, 'L':3}
df['size']=df['size'].map(size_mapping) # we can do a reverse mapping if we want to undo this in the end
df
color size price type
0 green 2 20 polo
1 red 3 15 T-shirt
2 red 1 15 polo

Even though the type is categorical, we can still map it to numbers:

import numpy as np
class_mapping = {label:idx for idx, label in enumerate(np.unique(df['type']))}
df['type'] = df['type'].map(class_mapping)
df
color size price type
0 green 2 20 1
1 red 3 15 0
2 red 1 15 1

After we perform whatever analysis we want to on the data, we could then invert that mapping.

inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['type'] = df['type'].map(inv_class_mapping)
df
color size price type
0 green 2 20 polo
1 red 3 15 T-shirt
2 red 1 15 polo

Another way to do this is with the LabelEncoder class in scikit-learn.

from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['type'].values) #fit_transform is a shortcut for calling fit and transform separately
print(y)
class_le.inverse_transform(y)
[1 0 1]

array(['polo', 'T-shirt', 'polo'], dtype=object)

Preparing Data for Machine Learning

It’s a good idea to one-hot encode categorical variables before using them for machine learning. It can be a good idea for ordinal variables as well, although not always. A good rule of thumb: if the mean of two values isn’t a meaningful value, that variable should be one-hot encoded.

The downside of treating ordinal data as categorical is that we throw away information about the relative order. The downside of treating it as continuous data is that we introduce a notion of distance. For example, if we set “small” as “1”, “medium” as “2”, and “large” as “3”, we’re telling the model that a large is 3 times a small. This isn’t a meaningful thing to say, so it can reduce model performance.

Shuffling Data

pandas makes it very easy to shuffle data.

df.sample(frac=1).reset_index(drop=True)
color size price type
0 green 2 20 polo
1 red 3 15 T-shirt
2 red 1 15 polo
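Passing random_state makes the shuffle reproducible, which is handy for debugging; a quick sketch with a trimmed-down frame:

```python
import pandas as pd

df = pd.DataFrame({'price': [20, 15, 15], 'size': [2, 3, 1]})
shuffled_a = df.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled_a.equals(shuffled_b))  # True: same seed gives the same order
```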

Separating Features from Labels

pandas also has a convenient way to extract the labels from a DataFrame, and that’s by using the pop method. It will remove the specified column from the DataFrame and put it into a Series of its own.

labels = df.pop('price')

You can see that it’s no longer in the DataFrame:

df.head()
color size type
0 green 2 polo
1 red 3 T-shirt
2 red 1 polo
labels
0    20
1    15
2    15
Name: price, dtype: int64

One-hot Encoding

pandas can also do one-hot encoding.

pd.get_dummies(df[['color', 'size']])
size color_green color_red
0 2 1 0
1 3 0 1
2 1 0 1
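If your model is sensitive to collinear features (e.g. linear regression), you can drop the first level of each encoded variable, since it’s implied by the others:

```python
import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'red'], 'size': [2, 3, 1]})
encoded = pd.get_dummies(df[['color', 'size']], drop_first=True)
print(encoded.columns.tolist())  # color_green is gone; it's implied by color_red == 0
```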

Other Tricks

Don’t Truncate

pandas will by default truncate text that is over a certain limit. When working with text data, sometimes you don’t want this. Here’s how to stop it.

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

pd.set_option is a nice tool, but if for some reason you don’t want to use it, you can also set the options directly.

pd.options.display.max_columns = None
pd.options.display.max_rows = None

Transposing

You can switch the rows and columns of a DataFrame by transposing it with .T.

df = pd.DataFrame(list(zip(names,years,categories)), columns=['Name', 'Year', 'Category'])
df
Name Year Category
0 Titus Andronicus 1592 Tragedy
1 The Comedy of Errors 1594 Comedy
2 Richard II 1595 History
3 Romeo and Juliet 1595 Tragedy
4 A Midsummer Night’s Dream 1595 Comedy
5 King John 1596 History
6 Julius Caesar 1599 Tragedy
7 Othello 1604 Tragedy
8 Macbeth 1606 Tragedy
df = df.T
df
0 1 2 3 4 5 6 7 8
Name Titus Andronicus The Comedy of Errors Richard II Romeo and Juliet A Midsummer Night’s Dream King John Julius Caesar Othello Macbeth
Year 1592 1594 1595 1595 1595 1596 1599 1604 1606
Category Tragedy Comedy History Tragedy Comedy History Tragedy Tragedy Tragedy