This post demonstrates some basic techniques for the Python library pandas.

Table of Contents

Version

I rerun this code every once in a while to ensure it’s up-to-date. Here’s the latest version it was tested on:

import pandas as pd
from pathlib import Path
print(pd.__version__)

1.4.4

Importing Data

DataFrames from CSV Files

In pandas, DataFrames are the primary structure for dealing with data. They provide indexed rows and columns of a dataset, much like a spreadsheet. There are many ways to get data into DataFrames. Perhaps the most common way of getting data into DataFrames is by importing CSV files.

shakespeare_path = Path("data") / "shakespeare.csv"

df = pd.read_csv(shakespeare_path)

df

	Name	Year	Category
0	Titus Andronicus	1592	Tragedy
1	The Comedy of Errors	1594	Comedy
2	Richard II	1595	History
3	Romeo and Juliet	1595	Tragedy
4	A Midsummer Night’s Dream	1595	Comedy
5	King John	1596	History
6	Julius Caesar	1599	Tragedy
7	Othello	1604	Tragedy
8	Macbeth	1606	Tragedy

Although importing from a CSV file is perhaps the most common way of getting data into a DataFrame, there are many alternatives.

There are a lot of options that you can do with read_csv. One in particular that I like is the ability to limit the number of rows you read, which allows for you to experiment with smaller DataFrames to make debugging easier.

df = pd.read_csv(shakespeare_path, nrows=2)

df

	Name	Year	Category
0	Titus Andronicus	1592	Tragedy
1	The Comedy of Errors	1594	Comedy

DataFrames from Strings

Another way you can create a DataFrame is by using StringIO, which allows you to read a string as if it were a CSV file. It even allows for missing data.

from io import StringIO

csv_data = """A,B,C,D
1.0, 2.0, 3.0, 4.0
5.0, 6.0,, 8.0
0.0, 11.0, 12.0,"""
df = pd.read_csv(StringIO(csv_data))

df

	A	B	C	D
0	1.0	2.0	3.0	4.0
1	5.0	6.0	NaN	8.0
2	0.0	11.0	12.0	NaN

DataFrames from .data Files

Machine learning datasets from the UCI Machine Learning Repository often have datasets with the .data extension. These can also be read using pd.read_csv.

splice_path = "C:/Users/Julius/Google Drive/JupyterNotebooks/data/splice.data"
df = pd.read_csv(splice_path, names=["Class", "Instance", "Sequence"])

df.head()

	Class	Instance	Sequence
0	EI	ATRINS-DONOR-521	CCAGCTGCATCACAGGAGGCCAGCGAGCAGG...
1	EI	ATRINS-DONOR-905	AGACCCGCCGGGAGGCGGAGGACCTGCAGGG...
2	EI	BABAPOE-DONOR-30	GAGGTGAAGGACGTCCTTCCCCAGGAGCCGG...
3	EI	BABAPOE-DONOR-867	GGGCTGCGTTGCTGGTCACATTCCTGGCAGGT...
4	EI	BABAPOE-DONOR-2817	GCTCAGCCCCCAGGTCACCCAGGAACTGACGTG...

Note that you can also read files directly from the Internet without downloading them first.

df = pd.read_csv("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/SAheart.data")
df.head()

	row.names	sbp	tobacco	ldl	adiposity	famhist	typea	obesity	alcohol	age	chd
0	1	160	12.00	5.73	23.11	Present	49	25.30	97.20	52	1
1	2	144	0.01	4.41	28.61	Absent	55	28.87	2.06	63	1
2	3	118	0.08	3.48	32.28	Present	52	29.14	3.81	46	0
3	4	170	7.50	6.41	38.03	Present	51	31.99	24.26	58	1
4	5	134	13.60	3.50	27.78	Present	60	25.99	57.34	49	1

DataFrames from Lists

You can also create a DataFrame from a list. For a single list, you can put it directly in a DataFrame.

names = [
    "Titus Andronicus",
    "The Comedy of Errors",
    "Richard II",
    "Romeo and Juliet",
    "A Midsummer Night’s Dream",
    "King John",
    "Julius Caesar",
    "Othello",
    "Macbeth",
]
pd.DataFrame(names, columns=["Name"])

	Name
0	Titus Andronicus
1	The Comedy of Errors
2	Richard II
3	Romeo and Juliet
4	A Midsummer Night’s Dream
5	King John
6	Julius Caesar
7	Othello
8	Macbeth

But if you’re combining multiple lists, you need to zip them first.

years = [1592, 1594, 1595, 1595, 1595, 1596, 1599, 1604, 1606]
categories = ["Tragedy", "Comedy", "History", "Tragedy", "Comedy", "History", "Tragedy", "Tragedy", "Tragedy"]

df = pd.DataFrame(list(zip(names, years, categories)), columns=["Name", "Year", "Category"])
df

	Name	Year	Category
0	Titus Andronicus	1592	Tragedy
1	The Comedy of Errors	1594	Comedy
2	Richard II	1595	History
3	Romeo and Juliet	1595	Tragedy
4	A Midsummer Night’s Dream	1595	Comedy
5	King John	1596	History
6	Julius Caesar	1599	Tragedy
7	Othello	1604	Tragedy
8	Macbeth	1606	Tragedy

DataFrames from Dicts

A more common way to do it is through dictionaries

my_dict = {"Name": names, "Year": years, "Category": categories}

my_dict

{'Name': ['Titus Andronicus',
  'The Comedy of Errors',
  'Richard II',
  'Romeo and Juliet',
  'A Midsummer Night’s Dream',
  'King John',
  'Julius Caesar',
  'Othello',
  'Macbeth'],
 'Year': [1592, 1594, 1595, 1595, 1595, 1596, 1599, 1604, 1606],
 'Category': ['Tragedy',
  'Comedy',
  'History',
  'Tragedy',
  'Comedy',
  'History',
  'Tragedy',
  'Tragedy',
  'Tragedy']}

df = pd.DataFrame(my_dict)
df

	Name	Year	Category
0	Titus Andronicus	1592	Tragedy
1	The Comedy of Errors	1594	Comedy
2	Richard II	1595	History
3	Romeo and Juliet	1595	Tragedy
4	A Midsummer Night’s Dream	1595	Comedy
5	King John	1596	History
6	Julius Caesar	1599	Tragedy
7	Othello	1604	Tragedy
8	Macbeth	1606	Tragedy

If you have a simple dictionary like this: d = {'a': 1, 'b': 2}, you can’t just put it in a pd.DataFrame, because it’ll give you a ValueError for not passing an index. Instead, you can do this: pd.DataFrame(list(d.items())). You could also pass column names so it looks more like this: pd.DataFrame(list(d.items()), columns=['Key', 'Value']).

You might not want the dictionary keys, in which can you can do this: pd.DataFrame(list(building_dict.values()))

Saving a DataFrame

There are many ways to save a DataFrame, including in a pickle, msgpack, CSV, and HDF5Store. They all follow similar syntax.

df.to_csv("shakespeare.csv")

There’s something I find a little weird about saving a loading DataFrames. It’s that if you save it as above and then load it, you’ll get an extra column.

df = pd.read_csv("shakespeare.csv")

df.head()

	Unnamed: 0	Name	Year	Category
0	0	Titus Andronicus	1592	Tragedy
1	1	The Comedy of Errors	1594	Comedy
2	2	Richard II	1595	History
3	3	Romeo and Juliet	1595	Tragedy
4	4	A Midsummer Night’s Dream	1595	Comedy

There are two ways to avoid this. The first I recommend if your row names are just 0-4 (or any number) as they are above. In that case, when you save it, you want to save it like so:

df.to_csv("shakespeare.csv", index=False)

df = pd.read_csv("shakespeare.csv")

df.head()

	Unnamed: 0	Name	Year	Category
0	0	Titus Andronicus	1592	Tragedy
1	1	The Comedy of Errors	1594	Comedy
2	2	Richard II	1595	History
3	3	Romeo and Juliet	1595	Tragedy
4	4	A Midsummer Night’s Dream	1595	Comedy

The second way is better in the case when you do have index names that you want to save, but it will still work in this case. What you want to do is save your DataFrame like you did at first:

df.to_csv("shakespeare.csv")

Then when you open it, pass index_col=0 like so:

df = pd.read_csv("shakespeare.csv", index_col=0)

df.head()

	Unnamed: 0	Name	Year	Category
0	0	Titus Andronicus	1592	Tragedy
1	1	The Comedy of Errors	1594	Comedy
2	2	Richard II	1595	History
3	3	Romeo and Juliet	1595	Tragedy
4	4	A Midsummer Night’s Dream	1595	Comedy

Exploring Data

Displaying Parts of the DataFrame

OK, now that we have the data into a DataFrame, let’s explore it. To get a quick preview of the data, you can use head().

df = pd.read_csv(shakespeare_path)
df.head(3)

	Name	Year	Category
0	Titus Andronicus	1592	Tragedy
1	The Comedy of Errors	1594	Comedy
2	Richard II	1595	History

To see a preview of the end, use tail(). If you want to see some columns at random, use sample()

df.sample(5)

	Name	Year	Category
1	The Comedy of Errors	1594	Comedy
3	Romeo and Juliet	1595	Tragedy
4	A Midsummer Night’s Dream	1595	Comedy
2	Richard II	1595	History
7	Othello	1604	Tragedy

Or you can slice the DataFrame

df[5:]

	Name	Year	Category
5	King John	1596	History
6	Julius Caesar	1599	Tragedy
7	Othello	1604	Tragedy
8	Macbeth	1606	Tragedy

df[0:5:2]

	Name	Year	Category
0	Titus Andronicus	1592	Tragedy
2	Richard II	1595	History
4	A Midsummer Night’s Dream	1595	Comedy

You can also select by column. A single column of a Pandas DataFrame is known as a Pandas Series.

type(df["Name"])

pandas.core.series.Series

You can also make selections by inequalities, such as finding all the points where a column has a value greater than a specific amount.

df[df["Year"] > 1600]

	Name	Year	Category
7	Othello	1604	Tragedy
8	Macbeth	1606	Tragedy

You can also select a specific cell based on its label.

df.loc[1, "Name"]

'The Comedy of Errors'

Built-in Descriptors

There are a few built-in descriptors that are good to know.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      9 non-null      object
 1   Year      9 non-null      int64 
 2   Category  9 non-null      object
dtypes: int64(1), object(2)
memory usage: 344.0+ bytes

df.shape

(9, 3)

For quantitative data, there are a few more worth knowing.

Quantitative Data

You can also find the data type of each column

df.dtypes

Name        object
Year         int64
Category    object
dtype: object

You can do even more with quantitative data. You can find the mean of each quantitative column in the dataset (e.g. ints and floats)

df["Year"].mean()

1597.3333333333333

Or you can use describe to find even more.

df.describe()

	Year
count	9.000000
mean	1597.333333
std	4.743416
min	1592.000000
25%	1595.000000
50%	1595.000000
75%	1599.000000
max	1606.000000

Sorting

Sorting is very simple with DataFrames.

df.sort_values("Name", ascending=True)

	Name	Year	Category
4	A Midsummer Night’s Dream	1595	Comedy
6	Julius Caesar	1599	Tragedy
5	King John	1596	History
8	Macbeth	1606	Tragedy
7	Othello	1604	Tragedy
2	Richard II	1595	History
3	Romeo and Juliet	1595	Tragedy
1	The Comedy of Errors	1594	Comedy
0	Titus Andronicus	1592	Tragedy

Searching for Text Within a DataFrame

df["Name"].str.contains("Julius")

  False
  False
  False
  False
  False
  False
   True
  False
  False
Name: Name, dtype: bool

If you want to find both uppper and lower case examples, you can pass (?i) to the regex parser to tell it to ignore cases.

df["Name"].str.contains("(?i)julius")

  False
  False
  False
  False
  False
  False
   True
  False
  False
Name: Name, dtype: bool

Splitting by Strings

If you want to split on the first element, you have to reference str again.

df["Name"].str.split(" ").str[0]

    Titus
      The
  Richard
    Romeo
        A
     King
   Julius
  Othello
  Macbeth
Name: Name, dtype: object

Grouping and Counting by Label

df["Category"].groupby(df["Category"]).count()

Category
Comedy     2
History    2
Tragedy    5
Name: Category, dtype: int64

Cleaning Data

Bad values can come in many forms, including missing values, NaN, NA, ?, etc. Let’s go over how to find them

df = pd.read_csv(StringIO(csv_data))
df

	A	B	C	D
0	1.0	2.0	3.0	4.0
1	5.0	6.0	NaN	8.0
2	0.0	11.0	12.0	NaN

Now we can clean that dataset. One of the first things to look for is empty values

df.isnull()

	A	B	C	D
0	False	False	False	False
1	False	False	True	False
2	False	False	False	True

This works for tiny datasets, but if it is larger we won’t be able to see them all, so we can just sum up all the True values

df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

Adding

To add a new column, simply declare it and make sure it has the correct number of rows.

df["E"] = df["A"] * df["B"] + 1

df

	A	B	C	D	E
0	1.0	2.0	3.0	4.0	3.0
1	5.0	6.0	NaN	8.0	31.0
2	0.0	11.0	12.0	NaN	1.0

You can also add columns using conditional logic

import numpy as np

df["F"] = np.where(df["B"] > df["E"], 1, 0)

df

	A	B	C	D	E	F
0	1.0	2.0	3.0	4.0	3.0	0
1	5.0	6.0	NaN	8.0	31.0	0
2	0.0	11.0	12.0	NaN	1.0	1

Dropping

There are many ways to drop missing data. The basic dropna drops every row that has a missing value.

df.dropna()

	A	B	C	D	E	F
0	1.0	2.0	3.0	4.0	3.0	0

Note that this doesn’t drop the data from the original DataFrame, it just returns a new one without the dropped values.

Alternatively, you could drop all the columns that have a missing value

df.dropna(axis=1)

	A	B	E	F
0	1.0	2.0	3.0	0
1	5.0	6.0	31.0	0
2	0.0	11.0	1.0	1

But you may want more precision than that. One way to do this is by dropping specific rows or columns by name. To drop columns, use axis=1; to drop rows, use axis=0, or leave it out, because 0 is the default value

df.drop([0,2])

	A	B	C	D	E	F
1	5.0	6.0	NaN	8.0	31.0	0

df.drop(["D", "C"], axis=1)

	A	B	E	F
0	1.0	2.0	3.0	0
1	5.0	6.0	31.0	0
2	0.0	11.0	1.0	1

Selecting specific columns is commonly used when preparing a dataset for machine learning. For example, you might select the features and labels like so

features = df.drop("A", axis=1)
labels = df["A"]

You can also selective decide which rows to drop. You can decide to only drop rows where all columns are NaN.

df.dropna(how="all")

	A	B	C	D	E	F
0	1.0	2.0	3.0	4.0	3.0	0
1	5.0	6.0	NaN	8.0	31.0	0
2	0.0	11.0	12.0	NaN	1.0	1

Or drop all rows that have fewer than 4 non-NaN values

df.dropna(thresh=4)

	A	B	C	D	E	F
0	1.0	2.0	3.0	4.0	3.0	0
1	5.0	6.0	NaN	8.0	31.0	0
2	0.0	11.0	12.0	NaN	1.0	1

Or just drop rows where NaN appears in a specific column

df.dropna(subset=["D"])

	A	B	C	D	E	F
0	1.0	2.0	3.0	4.0	3.0	0
1	5.0	6.0	NaN	8.0	31.0	0

Finding Missing Values

You will need to clean up missing values before doing any machine learning on the data. Scikit-learn will give you an error if you try to run an algorithm on a dataset with missing values.

pandas has two commands to find missing values in a DataFrame, isna and isnull. These are 100% identical. The documentation literally says isnull = isna.

df.isnull()

	A	B	C	D	E	F
0	False	False	False	False	False	False
1	False	False	True	False	False	False
2	False	False	False	True	False	False

Note that pandas will only call a certain type of data as missing. It’s possible that the user used a certain value to denote missing data but pandas doesn’t see it as missing. Below, there are many types of values that could have been intended to denote missing data, but pandas only sees the None value as missing.

a = [False, 0, "na", "NaN"]
b = ["None", None, "nan", "NA"]

df_missing = pd.DataFrame(list(zip(a, b)), columns=(["a", "b"]))

df_missing

	a	b
0	False	None
1	0	None
2	na	nan
3	NaN	NA

df_missing.isnull()

	a	b
0	False	False
1	False	True
2	False	False
3	False	False

But when you import from a CSV, only the missing values will show up as missing.

bad_data = """A,B,C,D
None, none, 3.0, 4.0
Missing, 6.0,, 'None'
0.0, NaN, False,"""

df2 = pd.read_csv(StringIO(bad_data))
df2

	A	B	C	D
0	None	none	3.0	4.0
1	Missing	6.0	NaN	'None'
2	0.0	NaN	False	NaN

df2.isnull()

	A	B	C	D
0	False	False	False	False
1	False	False	True	False
2	False	False	False	True

Interpolating

Sometimes deleting data isn’t the right approach. In these cases, you can interpolate the data. SKLearn has a good library for that.

from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy="mean")

It uses the same fit and transform approach as other SKLearn tools.

df

	A	B	C	D	E	F
0	1.0	2.0	3.0	4.0	3.0	0
1	5.0	6.0	NaN	8.0	31.0	0
2	0.0	11.0	12.0	NaN	1.0	1

df.isnull()

	A	B	C	D	E	F
0	False	False	False	False	False	False
1	False	False	True	False	False	False
2	False	False	False	True	False	False

imp.fit(df)
imp.transform(df)

array([[ 1. ,  2. ,  3. ,  4. ,  3. ,  0. ],
       [ 5. ,  6. ,  7.5,  8. , 31. ,  0. ],
       [ 0. , 11. , 12. ,  6. ,  1. ,  1. ]])

Data Types

Data are usually split into three types: continuous, ordinal, and categorical. With continuous data the values associated with the data have specific meaning. For example, the price of a good is a continuous value. Common statistical analyses are usually appropriate for continuous values. For example, you can take the average of two continuous values and the result is a meaningful description of the data.

For ordinal data, the values are ordered, but they don’t necessarily have meaning. T-shirt size is a good example of this, where a small is less than a medium, but it’s not less by a specific value. If you converted “small” into a specific measurement (e.g. convert it to inches) it would become a continuous value.

The last type is categorical, and that’s where the differences aren’t associated with any ranking or order. For example, shirtstyle is a categorical variable. We might have “polo” and “T-shirt”, but neither is greater than the other. We could (and will) assign numbers to these categories for machine learning, but we need to remember that those numbers aren’t associated with values as they would be in continuous or ordinal data.

df = pd.DataFrame([["green", "M", 20, "polo"], ["red", "L", 15, "T-shirt"], ["red", "S", 15, "polo"]])
df.columns = ["color", "size", "price", "type"]
df

	color	size	price	type
0	green	M	20	polo
1	red	L	15	T-shirt
2	red	S	15	polo

Because size can be ranked, it is ordinal. We can map it to numbers to make it easier for machine learning.

size_mapping = {"S": 1, "M": 2, "L": 3}
df["size"] = df["size"].map(size_mapping)  # we can do a reverse mapping if we want to undo this in the end
df

	color	size	price	type
0	green	2	20	polo
1	red	3	15	T-shirt
2	red	1	15	polo

Even though the type is categorical, we can still map it to numbers

import numpy as np

class_mapping = {label: idx for idx, label in enumerate(np.unique(df["type"]))}
df["type"] = df["type"].map(class_mapping)
df

	color	size	price	type
0	green	2	20	1
1	red	3	15	0
2	red	1	15	1

After we perform whatever analysis we want to on the data, we could then invert that mapping.

inv_class_mapping = {v: k for k, v in class_mapping.items()}
df["type"] = df["type"].map(inv_class_mapping)
df

	color	size	price	type
0	green	2	20	polo
1	red	3	15	T-shirt
2	red	1	15	polo

Another way to do this is the LabelEncoder class in scikit-learn

from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df["type"].values)  # fit_transform is a shortcut for calling fit and transform separately
print(y)
class_le.inverse_transform(y)

[1 0 1]

array(['polo', 'T-shirt', 'polo'], dtype=object)

Preparing Data for Machine Learning

It’s a good idea to do one-hot encoding of categorical variables before using them for machine learning. It can also be a good idea for ordinal variables as well, although that’s not always the case. A good rule of thumb is if the mean of two values isn’t a meaningful value, that category should be one-hot encoded.

The downside of treating ordinal data as categorical is that we throw away information about the relative order. The downside of treating it as continuous data is that we introduce a notion of distance. For example, if we set “small” as “1”, “medium” as “2”, and “large” as “3”, we’re telling the model that a large is 3 times a small. This isn’t a meaningful thing to say, so it can reduce model performance.

Shuffling Data

pandas makes it very easy to shuffle data.

df.sample(frac=1).reset_index(drop=True)

	color	size	price	type
0	red	3	15	T-shirt
1	red	1	15	polo
2	green	2	20	polo

Separating Features from Labels

pandas also has a convenient way to extract the labels from a DataFrame, and that’s by using the pop method. It will remove the specified column from the DataFrame and put it into a Series of its own.

labels = df.pop("price")

You can see that it’s no longer in the DataFrame:

df.head()

	color	size	type
0	green	2	polo
1	red	3	T-shirt
2	red	1	polo

labels

  20
  15
  15
Name: price, dtype: int64

One-hot Encoding

pandas can also do one-hot encoding.

pd.get_dummies(df[["color", "size"]])

	size	color_green	color_red
0	2	1	0
1	3	0	1
2	1	0	1

Other Tricks

Don’t Truncate

pandas will by default truncate text that is over a certain limit. When working with text data, sometimes you don’t want this. Here’s how to stop it.

pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", None)

pd.set_option is a nice tool. But if for some reason you don’t want to use that, you can also set it directly.

pd.options.display.max_columns = None
pd.options.display.max_rows = None

Transposing

You can switch the rows and columns of a DataFrame by transposing it by adding .T to the end.

df = pd.DataFrame(list(zip(names, years, categories)), columns=["Name", "Year", "Category"])

df

	Name	Year	Category
0	Titus Andronicus	1592	Tragedy
1	The Comedy of Errors	1594	Comedy
2	Richard II	1595	History
3	Romeo and Juliet	1595	Tragedy
4	A Midsummer Night’s Dream	1595	Comedy
5	King John	1596	History
6	Julius Caesar	1599	Tragedy
7	Othello	1604	Tragedy
8	Macbeth	1606	Tragedy

df = df.T

df

	0	1	2	3	4	5	6	7	8
Name	Titus Andronicus	The Comedy of Errors	Richard II	Romeo and Juliet	A Midsummer Night’s Dream	King John	Julius Caesar	Othello	Macbeth
Year	1592	1594	1595	1595	1595	1596	1599	1604	1606
Category	Tragedy	Comedy	History	Tragedy	Comedy	History	Tragedy	Tragedy	Tragedy