This notebook demonstrates some basic techniques for the Python library pandas. This is part II of the Pandas Cheat Sheet.

Table of Contents

Version

I rerun this code every once in a while to ensure it’s up-to-date. Here’s the latest version it was tested on:

import os
import pandas as pd

print(pd.__version__)
1.4.4

Setup

We’ll use the same data as last time.

shakespeare_path = os.path.join(os.getenv("gdrive"), "JupyterNotebooks/data/shakespeare.csv")
df = pd.read_csv(shakespeare_path)
df
Name Year Category
0 Titus Andronicus 1592 Tragedy
1 The Comedy of Errors 1594 Comedy
2 Richard II 1595 History
3 Romeo and Juliet 1595 Tragedy
4 A Midsummer Night’s Dream 1595 Comedy
5 King John 1596 History
6 Julius Caesar 1599 Tragedy
7 Othello 1604 Tragedy
8 Macbeth 1606 Tragedy

Applying a Function

There are many ways to apply functions to pandas DataFrames. One of the most flexible is to use .apply.

It works well with anonymous lambda functions.

df["Next Year"] = df.apply(lambda x: x["Year"] + 1, axis=1)
df
Name Year Category Next Year
0 Titus Andronicus 1592 Tragedy 1593
1 The Comedy of Errors 1594 Comedy 1595
2 Richard II 1595 History 1596
3 Romeo and Juliet 1595 Tragedy 1596
4 A Midsummer Night’s Dream 1595 Comedy 1596
5 King John 1596 History 1597
6 Julius Caesar 1599 Tragedy 1600
7 Othello 1604 Tragedy 1605
8 Macbeth 1606 Tragedy 1607
def make_year_even(row):
    if row["Year"] % 2 == 0:
        return row["Year"]
    return row["Year"] + 1

Test it by calling it on a single row.

make_year_even(df.iloc[0])
1592
df["Even Year"] = df.apply(make_year_even, axis=1)
df
Name Year Category Next Year Even Year
0 Titus Andronicus 1592 Tragedy 1593 1592
1 The Comedy of Errors 1594 Comedy 1595 1594
2 Richard II 1595 History 1596 1596
3 Romeo and Juliet 1595 Tragedy 1596 1596
4 A Midsummer Night’s Dream 1595 Comedy 1596 1596
5 King John 1596 History 1597 1596
6 Julius Caesar 1599 Tragedy 1600 1600
7 Othello 1604 Tragedy 1605 1604
8 Macbeth 1606 Tragedy 1607 1606

You can also use functions that require arguments.

def add_number(row, number=1):
    return row["Year"] + number
add_number(df.iloc[0], 10)
1602

Then pass the arguments as a list.

df["Next Decade"] = df.apply(add_number, axis=1, args=[10])
df
Name Year Category Next Year Even Year Next Decade
0 Titus Andronicus 1592 Tragedy 1593 1592 1602
1 The Comedy of Errors 1594 Comedy 1595 1594 1604
2 Richard II 1595 History 1596 1596 1605
3 Romeo and Juliet 1595 Tragedy 1596 1596 1605
4 A Midsummer Night’s Dream 1595 Comedy 1596 1596 1605
5 King John 1596 History 1597 1596 1606
6 Julius Caesar 1599 Tragedy 1600 1600 1609
7 Othello 1604 Tragedy 1605 1604 1614
8 Macbeth 1606 Tragedy 1607 1606 1616

Note that if you try to pass the arguments in a tuple you’ll get an error.

try:
    df["Next Decade"] = df.apply(add_number, axis=1, args=(10))
except TypeError as err:
    print(err)
add_number() argument after * must be an iterable, not int

Selecting Data

.loc vs .iloc

.loc searches for labels by name (aka label-based indexing), .iloc gets rows by index number (aka positional indexing)

For a DataFrame where the index matches the row number (i.e. it starts at 0 and doesn’t skip any values), there is no difference between .loc and .iloc

df.loc[1]
Name           The Comedy of Errors
Year                           1594
Category                     Comedy
Next Year                      1595
Even Year                      1594
Next Decade                    1604
Name: 1, dtype: object
df.iloc[1]
Name           The Comedy of Errors
Year                           1594
Category                     Comedy
Next Year                      1595
Even Year                      1594
Next Decade                    1604
Name: 1, dtype: object

But once the index and row numbers become different, you have to be careful about which one to use.

new_df = df.drop(1)
new_df
Name Year Category Next Year Even Year Next Decade
0 Titus Andronicus 1592 Tragedy 1593 1592 1602
2 Richard II 1595 History 1596 1596 1605
3 Romeo and Juliet 1595 Tragedy 1596 1596 1605
4 A Midsummer Night’s Dream 1595 Comedy 1596 1596 1605
5 King John 1596 History 1597 1596 1606
6 Julius Caesar 1599 Tragedy 1600 1600 1609
7 Othello 1604 Tragedy 1605 1604 1614
8 Macbeth 1606 Tragedy 1607 1606 1616
new_df.iloc[1]
Name           Richard II
Year                 1595
Category          History
Next Year            1596
Even Year            1596
Next Decade          1605
Name: 2, dtype: object

There is still an index of value 1, but there is no longer a label. So if we try .loc[1], we’ll get an error.

try:
    new_df.loc[1]
except KeyError:
    print("There is nothing with index 1")
There is nothing with index 1

If you want to update your indices you can do so with reset_index.

new_df.reset_index(drop=True)
Name Year Category Next Year Even Year Next Decade
0 Titus Andronicus 1592 Tragedy 1593 1592 1602
1 Richard II 1595 History 1596 1596 1605
2 Romeo and Juliet 1595 Tragedy 1596 1596 1605
3 A Midsummer Night’s Dream 1595 Comedy 1596 1596 1605
4 King John 1596 History 1597 1596 1606
5 Julius Caesar 1599 Tragedy 1600 1600 1609
6 Othello 1604 Tragedy 1605 1604 1614
7 Macbeth 1606 Tragedy 1607 1606 1616

When you use .iloc, you can treat the results just like a numpy matrix. This means that trailing colons are always optional (just like numpy).

Using the Index Name

In certain cases, you might want to use a specific column’s values as the DataFrame index for easier data retrieval or analysis. For example, if you have a DataFrame where each row represents a distinct entity (like a play by Shakespeare), setting the name of the play as the index could be beneficial. Here’s how you can do that:

new_df.set_index('Name', inplace=True)
new_df
Year Category Next Year Even Year Next Decade
Name
Titus Andronicus 1592 Tragedy 1593 1592 1602
Richard II 1595 History 1596 1596 1605
Romeo and Juliet 1595 Tragedy 1596 1596 1605
A Midsummer Night’s Dream 1595 Comedy 1596 1596 1605
King John 1596 History 1597 1596 1606
Julius Caesar 1599 Tragedy 1600 1600 1609
Othello 1604 Tragedy 1605 1604 1614
Macbeth 1606 Tragedy 1607 1606 1616

Now, the index is the name. You can look at it like so:

new_df.index
Index(['Titus Andronicus', 'Richard II', 'Romeo and Juliet',
       'A Midsummer Night’s Dream', 'King John', 'Julius Caesar', 'Othello',
       'Macbeth'],
      dtype='object', name='Name')

It’s no longer a normal column, so you can’t call it as if it was.

try:
    new_df['Name']
except KeyError as err:
    print("You can't treat the index like another column.")
You can't treat the index like another column.

If you want to get all the values in the index, you have to do this:

new_df.index.tolist()
['Titus Andronicus',
 'Richard II',
 'Romeo and Juliet',
 'A Midsummer Night’s Dream',
 'King John',
 'Julius Caesar',
 'Othello',
 'Macbeth']

Editing DataFrames

using .at and .iat

.at provides label-based scalar lookups, allowing you to access a single value for a row/column label pair. It’s useful for quickly updating a specific value.

.iat provides integer-based lookups by position, allowing you to access a single value by row/column position. It’s handy for editing values when you know the position in the DataFrame.

.at and .iat aren’t as commonly used, but they are the recommended way to edit pandas DataFrames.

df
Name Year Category Next Year Even Year Next Decade
0 Titus Andronicus 1592 Tragedy 1593 1592 1602
1 The Comedy of Errors 1594 Comedy 1595 1594 1604
2 Richard II 1595 History 1596 1596 1605
3 Romeo and Juliet 1595 Tragedy 1596 1596 1605
4 A Midsummer Night’s Dream 1595 Comedy 1596 1596 1605
5 King John 1596 History 1597 1596 1606
6 Julius Caesar 1599 Tragedy 1600 1600 1609
7 Othello 1604 Tragedy 1605 1604 1614
8 Macbeth 1606 Tragedy 1607 1606 1616
df.at[2, "Name"]
'Richard II'
df.at[2, "Name"] = "Richard the Second"
df
Name Year Category Next Year Even Year Next Decade
0 Titus Andronicus 1592 Tragedy 1593 1592 1602
1 The Comedy of Errors 1594 Comedy 1595 1594 1604
2 Richard the Second 1595 History 1596 1596 1605
3 Romeo and Juliet 1595 Tragedy 1596 1596 1605
4 A Midsummer Night’s Dream 1595 Comedy 1596 1596 1605
5 King John 1596 History 1597 1596 1606
6 Julius Caesar 1599 Tragedy 1600 1600 1609
7 Othello 1604 Tragedy 1605 1604 1614
8 Macbeth 1606 Tragedy 1607 1606 1616

You can also query by row and column number. Note the index doesn’t count, so column 2 is “Category”.

df.iat[1, 2]
'Comedy'
df.iat[1, 2] = "Tragicomedy"
df
Name Year Category Next Year Even Year Next Decade
0 Titus Andronicus 1592 Tragedy 1593 1592 1602
1 The Comedy of Errors 1594 Tragicomedy 1595 1594 1604
2 Richard the Second 1595 History 1596 1596 1605
3 Romeo and Juliet 1595 Tragedy 1596 1596 1605
4 A Midsummer Night’s Dream 1595 Comedy 1596 1596 1605
5 King John 1596 History 1597 1596 1606
6 Julius Caesar 1599 Tragedy 1600 1600 1609
7 Othello 1604 Tragedy 1605 1604 1614
8 Macbeth 1606 Tragedy 1607 1606 1616

Changing Specific Rows

Note that there are a lot of wrong ways to do this in pandas.

df
Name Year Category Next Year Even Year Next Decade
0 Titus Andronicus 1592 Tragedy 1593 1592 1602
1 The Comedy of Errors 1594 Tragicomedy 1595 1594 1604
2 Richard the Second 1595 History 1596 1596 1605
3 Romeo and Juliet 1595 Tragedy 1596 1596 1605
4 A Midsummer Night’s Dream 1595 Comedy 1596 1596 1605
5 King John 1596 History 1597 1596 1606
6 Julius Caesar 1599 Tragedy 1600 1600 1609
7 Othello 1604 Tragedy 1605 1604 1614
8 Macbeth 1606 Tragedy 1607 1606 1616

Here you might think you’re setting the value, but you’re not.

df.loc[df.index[-4:]]["Category"] = "Drama"
df
Name Year Category Next Year Even Year Next Decade
0 Titus Andronicus 1592 Tragedy 1593 1592 1602
1 The Comedy of Errors 1594 Tragicomedy 1595 1594 1604
2 Richard the Second 1595 History 1596 1596 1605
3 Romeo and Juliet 1595 Tragedy 1596 1596 1605
4 A Midsummer Night’s Dream 1595 Comedy 1596 1596 1605
5 King John 1596 History 1597 1596 1606
6 Julius Caesar 1599 Tragedy 1600 1600 1609
7 Othello 1604 Tragedy 1605 1604 1614
8 Macbeth 1606 Tragedy 1607 1606 1616

This is how you have to do it.

df.loc[df.index[-4:], "Category"] = "Drama"
df
Name Year Category Next Year Even Year Next Decade
0 Titus Andronicus 1592 Tragedy 1593 1592 1602
1 The Comedy of Errors 1594 Tragicomedy 1595 1594 1604
2 Richard the Second 1595 History 1596 1596 1605
3 Romeo and Juliet 1595 Tragedy 1596 1596 1605
4 A Midsummer Night’s Dream 1595 Comedy 1596 1596 1605
5 King John 1596 Drama 1597 1596 1606
6 Julius Caesar 1599 Drama 1600 1600 1609
7 Othello 1604 Drama 1605 1604 1614
8 Macbeth 1606 Drama 1607 1606 1616

Checking if Value Is Present Inside Pandas Series

ds = pd.Series([{1,2,3}, {2,3,4}, {5,6,7}])
ds
0    {1, 2, 3}
1    {2, 3, 4}
2    {5, 6, 7}
dtype: object

We can’t do it this way:

3 in ds
False

Instead we have to apply that to every row.

ds.apply(lambda row: 3 in row)
0     True
1     True
2    False
dtype: bool