This notebook demonstrates some basic techniques for the Python library pandas. This is part II of the Pandas Cheat Sheet.
Table of Contents
- Version
- Setup
- Selecting Data
- Editing DataFrames
- Changing Specific Rows
- Checking if Value Is Present Inside Pandas Series
Version
I rerun this code every once in a while to ensure it’s up-to-date. Here’s the latest version it was tested on:
import os
import pandas as pd
print(pd.__version__)
1.4.4
Setup
We’ll use the same data as last time.
shakespeare_path = os.path.join(os.getenv("gdrive"), "JupyterNotebooks/data/shakespeare.csv")
df = pd.read_csv(shakespeare_path)
df
Name | Year | Category | |
---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy |
1 | The Comedy of Errors | 1594 | Comedy |
2 | Richard II | 1595 | History |
3 | Romeo and Juliet | 1595 | Tragedy |
4 | A Midsummer Night’s Dream | 1595 | Comedy |
5 | King John | 1596 | History |
6 | Julius Caesar | 1599 | Tragedy |
7 | Othello | 1604 | Tragedy |
8 | Macbeth | 1606 | Tragedy |
Applying a Function
There are many ways to apply functions to pandas DataFrames. One of the most flexible is to use .apply
.
It works well with anonymous lambda
functions.
df["Next Year"] = df.apply(lambda x: x["Year"] + 1, axis=1)
df
Name | Year | Category | Next Year | |
---|---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy | 1593 |
1 | The Comedy of Errors | 1594 | Comedy | 1595 |
2 | Richard II | 1595 | History | 1596 |
3 | Romeo and Juliet | 1595 | Tragedy | 1596 |
4 | A Midsummer Night’s Dream | 1595 | Comedy | 1596 |
5 | King John | 1596 | History | 1597 |
6 | Julius Caesar | 1599 | Tragedy | 1600 |
7 | Othello | 1604 | Tragedy | 1605 |
8 | Macbeth | 1606 | Tragedy | 1607 |
def make_year_even(row):
if row["Year"] % 2 == 0:
return row["Year"]
return row["Year"] + 1
Test it by calling it on a single row.
make_year_even(df.iloc[0])
1592
df["Even Year"] = df.apply(make_year_even, axis=1)
df
Name | Year | Category | Next Year | Even Year | |
---|---|---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy | 1593 | 1592 |
1 | The Comedy of Errors | 1594 | Comedy | 1595 | 1594 |
2 | Richard II | 1595 | History | 1596 | 1596 |
3 | Romeo and Juliet | 1595 | Tragedy | 1596 | 1596 |
4 | A Midsummer Night’s Dream | 1595 | Comedy | 1596 | 1596 |
5 | King John | 1596 | History | 1597 | 1596 |
6 | Julius Caesar | 1599 | Tragedy | 1600 | 1600 |
7 | Othello | 1604 | Tragedy | 1605 | 1604 |
8 | Macbeth | 1606 | Tragedy | 1607 | 1606 |
You can also use functions that require arguments.
def add_number(row, number=1):
return row["Year"] + number
add_number(df.iloc[0], 10)
1602
Then pass the arguments as a list.
df["Next Decade"] = df.apply(add_number, axis=1, args=[10])
df
Name | Year | Category | Next Year | Even Year | Next Decade | |
---|---|---|---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy | 1593 | 1592 | 1602 |
1 | The Comedy of Errors | 1594 | Comedy | 1595 | 1594 | 1604 |
2 | Richard II | 1595 | History | 1596 | 1596 | 1605 |
3 | Romeo and Juliet | 1595 | Tragedy | 1596 | 1596 | 1605 |
4 | A Midsummer Night’s Dream | 1595 | Comedy | 1596 | 1596 | 1605 |
5 | King John | 1596 | History | 1597 | 1596 | 1606 |
6 | Julius Caesar | 1599 | Tragedy | 1600 | 1600 | 1609 |
7 | Othello | 1604 | Tragedy | 1605 | 1604 | 1614 |
8 | Macbeth | 1606 | Tragedy | 1607 | 1606 | 1616 |
Note that if you try to pass the arguments in a tuple you’ll get an error.
try:
df["Next Decade"] = df.apply(add_number, axis=1, args=(10))
except TypeError as err:
print(err)
add_number() argument after * must be an iterable, not int
Selecting Data
.loc vs .iloc
.loc searches for labels by name (aka label-based indexing), .iloc gets rows by index number (aka positional indexing)
For a DataFrame where the index matches the row number (i.e. it starts at 0 and doesn’t skip any values), there is no difference between .loc and .iloc
df.loc[1]
Name The Comedy of Errors
Year 1594
Category Comedy
Next Year 1595
Even Year 1594
Next Decade 1604
Name: 1, dtype: object
df.iloc[1]
Name The Comedy of Errors
Year 1594
Category Comedy
Next Year 1595
Even Year 1594
Next Decade 1604
Name: 1, dtype: object
But once the index and row numbers become different, you have to be careful about which one to use.
new_df = df.drop(1)
new_df
Name | Year | Category | Next Year | Even Year | Next Decade | |
---|---|---|---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy | 1593 | 1592 | 1602 |
2 | Richard II | 1595 | History | 1596 | 1596 | 1605 |
3 | Romeo and Juliet | 1595 | Tragedy | 1596 | 1596 | 1605 |
4 | A Midsummer Night’s Dream | 1595 | Comedy | 1596 | 1596 | 1605 |
5 | King John | 1596 | History | 1597 | 1596 | 1606 |
6 | Julius Caesar | 1599 | Tragedy | 1600 | 1600 | 1609 |
7 | Othello | 1604 | Tragedy | 1605 | 1604 | 1614 |
8 | Macbeth | 1606 | Tragedy | 1607 | 1606 | 1616 |
new_df.iloc[1]
Name Richard II
Year 1595
Category History
Next Year 1596
Even Year 1596
Next Decade 1605
Name: 2, dtype: object
There is still an index of value 1, but there is no longer a label. So if we try .loc[1]
, we’ll get an error.
try:
new_df.loc[1]
except KeyError:
print("There is nothing with index 1")
There is nothing with index 1
If you want to update your indices you can do so with reset_index
.
new_df.reset_index(drop=True)
Name | Year | Category | Next Year | Even Year | Next Decade | |
---|---|---|---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy | 1593 | 1592 | 1602 |
1 | Richard II | 1595 | History | 1596 | 1596 | 1605 |
2 | Romeo and Juliet | 1595 | Tragedy | 1596 | 1596 | 1605 |
3 | A Midsummer Night’s Dream | 1595 | Comedy | 1596 | 1596 | 1605 |
4 | King John | 1596 | History | 1597 | 1596 | 1606 |
5 | Julius Caesar | 1599 | Tragedy | 1600 | 1600 | 1609 |
6 | Othello | 1604 | Tragedy | 1605 | 1604 | 1614 |
7 | Macbeth | 1606 | Tragedy | 1607 | 1606 | 1616 |
When you use .iloc
, you can treat the results just like a numpy
matrix. This means that trailing colons are always optional (just like numpy
).
Using the Index Name
In certain cases, you might want to use a specific column’s values as the DataFrame index for easier data retrieval or analysis. For example, if you have a DataFrame where each row represents a distinct entity (like a play by Shakespeare), setting the name of the play as the index could be beneficial. Here’s how you can do that:
new_df.set_index('Name', inplace=True)
new_df
Year | Category | Next Year | Even Year | Next Decade | |
---|---|---|---|---|---|
Name | |||||
Titus Andronicus | 1592 | Tragedy | 1593 | 1592 | 1602 |
Richard II | 1595 | History | 1596 | 1596 | 1605 |
Romeo and Juliet | 1595 | Tragedy | 1596 | 1596 | 1605 |
A Midsummer Night’s Dream | 1595 | Comedy | 1596 | 1596 | 1605 |
King John | 1596 | History | 1597 | 1596 | 1606 |
Julius Caesar | 1599 | Tragedy | 1600 | 1600 | 1609 |
Othello | 1604 | Tragedy | 1605 | 1604 | 1614 |
Macbeth | 1606 | Tragedy | 1607 | 1606 | 1616 |
Now, the index is the name. You can look at it like so:
new_df.index
Index(['Titus Andronicus', 'Richard II', 'Romeo and Juliet',
'A Midsummer Night’s Dream', 'King John', 'Julius Caesar', 'Othello',
'Macbeth'],
dtype='object', name='Name')
It’s no longer a normal column, so you can’t call it as if it was.
try:
new_df['Name']
except KeyError as err:
print("You can't treat the index like another column.")
You can't treat the index like another column.
If you want to get all the values in the index, you have to do this:
new_df.index.tolist()
['Titus Andronicus',
'Richard II',
'Romeo and Juliet',
'A Midsummer Night’s Dream',
'King John',
'Julius Caesar',
'Othello',
'Macbeth']
Editing DataFrames
using .at and .iat
.at
provides label-based scalar lookups, allowing you to access a single value for a row/column label pair. It’s useful for quickly updating a specific value.
.iat
provides integer-based lookups by position, allowing you to access a single value by row/column position. It’s handy for editing values when you know the position in the DataFrame.
.at
and .iat
aren’t as commonly used, but they are the recommended way to edit pandas DataFrames.
df
Name | Year | Category | Next Year | Even Year | Next Decade | |
---|---|---|---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy | 1593 | 1592 | 1602 |
1 | The Comedy of Errors | 1594 | Comedy | 1595 | 1594 | 1604 |
2 | Richard II | 1595 | History | 1596 | 1596 | 1605 |
3 | Romeo and Juliet | 1595 | Tragedy | 1596 | 1596 | 1605 |
4 | A Midsummer Night’s Dream | 1595 | Comedy | 1596 | 1596 | 1605 |
5 | King John | 1596 | History | 1597 | 1596 | 1606 |
6 | Julius Caesar | 1599 | Tragedy | 1600 | 1600 | 1609 |
7 | Othello | 1604 | Tragedy | 1605 | 1604 | 1614 |
8 | Macbeth | 1606 | Tragedy | 1607 | 1606 | 1616 |
df.at[2, "Name"]
'Richard II'
df.at[2, "Name"] = "Richard the Second"
df
Name | Year | Category | Next Year | Even Year | Next Decade | |
---|---|---|---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy | 1593 | 1592 | 1602 |
1 | The Comedy of Errors | 1594 | Comedy | 1595 | 1594 | 1604 |
2 | Richard the Second | 1595 | History | 1596 | 1596 | 1605 |
3 | Romeo and Juliet | 1595 | Tragedy | 1596 | 1596 | 1605 |
4 | A Midsummer Night’s Dream | 1595 | Comedy | 1596 | 1596 | 1605 |
5 | King John | 1596 | History | 1597 | 1596 | 1606 |
6 | Julius Caesar | 1599 | Tragedy | 1600 | 1600 | 1609 |
7 | Othello | 1604 | Tragedy | 1605 | 1604 | 1614 |
8 | Macbeth | 1606 | Tragedy | 1607 | 1606 | 1616 |
You can also query by row and column number. Note the index doesn’t count, so column 2 is “Category”.
df.iat[1, 2]
'Comedy'
df.iat[1, 2] = "Tragicomedy"
df
Name | Year | Category | Next Year | Even Year | Next Decade | |
---|---|---|---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy | 1593 | 1592 | 1602 |
1 | The Comedy of Errors | 1594 | Tragicomedy | 1595 | 1594 | 1604 |
2 | Richard the Second | 1595 | History | 1596 | 1596 | 1605 |
3 | Romeo and Juliet | 1595 | Tragedy | 1596 | 1596 | 1605 |
4 | A Midsummer Night’s Dream | 1595 | Comedy | 1596 | 1596 | 1605 |
5 | King John | 1596 | History | 1597 | 1596 | 1606 |
6 | Julius Caesar | 1599 | Tragedy | 1600 | 1600 | 1609 |
7 | Othello | 1604 | Tragedy | 1605 | 1604 | 1614 |
8 | Macbeth | 1606 | Tragedy | 1607 | 1606 | 1616 |
Changing Specific Rows
Note that there are a lot of wrong ways to do this in pandas.
df
Name | Year | Category | Next Year | Even Year | Next Decade | |
---|---|---|---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy | 1593 | 1592 | 1602 |
1 | The Comedy of Errors | 1594 | Tragicomedy | 1595 | 1594 | 1604 |
2 | Richard the Second | 1595 | History | 1596 | 1596 | 1605 |
3 | Romeo and Juliet | 1595 | Tragedy | 1596 | 1596 | 1605 |
4 | A Midsummer Night’s Dream | 1595 | Comedy | 1596 | 1596 | 1605 |
5 | King John | 1596 | History | 1597 | 1596 | 1606 |
6 | Julius Caesar | 1599 | Tragedy | 1600 | 1600 | 1609 |
7 | Othello | 1604 | Tragedy | 1605 | 1604 | 1614 |
8 | Macbeth | 1606 | Tragedy | 1607 | 1606 | 1616 |
Here you might think you’re setting the value, but you’re not.
df.loc[df.index[-4:]]["Category"] = "Drama"
df
Name | Year | Category | Next Year | Even Year | Next Decade | |
---|---|---|---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy | 1593 | 1592 | 1602 |
1 | The Comedy of Errors | 1594 | Tragicomedy | 1595 | 1594 | 1604 |
2 | Richard the Second | 1595 | History | 1596 | 1596 | 1605 |
3 | Romeo and Juliet | 1595 | Tragedy | 1596 | 1596 | 1605 |
4 | A Midsummer Night’s Dream | 1595 | Comedy | 1596 | 1596 | 1605 |
5 | King John | 1596 | History | 1597 | 1596 | 1606 |
6 | Julius Caesar | 1599 | Tragedy | 1600 | 1600 | 1609 |
7 | Othello | 1604 | Tragedy | 1605 | 1604 | 1614 |
8 | Macbeth | 1606 | Tragedy | 1607 | 1606 | 1616 |
This is how you have to do it.
df.loc[df.index[-4:], "Category"] = "Drama"
df
Name | Year | Category | Next Year | Even Year | Next Decade | |
---|---|---|---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy | 1593 | 1592 | 1602 |
1 | The Comedy of Errors | 1594 | Tragicomedy | 1595 | 1594 | 1604 |
2 | Richard the Second | 1595 | History | 1596 | 1596 | 1605 |
3 | Romeo and Juliet | 1595 | Tragedy | 1596 | 1596 | 1605 |
4 | A Midsummer Night’s Dream | 1595 | Comedy | 1596 | 1596 | 1605 |
5 | King John | 1596 | Drama | 1597 | 1596 | 1606 |
6 | Julius Caesar | 1599 | Drama | 1600 | 1600 | 1609 |
7 | Othello | 1604 | Drama | 1605 | 1604 | 1614 |
8 | Macbeth | 1606 | Drama | 1607 | 1606 | 1616 |
Checking if Value Is Present Inside Pandas Series
ds = pd.Series([{1,2,3}, {2,3,4}, {5,6,7}])
ds
0 {1, 2, 3}
1 {2, 3, 4}
2 {5, 6, 7}
dtype: object
We can’t do it this way:
3 in ds
False
Instead we have to apply that to every row.
ds.apply(lambda row: 3 in row)
0 True
1 True
2 False
dtype: bool