This post demonstrates some basic techniques for the Python library pandas.
Table of Contents
- Version
- Importing Data
- Exploring Data
- Cleaning Data
- Data Types
- Preparing Data for Machine Learning
- Other Tricks
Version
I rerun this code every once in a while to ensure it’s up-to-date. Here’s the latest version it was tested on:
import pandas as pd
print(pd.__version__)
1.4.4
Importing Data
DataFrames from CSV Files
In pandas, DataFrames are the primary structure for dealing with data. They provide indexed rows and columns of a dataset, much like a spreadsheet. There are many ways to get data into DataFrames; perhaps the most common is importing CSV files.
shakespeare_path = "C:/Users/Julius/Google Drive/JupyterNotebooks/data/shakespeare.csv"
df = pd.read_csv(shakespeare_path)
df
Name | Year | Category | |
---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy |
1 | The Comedy of Errors | 1594 | Comedy |
2 | Richard II | 1595 | History |
3 | Romeo and Juliet | 1595 | Tragedy |
4 | A Midsummer Night’s Dream | 1595 | Comedy |
5 | King John | 1596 | History |
6 | Julius Caesar | 1599 | Tragedy |
7 | Othello | 1604 | Tragedy |
8 | Macbeth | 1606 | Tragedy |
Although importing from a CSV file is perhaps the most common way of getting data into a DataFrame, there are many alternatives.
There are a lot of options you can use with read_csv. One I particularly like is the ability to limit the number of rows you read, which allows you to experiment with smaller DataFrames to make debugging easier.
df = pd.read_csv(shakespeare_path, nrows=2)
df
Name | Year | Category | |
---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy |
1 | The Comedy of Errors | 1594 | Comedy |
DataFrames from Strings
Another way you can create a DataFrame is by using StringIO, which allows you to read a string as if it were a CSV file. It even allows for missing data.
from io import StringIO
csv_data = """A,B,C,D
1.0, 2.0, 3.0, 4.0
5.0, 6.0,, 8.0
0.0, 11.0, 12.0,"""
df = pd.read_csv(StringIO(csv_data))
df
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | 2.0 | 3.0 | 4.0 |
1 | 5.0 | 6.0 | NaN | 8.0 |
2 | 0.0 | 11.0 | 12.0 | NaN |
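The read_csv options work here too. For example, usecols lets you load only the columns you care about; a minimal sketch using the same inline-string approach:

```python
from io import StringIO

import pandas as pd

csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0"""

# usecols limits which columns get loaded
df = pd.read_csv(StringIO(csv_data), usecols=["A", "D"])
print(df.columns.tolist())  # ['A', 'D']
```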
DataFrames from .data Files
Machine learning datasets from the UCI Machine Learning Repository often come with the .data extension. These can also be read using pd.read_csv.
splice_path = "C:/Users/Julius/Google Drive/JupyterNotebooks/data/splice.data"
df = pd.read_csv(splice_path, names=["Class", "Instance", "Sequence"])
df.head()
Class | Instance | Sequence | |
---|---|---|---|
0 | EI | ATRINS-DONOR-521 | CCAGCTGCATCACAGGAGGCCAGCGAGCAGG... |
1 | EI | ATRINS-DONOR-905 | AGACCCGCCGGGAGGCGGAGGACCTGCAGGG... |
2 | EI | BABAPOE-DONOR-30 | GAGGTGAAGGACGTCCTTCCCCAGGAGCCGG... |
3 | EI | BABAPOE-DONOR-867 | GGGCTGCGTTGCTGGTCACATTCCTGGCAGGT... |
4 | EI | BABAPOE-DONOR-2817 | GCTCAGCCCCCAGGTCACCCAGGAACTGACGTG... |
Note that you can also read files directly from the Internet without downloading them first.
df = pd.read_csv("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/SAheart.data")
df.head()
row.names | sbp | tobacco | ldl | adiposity | famhist | typea | obesity | alcohol | age | chd | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 160 | 12.00 | 5.73 | 23.11 | Present | 49 | 25.30 | 97.20 | 52 | 1 |
1 | 2 | 144 | 0.01 | 4.41 | 28.61 | Absent | 55 | 28.87 | 2.06 | 63 | 1 |
2 | 3 | 118 | 0.08 | 3.48 | 32.28 | Present | 52 | 29.14 | 3.81 | 46 | 0 |
3 | 4 | 170 | 7.50 | 6.41 | 38.03 | Present | 51 | 31.99 | 24.26 | 58 | 1 |
4 | 5 | 134 | 13.60 | 3.50 | 27.78 | Present | 60 | 25.99 | 57.34 | 49 | 1 |
DataFrames from Lists
You can also create a DataFrame from a list. For a single list, you can put it directly in a DataFrame.
names = [
"Titus Andronicus",
"The Comedy of Errors",
"Richard II",
"Romeo and Juliet",
"A Midsummer Night’s Dream",
"King John",
"Julius Caesar",
"Othello",
"Macbeth",
]
pd.DataFrame(names, columns=["Name"])
Name | |
---|---|
0 | Titus Andronicus |
1 | The Comedy of Errors |
2 | Richard II |
3 | Romeo and Juliet |
4 | A Midsummer Night’s Dream |
5 | King John |
6 | Julius Caesar |
7 | Othello |
8 | Macbeth |
But if you’re combining multiple lists, you need to zip them first.
years = [1592, 1594, 1595, 1595, 1595, 1596, 1599, 1604, 1606]
categories = ["Tragedy", "Comedy", "History", "Tragedy", "Comedy", "History", "Tragedy", "Tragedy", "Tragedy"]
df = pd.DataFrame(list(zip(names, years, categories)), columns=["Name", "Year", "Category"])
df
Name | Year | Category | |
---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy |
1 | The Comedy of Errors | 1594 | Comedy |
2 | Richard II | 1595 | History |
3 | Romeo and Juliet | 1595 | Tragedy |
4 | A Midsummer Night’s Dream | 1595 | Comedy |
5 | King John | 1596 | History |
6 | Julius Caesar | 1599 | Tragedy |
7 | Othello | 1604 | Tragedy |
8 | Macbeth | 1606 | Tragedy |
DataFrames from Dicts
A more common way to do it is through dictionaries.
my_dict = {"Name": names, "Year": years, "Category": categories}
my_dict
{'Name': ['Titus Andronicus',
'The Comedy of Errors',
'Richard II',
'Romeo and Juliet',
'A Midsummer Night’s Dream',
'King John',
'Julius Caesar',
'Othello',
'Macbeth'],
'Year': [1592, 1594, 1595, 1595, 1595, 1596, 1599, 1604, 1606],
'Category': ['Tragedy',
'Comedy',
'History',
'Tragedy',
'Comedy',
'History',
'Tragedy',
'Tragedy',
'Tragedy']}
df = pd.DataFrame(my_dict)
df
Name | Year | Category | |
---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy |
1 | The Comedy of Errors | 1594 | Comedy |
2 | Richard II | 1595 | History |
3 | Romeo and Juliet | 1595 | Tragedy |
4 | A Midsummer Night’s Dream | 1595 | Comedy |
5 | King John | 1596 | History |
6 | Julius Caesar | 1599 | Tragedy |
7 | Othello | 1604 | Tragedy |
8 | Macbeth | 1606 | Tragedy |
If you have a simple dictionary like d = {'a': 1, 'b': 2}, you can’t just put it in pd.DataFrame, because it’ll give you a ValueError for not passing an index. Instead, you can do pd.DataFrame(list(d.items())). You could also pass column names so it looks more like this: pd.DataFrame(list(d.items()), columns=['Key', 'Value']).
You might not want the dictionary keys, in which case you can do pd.DataFrame(list(building_dict.values())).
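Putting those inline snippets together, a runnable sketch:

```python
import pandas as pd

d = {"a": 1, "b": 2}

# Each (key, value) pair becomes one row
df = pd.DataFrame(list(d.items()), columns=["Key", "Value"])
print(df)
```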
Saving a DataFrame
There are many ways to save a DataFrame, including as a pickle, CSV, Parquet, or HDF5 file. They all follow similar syntax.
df.to_csv("shakespeare.csv")
There’s something I find a little weird about saving and loading DataFrames: if you save one as above and then load it, you’ll get an extra column.
df = pd.read_csv("shakespeare.csv")
df.head()
Unnamed: 0 | Name | Year | Category | |
---|---|---|---|---|
0 | 0 | Titus Andronicus | 1592 | Tragedy |
1 | 1 | The Comedy of Errors | 1594 | Comedy |
2 | 2 | Richard II | 1595 | History |
3 | 3 | Romeo and Juliet | 1595 | Tragedy |
4 | 4 | A Midsummer Night’s Dream | 1595 | Comedy |
There are two ways to avoid this. The first is what I recommend when your row names are just 0-4 (or any other numbers), as they are above. In that case, when you save, do it like so:
df.to_csv("shakespeare.csv", index=False)
df = pd.read_csv("shakespeare.csv")
df.head()
Name | Year | Category | |
---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy |
1 | The Comedy of Errors | 1594 | Comedy |
2 | Richard II | 1595 | History |
3 | Romeo and Juliet | 1595 | Tragedy |
4 | A Midsummer Night’s Dream | 1595 | Comedy |
The second way is better when you have index names you want to keep, but it will still work in this case. First, save your DataFrame as you did at the start:
df.to_csv("shakespeare.csv")
Then when you open it, pass index_col=0
like so:
df = pd.read_csv("shakespeare.csv", index_col=0)
df.head()
Name | Year | Category | |
---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy |
1 | The Comedy of Errors | 1594 | Comedy |
2 | Richard II | 1595 | History |
3 | Romeo and Juliet | 1595 | Tragedy |
4 | A Midsummer Night’s Dream | 1595 | Comedy |
Exploring Data
Displaying Parts of the DataFrame
OK, now that we have the data in a DataFrame, let’s explore it. To get a quick preview of the data, you can use head().
df = pd.read_csv(shakespeare_path)
df.head(3)
Name | Year | Category | |
---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy |
1 | The Comedy of Errors | 1594 | Comedy |
2 | Richard II | 1595 | History |
To see a preview of the end, use tail(). If you want to see some rows at random, use sample().
df.sample(5)
Name | Year | Category | |
---|---|---|---|
1 | The Comedy of Errors | 1594 | Comedy |
3 | Romeo and Juliet | 1595 | Tragedy |
4 | A Midsummer Night’s Dream | 1595 | Comedy |
2 | Richard II | 1595 | History |
7 | Othello | 1604 | Tragedy |
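tail() works just like head(), but counting from the end. A small self-contained sketch using a slice of the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Julius Caesar", "Othello", "Macbeth"],
    "Year": [1599, 1604, 1606],
})

# Last two rows of the DataFrame
print(df.tail(2))
```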
Or you can slice the DataFrame.
df[5:]
Name | Year | Category | |
---|---|---|---|
5 | King John | 1596 | History |
6 | Julius Caesar | 1599 | Tragedy |
7 | Othello | 1604 | Tragedy |
8 | Macbeth | 1606 | Tragedy |
df[0:5:2]
Name | Year | Category | |
---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy |
2 | Richard II | 1595 | History |
4 | A Midsummer Night’s Dream | 1595 | Comedy |
You can also select by column. A single column of a pandas DataFrame is known as a Series.
type(df["Name"])
pandas.core.series.Series
You can also make selections by inequalities, such as finding all the points where a column has a value greater than a specific amount.
df[df["Year"] > 1600]
Name | Year | Category | |
---|---|---|---|
7 | Othello | 1604 | Tragedy |
8 | Macbeth | 1606 | Tragedy |
You can also select a specific cell based on its label.
df.loc[1, "Name"]
'The Comedy of Errors'
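loc also has a positional counterpart, iloc, which selects by integer position instead of label. A quick sketch with a couple of the rows above:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Titus Andronicus", "The Comedy of Errors"],
    "Year": [1592, 1594],
})

# Row 1, column 0 -- selected by position, not by label
print(df.iloc[1, 0])  # The Comedy of Errors
```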
Built-in Descriptors
There are a few built-in descriptors that are good to know.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 9 non-null object
1 Year 9 non-null int64
2 Category 9 non-null object
dtypes: int64(1), object(2)
memory usage: 344.0+ bytes
df.shape
(9, 3)
For quantitative data, there are a few more worth knowing.
Quantitative Data
You can also find the data type of each column.
df.dtypes
Name object
Year int64
Category object
dtype: object
You can do even more with quantitative data. For example, you can find the mean of any quantitative column (e.g. ints and floats).
df["Year"].mean()
1597.3333333333333
Or you can use describe to find even more.
df.describe()
Year | |
---|---|
count | 9.000000 |
mean | 1597.333333 |
std | 4.743416 |
min | 1592.000000 |
25% | 1595.000000 |
50% | 1595.000000 |
75% | 1599.000000 |
max | 1606.000000 |
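By default, describe only summarizes the numeric columns. If you also want counts, top values, and frequencies for the text columns, you can pass include="all":

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Othello", "Macbeth"],
    "Year": [1604, 1606],
    "Category": ["Tragedy", "Tragedy"],
})

# include="all" adds count/unique/top/freq rows for non-numeric columns
summary = df.describe(include="all")
print(summary)
```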
Sorting
Sorting is very simple with DataFrames.
df.sort_values("Name", ascending=True)
Name | Year | Category | |
---|---|---|---|
4 | A Midsummer Night’s Dream | 1595 | Comedy |
6 | Julius Caesar | 1599 | Tragedy |
5 | King John | 1596 | History |
8 | Macbeth | 1606 | Tragedy |
7 | Othello | 1604 | Tragedy |
2 | Richard II | 1595 | History |
3 | Romeo and Juliet | 1595 | Tragedy |
1 | The Comedy of Errors | 1594 | Comedy |
0 | Titus Andronicus | 1592 | Tragedy |
Searching for Text Within a DataFrame
df["Name"].str.contains("Julius")
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
Name: Name, dtype: bool
If you want to find both upper and lower case matches, you can pass (?i) to the regex parser to tell it to ignore case.
df["Name"].str.contains("(?i)julius")
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
Name: Name, dtype: bool
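Alternatively, str.contains takes a case parameter, which I find a bit more readable than the inline regex flag:

```python
import pandas as pd

names = pd.Series(["Julius Caesar", "Othello"], name="Name")

# case=False makes the match case-insensitive
matches = names.str.contains("julius", case=False)
print(matches)
```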
Splitting by Strings
If you want the first element from each split, you have to reference str again.
df["Name"].str.split(" ").str[0]
0 Titus
1 The
2 Richard
3 Romeo
4 A
5 King
6 Julius
7 Othello
8 Macbeth
Name: Name, dtype: object
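If instead you want each piece of the split in its own column, str.split accepts expand=True:

```python
import pandas as pd

names = pd.Series(["Romeo and Juliet", "Julius Caesar"])

# expand=True spreads the split pieces across columns,
# padding shorter rows with None
parts = names.str.split(" ", expand=True)
print(parts)
```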
Grouping and Counting by Label
df["Category"].groupby(df["Category"]).count()
Category
Comedy 2
History 2
Tragedy 5
Name: Category, dtype: int64
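A shorter way to get the same counts is value_counts:

```python
import pandas as pd

categories = pd.Series(
    ["Tragedy", "Comedy", "History", "Tragedy", "Comedy",
     "History", "Tragedy", "Tragedy", "Tragedy"],
    name="Category",
)

# Counts per unique value, sorted by frequency
counts = categories.value_counts()
print(counts)
```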
Cleaning Data
Bad values can come in many forms, including missing values, NaN, NA, ?, etc. Let’s go over how to find them.
df = pd.read_csv(StringIO(csv_data))
df
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | 2.0 | 3.0 | 4.0 |
1 | 5.0 | 6.0 | NaN | 8.0 |
2 | 0.0 | 11.0 | 12.0 | NaN |
Now we can clean that dataset. One of the first things to look for is empty values.
df.isnull()
A | B | C | D | |
---|---|---|---|---|
0 | False | False | False | False |
1 | False | False | True | False |
2 | False | False | False | True |
This works for tiny datasets, but for larger ones we won’t be able to see them all, so we can just sum up all the True values.
df.isnull().sum()
A 0
B 0
C 1
D 1
dtype: int64
Adding
To add a new column, simply declare it and make sure it has the correct number of rows.
df["E"] = df["A"] * df["B"] + 1
df
A | B | C | D | E | |
---|---|---|---|---|---|
0 | 1.0 | 2.0 | 3.0 | 4.0 | 3.0 |
1 | 5.0 | 6.0 | NaN | 8.0 | 31.0 |
2 | 0.0 | 11.0 | 12.0 | NaN | 1.0 |
You can also add columns using conditional logic.
import numpy as np
df["F"] = np.where(df["B"] > df["E"], 1, 0)
df
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2.0 | 3.0 | 4.0 | 3.0 | 0 |
1 | 5.0 | 6.0 | NaN | 8.0 | 31.0 | 0 |
2 | 0.0 | 11.0 | 12.0 | NaN | 1.0 | 1 |
Dropping
There are many ways to drop missing data. The basic dropna
drops every row that has a missing value.
df.dropna()
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2.0 | 3.0 | 4.0 | 3.0 | 0 |
Note that this doesn’t drop the data from the original DataFrame, it just returns a new one without the dropped values.
Alternatively, you could drop all the columns that have a missing value.
df.dropna(axis=1)
A | B | E | F | |
---|---|---|---|---|
0 | 1.0 | 2.0 | 3.0 | 0 |
1 | 5.0 | 6.0 | 31.0 | 0 |
2 | 0.0 | 11.0 | 1.0 | 1 |
But you may want more precision than that. One way to get it is by dropping specific rows or columns by name. To drop columns, use axis=1; to drop rows, use axis=0, or leave it out, because 0 is the default value.
df.drop([0,2])
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
1 | 5.0 | 6.0 | NaN | 8.0 | 31.0 | 0 |
df.drop(["D", "C"], axis=1)
A | B | E | F | |
---|---|---|---|---|
0 | 1.0 | 2.0 | 3.0 | 0 |
1 | 5.0 | 6.0 | 31.0 | 0 |
2 | 0.0 | 11.0 | 1.0 | 1 |
Selecting specific columns is commonly done when preparing a dataset for machine learning. For example, you might select the features and labels like so:
features = df.drop("A", axis=1)
labels = df["A"]
You can also selectively decide which rows to drop. For example, you can drop only the rows where all columns are NaN.
df.dropna(how="all")
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2.0 | 3.0 | 4.0 | 3.0 | 0 |
1 | 5.0 | 6.0 | NaN | 8.0 | 31.0 | 0 |
2 | 0.0 | 11.0 | 12.0 | NaN | 1.0 | 1 |
Or drop all rows that have fewer than 4 non-NaN values.
df.dropna(thresh=4)
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2.0 | 3.0 | 4.0 | 3.0 | 0 |
1 | 5.0 | 6.0 | NaN | 8.0 | 31.0 | 0 |
2 | 0.0 | 11.0 | 12.0 | NaN | 1.0 | 1 |
Or just drop rows where NaN appears in a specific column.
df.dropna(subset=["D"])
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2.0 | 3.0 | 4.0 | 3.0 | 0 |
1 | 5.0 | 6.0 | NaN | 8.0 | 31.0 | 0 |
Finding Missing Values
You will need to clean up missing values before doing any machine learning on the data. Scikit-learn will give you an error if you try to run an algorithm on a dataset with missing values.
pandas has two commands to find missing values in a DataFrame, isna
and isnull
. These are 100% identical. The documentation literally says isnull = isna
.
df.isnull()
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | False |
1 | False | False | True | False | False | False |
2 | False | False | False | True | False | False |
Note that pandas only treats certain values as missing. It’s possible that a user chose some other value to denote missing data that pandas doesn’t recognize. Below, many of the values could have been intended to denote missing data, but pandas only sees the None value as missing.
a = [False, 0, "na", "NaN"]
b = ["None", None, "nan", "NA"]
df_missing = pd.DataFrame(list(zip(a, b)), columns=(["a", "b"]))
df_missing
a | b | |
---|---|---|
0 | False | None |
1 | 0 | None |
2 | na | nan |
3 | NaN | NA |
df_missing.isnull()
a | b | |
---|---|---|
0 | False | False |
1 | False | True |
2 | False | False |
3 | False | False |
But when you import from a CSV, only the values pandas recognizes as missing (here, the empty fields) will show up as missing.
bad_data = """A,B,C,D
None, none, 3.0, 4.0
Missing, 6.0,, 'None'
0.0, NaN, False,"""
df2 = pd.read_csv(StringIO(bad_data))
df2
A | B | C | D | |
---|---|---|---|---|
0 | None | none | 3.0 | 4.0 |
1 | Missing | 6.0 | NaN | 'None' |
2 | 0.0 | NaN | False | NaN |
df2.isnull()
A | B | C | D | |
---|---|---|---|---|
0 | False | False | False | False |
1 | False | False | True | False |
2 | False | False | False | True |
Imputing
Sometimes deleting data isn’t the right approach. In those cases, you can impute the missing values, that is, fill them in based on the rest of the data. scikit-learn has a good tool for this.
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy="mean")
It uses the same fit and transform approach as other scikit-learn tools.
df
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2.0 | 3.0 | 4.0 | 3.0 | 0 |
1 | 5.0 | 6.0 | NaN | 8.0 | 31.0 | 0 |
2 | 0.0 | 11.0 | 12.0 | NaN | 1.0 | 1 |
df.isnull()
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | False |
1 | False | False | True | False | False | False |
2 | False | False | False | True | False | False |
imp.fit(df)
imp.transform(df)
array([[ 1. , 2. , 3. , 4. , 3. , 0. ],
[ 5. , 6. , 7.5, 8. , 31. , 0. ],
[ 0. , 11. , 12. , 6. , 1. , 1. ]])
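You can get the same mean imputation without leaving pandas by using fillna, which also keeps the result as a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "C": [3.0, None, 12.0],
    "D": [4.0, 8.0, None],
})

# Fill each column's missing values with that column's mean
filled = df.fillna(df.mean())
print(filled)
```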
Data Types
Data are usually split into three types: continuous, ordinal, and categorical. With continuous data the values associated with the data have specific meaning. For example, the price of a good is a continuous value. Common statistical analyses are usually appropriate for continuous values. For example, you can take the average of two continuous values and the result is a meaningful description of the data.
For ordinal data, the values are ordered, but they don’t necessarily have meaning. T-shirt size is a good example of this, where a small is less than a medium, but it’s not less by a specific value. If you converted “small” into a specific measurement (e.g. convert it to inches) it would become a continuous value.
The last type is categorical, where the differences aren’t associated with any ranking or order. For example, shirt style is a categorical variable. We might have “polo” and “T-shirt”, but neither is greater than the other. We could (and will) assign numbers to these categories for machine learning, but we need to remember that those numbers aren’t associated with values as they would be in continuous or ordinal data.
df = pd.DataFrame([["green", "M", 20, "polo"], ["red", "L", 15, "T-shirt"], ["red", "S", 15, "polo"]])
df.columns = ["color", "size", "price", "type"]
df
color | size | price | type | |
---|---|---|---|---|
0 | green | M | 20 | polo |
1 | red | L | 15 | T-shirt |
2 | red | S | 15 | polo |
Because size can be ranked, it is ordinal. We can map it to numbers to make it easier for machine learning.
size_mapping = {"S": 1, "M": 2, "L": 3}
df["size"] = df["size"].map(size_mapping) # we can do a reverse mapping if we want to undo this in the end
df
color | size | price | type | |
---|---|---|---|---|
0 | green | 2 | 20 | polo |
1 | red | 3 | 15 | T-shirt |
2 | red | 1 | 15 | polo |
Even though the type column is categorical, we can still map it to numbers.
import numpy as np
class_mapping = {label: idx for idx, label in enumerate(np.unique(df["type"]))}
df["type"] = df["type"].map(class_mapping)
df
color | size | price | type | |
---|---|---|---|---|
0 | green | 2 | 20 | 1 |
1 | red | 3 | 15 | 0 |
2 | red | 1 | 15 | 1 |
After we perform whatever analysis we want to on the data, we could then invert that mapping.
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df["type"] = df["type"].map(inv_class_mapping)
df
color | size | price | type | |
---|---|---|---|---|
0 | green | 2 | 20 | polo |
1 | red | 3 | 15 | T-shirt |
2 | red | 1 | 15 | polo |
Another way to do this is with the LabelEncoder class in scikit-learn.
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df["type"].values) # fit_transform is a shortcut for calling fit and transform separately
print(y)
class_le.inverse_transform(y)
[1 0 1]
array(['polo', 'T-shirt', 'polo'], dtype=object)
Preparing Data for Machine Learning
It’s a good idea to do one-hot encoding of categorical variables before using them for machine learning. It can be a good idea for ordinal variables as well, although not always. A good rule of thumb is that if the mean of two values isn’t a meaningful value, that category should be one-hot encoded.
The downside of treating ordinal data as categorical is that we throw away information about the relative order. The downside of treating it as continuous data is that we introduce a notion of distance. For example, if we set “small” as “1”, “medium” as “2”, and “large” as “3”, we’re telling the model that a large is 3 times a small. This isn’t a meaningful thing to say, so it can reduce model performance.
Shuffling Data
pandas makes it very easy to shuffle data.
df.sample(frac=1).reset_index(drop=True)
color | size | price | type | |
---|---|---|---|---|
0 | red | 3 | 15 | T-shirt |
1 | red | 1 | 15 | polo |
2 | green | 2 | 20 | polo |
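If you want the shuffle to be reproducible, pass a random_state; the same seed gives the same order every time:

```python
import pandas as pd

df = pd.DataFrame({"color": ["green", "red", "red"],
                   "price": [20, 15, 15]})

# Fixed seed, so the shuffle is repeatable
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```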
Separating Features from Labels
pandas also has a convenient way to extract the labels from a DataFrame: the pop method. It will remove the specified column from the DataFrame and return it as a Series of its own.
labels = df.pop("price")
You can see that it’s no longer in the DataFrame:
df.head()
color | size | type | |
---|---|---|---|
0 | green | 2 | polo |
1 | red | 3 | T-shirt |
2 | red | 1 | polo |
labels
0 20
1 15
2 15
Name: price, dtype: int64
One-hot Encoding
pandas can also do one-hot encoding.
pd.get_dummies(df[["color", "size"]])
size | color_green | color_red | |
---|---|---|---|
0 | 2 | 1 | 0 |
1 | 3 | 0 | 1 |
2 | 1 | 0 | 1 |
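For many models it's worth passing drop_first=True, which drops the first dummy column per feature; for a two-category feature like color, one column already carries all the information:

```python
import pandas as pd

df = pd.DataFrame({"color": ["green", "red", "red"], "size": [2, 3, 1]})

# drop_first=True removes one redundant dummy per encoded feature
dummies = pd.get_dummies(df, drop_first=True)
print(dummies.columns.tolist())  # ['size', 'color_red']
```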
Other Tricks
Don’t Truncate
pandas will by default truncate text that is over a certain limit. When working with text data, sometimes you don’t want this. Here’s how to stop it.
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", None)
pd.set_option is a nice tool, but if for some reason you don’t want to use it, you can also set these options directly.
pd.options.display.max_columns = None
pd.options.display.max_rows = None
Transposing
You can switch the rows and columns of a DataFrame by transposing it, which you do by adding .T to the end.
df = pd.DataFrame(list(zip(names, years, categories)), columns=["Name", "Year", "Category"])
df
Name | Year | Category | |
---|---|---|---|
0 | Titus Andronicus | 1592 | Tragedy |
1 | The Comedy of Errors | 1594 | Comedy |
2 | Richard II | 1595 | History |
3 | Romeo and Juliet | 1595 | Tragedy |
4 | A Midsummer Night’s Dream | 1595 | Comedy |
5 | King John | 1596 | History |
6 | Julius Caesar | 1599 | Tragedy |
7 | Othello | 1604 | Tragedy |
8 | Macbeth | 1606 | Tragedy |
df = df.T
df
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|---|
Name | Titus Andronicus | The Comedy of Errors | Richard II | Romeo and Juliet | A Midsummer Night’s Dream | King John | Julius Caesar | Othello | Macbeth |
Year | 1592 | 1594 | 1595 | 1595 | 1595 | 1596 | 1599 | 1604 | 1606 |
Category | Tragedy | Comedy | History | Tragedy | Comedy | History | Tragedy | Tragedy | Tragedy |