In this notebook, we’ll demonstrate some data exploration techniques using the famous iris dataset. In the second notebook, we’ll use the same dataset to visualize several machine learning algorithms.

Iris

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="dark")

Load the Data

Load the data using seaborn. The dataset is also available from Scikit-learn and Keras, but it loads as a pandas DataFrame from seaborn, saving a step.

df = sns.load_dataset("iris")
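
For comparison, here is a minimal sketch of the scikit-learn route, assuming a scikit-learn version that supports `as_frame=True`. The helper name `load_iris_from_sklearn` and the column renaming are our own choices, made to mirror seaborn’s layout; they are not part of either library.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_from_sklearn() -> pd.DataFrame:
    """Return the iris data as a DataFrame laid out like seaborn's version."""
    bunch = load_iris(as_frame=True)
    out = bunch.data.copy()
    # Rename 'sepal length (cm)' -> 'sepal_length', and so on
    out.columns = [c.replace(" (cm)", "").replace(" ", "_") for c in out.columns]
    # Map the integer targets (0, 1, 2) to their species names
    out["species"] = bunch.target_names[bunch.target.to_numpy()]
    return out


sk_df = load_iris_from_sklearn()
```

The renaming and label mapping are the extra step that `sns.load_dataset("iris")` saves us.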

Explore

Let’s look at what features are in the data set.

df.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Then check how many samples of each species are recorded.

df['species'].value_counts()
setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

Next, let’s check the column types and compute some basic summary statistics.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

df.describe()
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
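
These statistics pool all three species together. A per-species breakdown is often more telling, so here is a minimal sketch of one way to get it. The helper name `per_species_summary` is hypothetical, and the choice of mean and standard deviation is just one reasonable option.

```python
import pandas as pd


def per_species_summary(df: pd.DataFrame, label: str = "species") -> pd.DataFrame:
    """Mean and standard deviation of every numeric column, per class."""
    numeric = df.select_dtypes("number")
    # Group the numeric columns by the label column and aggregate each one
    return numeric.groupby(df[label]).agg(["mean", "std"])
```

With the DataFrame loaded above, this would be called as `per_species_summary(df)`; the result has one row per species and a (feature, statistic) pair for each column.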

Fortunately, the dataset is clean: every column has 150 non-null values and the classes are balanced, so we can jump right into visualization.
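
For a dataset we trust less, a quick programmatic check is worth the few lines. The sketch below is one possible version; `quality_report` is a hypothetical helper name, and counting missing values and duplicate rows is only a starting point.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Collect a few basic data-quality counts for a DataFrame."""
    return {
        "rows": len(df),
        # Number of missing values in each column
        "missing_per_column": df.isna().sum().to_dict(),
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
    }
```

Calling `quality_report(df)` returns a plain dictionary. Note that a duplicate row is not necessarily an error here, since two flowers can share the same measurements.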

Visualize

Let’s see how the different categories compare with each other.

hue_order = df['species'].unique()[::-1]
palette = sns.color_palette('bright')
sns.pairplot(df, hue="species", hue_order=hue_order, palette=palette, markers=["o", "s", "D"], diag_kind='kde');

[Figure: pair plot of the four iris features, colored by species, with KDE plots on the diagonal]

Nothing looks noticeably wrong with the data, and there don’t appear to be any extreme outliers that would confound a model.
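
Eyeballing a plot is subjective, so it can help to back it up with a simple rule. The sketch below uses the common 1.5 × IQR heuristic within each species. This is our own choice of rule, not something the plot itself applies, and `iqr_outlier_counts` is a hypothetical helper name.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, label: str = "species", k: float = 1.5) -> pd.DataFrame:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], per class and per feature."""
    numeric = df.select_dtypes("number")
    grouped = numeric.groupby(df[label])
    # Broadcast each class's quartiles back onto that class's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    flagged = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    # Summing booleans gives the number of flagged values per class
    return flagged.groupby(df[label]).sum()
```

Any points that `iqr_outlier_counts(df)` flags are candidates for a closer look, not automatically errors.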

Petal length and petal width appear to be good variables for distinguishing the species, especially setosa. Let’s take a closer look at those.
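
As a rough numeric check on that impression, we can compare how far apart the species means are with how spread out each species is. The sketch below computes a between-class to within-class sum-of-squares ratio per feature. The helper name `between_within_ratio` is hypothetical, and this ratio is just one of several possible separation scores.

```python
import pandas as pd


def between_within_ratio(df: pd.DataFrame, label: str = "species") -> pd.Series:
    """Between-class over within-class sum of squares, per numeric feature."""
    features = df.select_dtypes("number")
    groups = features.groupby(df[label])
    class_means = groups.mean()
    class_sizes = groups.size()
    # Spread of the class means around the overall mean, weighted by class size
    between = ((class_means - features.mean()) ** 2).mul(class_sizes, axis=0).sum()
    # Spread of each sample around its own class mean
    within = ((features - groups.transform("mean")) ** 2).sum()
    return (between / within).sort_values(ascending=False)
```

Features with a larger ratio separate the classes more cleanly on their own; `between_within_ratio(df)` returns them sorted from most to least separating.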

sns.FacetGrid(df, hue='species', hue_order=hue_order, palette=palette, height=8) \
    .map(plt.scatter, 'petal_length','petal_width') \
    .add_legend();

[Figure: scatter plot of petal_length vs. petal_width, colored by species]

OK, it will be very easy to separate setosa from the other two species.
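
To make “very easy” concrete, here is a sketch of a one-line rule based on petal length alone. The threshold of 2.5 is an assumption read off the gap in the scatter plot, not a fitted value, and both function names are hypothetical.

```python
import pandas as pd


def flag_setosa(df: pd.DataFrame, threshold: float = 2.5) -> pd.Series:
    """Guess 'setosa' for every row whose petal length is below the threshold."""
    return df["petal_length"] < threshold


def rule_accuracy(df: pd.DataFrame, threshold: float = 2.5) -> float:
    """Fraction of rows where the threshold rule agrees with the true label."""
    predicted = flag_setosa(df, threshold)
    actual = df["species"] == "setosa"
    return float((predicted == actual).mean())
```

Running `rule_accuracy(df)` on the full dataset shows how well this single cut isolates setosa.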

Let’s see what the best way to separate versicolor from virginica is. We’ll create a new DataFrame with just the two species we’re focusing on.

# Exclude setosa
vvdf = df[df['species'] != 'setosa']
sns.pairplot(vvdf, hue="species", hue_order=hue_order, palette=palette, diag_kind='kde');

[Figure: pair plot of versicolor and virginica only, with KDE plots on the diagonal]
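
The pair plot can be complemented with a rough numeric check: for a given feature, what fraction of the samples falls in the range of values that both species share? The sketch below is one simple way to measure that. `overlap_fraction` is a hypothetical helper, and range overlap is a crude measure that is sensitive to extreme values.

```python
import pandas as pd


def overlap_fraction(
    df: pd.DataFrame,
    feature: str,
    a: str = "versicolor",
    b: str = "virginica",
    label: str = "species",
) -> float:
    """Fraction of rows from classes a and b whose value lies in the range both classes share."""
    xa = df.loc[df[label] == a, feature]
    xb = df.loc[df[label] == b, feature]
    # The shared range runs from the larger minimum to the smaller maximum
    low = max(xa.min(), xb.min())
    high = min(xa.max(), xb.max())
    if low > high:
        return 0.0  # the two ranges do not touch at all
    both = pd.concat([xa, xb])
    return float(both.between(low, high).mean())
```

For example, `overlap_fraction(vvdf, "petal_length")` gives the share of versicolor and virginica samples that a single cut on petal length could not cleanly assign; 0.0 would mean the two ranges are disjoint.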

OK, these are not as easy to separate. We may have to do the best that we can. In Part II, we’ll look at how we can use machine learning models to analyze the data.