YData Profiling used to be known as pandas-profiling, but it has moved to a new name and a new home. I talked about it in my post on cleaning DNA splice junction data, but since it was kind of buried in that post and the name has changed, I thought I would do a quick tutorial that only covers YData Profiling. There isn’t much to demo here because it does so much of the work for you, but I’ll still go over it.
ydata_profiling is a Python library that generates comprehensive reports from a pandas or Spark DataFrame. These reports include detailed exploratory data analysis, providing insights into missing data, variable distributions, correlations, and much more. It’s a powerful tool for initial data investigation and can save a lot of time in the data understanding phase of a project.
import pandas as pd
import seaborn as sns
from ydata_profiling import ProfileReport # pip install ydata_profiling if you haven't installed it
Let’s grab the Titanic dataset.
df = sns.load_dataset('titanic')
df
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S | Second | man | True | NaN | Southampton | no | True |
887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |
888 | 0 | 3 | female | NaN | 1 | 2 | 23.4500 | S | Third | woman | False | NaN | Southampton | no | False |
889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C | First | man | True | C | Cherbourg | yes | True |
890 | 0 | 3 | male | 32.0 | 0 | 0 | 7.7500 | Q | Third | man | True | NaN | Queenstown | no | True |
891 rows × 15 columns
# Generate the profile report
profile = ProfileReport(df, title='Titanic Data Report', explorative=False)
To view the report, you can use profile.to_widgets(). That doesn’t display well on the blog, so instead I’ll use profile.to_notebook_iframe().
# profile.to_widgets()  # doesn't display well on the blog, but it's what I recommend in a notebook
profile.to_notebook_iframe()
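If you’d rather have the report as a file you can open in a browser or share, one option is to write it out with to_file. Here’s a minimal sketch. The save_report helper and the file name are made up for this example, and I put the import inside the function so the snippet only needs ydata_profiling when you actually call it.

```python
import pandas as pd


def save_report(df: pd.DataFrame, path: str = "titanic_report.html") -> str:
    """Profile df and write the report to path (hypothetical helper)."""
    # Imported lazily so defining the helper doesn't require the library.
    from ydata_profiling import ProfileReport

    profile = ProfileReport(df, title="Titanic Data Report")
    profile.to_file(path)  # a .html suffix writes an HTML report
    return path


# Usage, in an environment where ydata_profiling is installed:
# save_report(df)
```

The resulting HTML file can be opened without a running notebook, which is handy when the widget or iframe views aren’t an option.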
The generated report includes sections on:
- Overview: Summary statistics, dataset size, and variable types.
- Variables: Each variable’s (column’s) distributions, missing values, and unique counts.
- Interactions: Visualizations to explore potential relationships between variables.
- Correlations: Statistical measures of how variables relate to each other.
- Missing Values: Detailed analysis of missing data in the dataset.
- Sample: A preview of the dataset rows.
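To get a feel for the kind of numbers behind a few of those sections, here is a rough pandas-only sketch on a tiny made-up DataFrame. This is not how YData Profiling builds its report; it only illustrates the sort of statistics that show up under Overview, Variables, Correlations, and Missing Values. The toy data and the quick_summary helper are invented for the example.

```python
import pandas as pd

# Tiny made-up sample shaped like a few Titanic columns.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [22.0, 38.0, None, 35.0],
    "fare": [7.25, 71.28, 7.92, 8.05],
    "sex": ["male", "female", "female", "male"],
})


def quick_summary(df: pd.DataFrame) -> dict:
    """Collect a few statistics similar to what a profiling report surfaces."""
    numeric = df.select_dtypes("number")  # correlations only make sense here
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "missing": df.isna().sum().to_dict(),   # missing values per column
        "unique": df.nunique().to_dict(),       # distinct non-missing values
        "pearson": numeric.corr(),              # pairwise Pearson correlation
    }


summary = quick_summary(toy)
print(summary["missing"])
print(summary["unique"])
print(summary["pearson"])
```

The profiling report does all of this (and a lot more) in one call, which is why I reach for it first.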
You can even customize it. I usually leave all of these at their defaults, but this gives you an example.
profile = ProfileReport(
    df,
    title='Titanic Data Report',
    explorative=True,
    dark_mode=True,  # Enable dark mode for the report
    correlations={
        "pearson": {"calculate": False},  # Disable Pearson correlation
        "spearman": {"calculate": True},  # Enable Spearman correlation
        "kendall": {"calculate": True},  # Enable Kendall correlation
    },
    duplicates={"calculate": False},  # Disable duplicate row detection
    interactions={"continuous": True},  # Enable interactions for continuous variables
    missing_diagrams={
        "bar": True,  # Show bar chart for missing values
        "matrix": True,  # Show matrix of missing values
        "heatmap": True,  # Show heatmap of missing value correlations
        "dendrogram": True,  # Show dendrogram of missing value correlations
    },
    samples={"head": 10, "tail": 10},  # Show first and last 10 rows of the dataset
    sensitive=True,  # Treat all variables as sensitive, minimizing detailed output
    sort="ascending",  # Sort variables in ascending order
    pool_size=2,  # Number of processes to use for parallel processing
    variables={
        "descriptions": {
            # Keys need to match the DataFrame's column names (lowercase here)
            "age": "Age of the passenger",
            "sex": "Gender of the passenger",
            # Add custom descriptions for variables
        }
    },
    minimal=True,  # Generate a minimal report for faster rendering
    progress_bar=True,  # Display a progress bar during report generation
    infer_dtypes=False,  # Disable automatic datatype inference
    html={"style": {"full_width": True, "theme": "flatly"}},  # Apply full width and 'flatly' theme for HTML output
)
profile.to_notebook_iframe()