YData Profiling used to be known as pandas-profiling, but it has moved to a new name and a new home. I talked about it in my post on cleaning DNA splice junction data, but since it was kind of buried in that post and the name has changed, I thought I would do a quick tutorial that covers only YData Profiling. There isn’t much to demo here because it does so much of the work for you, but I’ll still go over it.

ydata_profiling is a Python library that generates comprehensive reports from a pandas or Spark DataFrame. These reports include detailed exploratory data analysis, providing insights into missing data, variable distributions, correlations, and much more. It’s a powerful tool for initial data investigation and can save a lot of time in the data understanding phase of a project.

import pandas as pd
import seaborn as sns
from ydata_profiling import ProfileReport  # pip install ydata_profiling if you haven't installed it

Let’s grab the Titanic dataset.

df = sns.load_dataset('titanic')
df
     survived  pclass     sex   age  sibsp  parch     fare  embarked   class    who  adult_male  deck  embark_town  alive  alone
  0         0       3    male  22.0      1      0   7.2500         S   Third    man        True   NaN  Southampton     no  False
  1         1       1  female  38.0      1      0  71.2833         C   First  woman       False     C    Cherbourg    yes  False
  2         1       3  female  26.0      0      0   7.9250         S   Third  woman       False   NaN  Southampton    yes   True
  3         1       1  female  35.0      1      0  53.1000         S   First  woman       False     C  Southampton    yes  False
  4         0       3    male  35.0      0      0   8.0500         S   Third    man        True   NaN  Southampton     no   True
...       ...     ...     ...   ...    ...    ...      ...       ...     ...    ...         ...   ...          ...    ...    ...
886         0       2    male  27.0      0      0  13.0000         S  Second    man        True   NaN  Southampton     no   True
887         1       1  female  19.0      0      0  30.0000         S   First  woman       False     B  Southampton    yes   True
888         0       3  female   NaN      1      2  23.4500         S   Third  woman       False   NaN  Southampton     no  False
889         1       1    male  26.0      0      0  30.0000         C   First    man        True     C    Cherbourg    yes   True
890         0       3    male  32.0      0      0   7.7500         Q   Third    man        True   NaN   Queenstown     no   True

891 rows × 15 columns

# Generate the profile report
profile = ProfileReport(df, title='Titanic Data Report', explorative=False)

To view the report in a notebook, you can use profile.to_widgets(). That doesn’t display well on the blog, so instead I’ll use profile.to_notebook_iframe().

# profile.to_widgets()  # doesn't render well on the blog, but it's what I recommend in a notebook
profile.to_notebook_iframe()

The generated report includes sections on:

  • Overview: Summary statistics, dataset size, and variable types.
  • Variables: Each variable’s (column’s) distributions, missing values, and unique counts.
  • Interactions: Visualizations to explore potential relationships between variables.
  • Correlations: Statistical measures of how variables relate to each other.
  • Missing Values: Detailed analysis of missing data in the dataset.
  • Sample: A preview of the dataset rows.
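
To make those sections a little more concrete, here is a rough sketch of the kind of checks they summarize, done by hand with plain pandas. This is not how ydata_profiling computes anything internally, and the `toy` frame below is a made-up stand-in (hypothetical values, not the real Titanic data) so the snippet runs without downloading anything.

```python
import pandas as pd

# Tiny hand-made frame; the values are hypothetical, not real Titanic rows
toy = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "deck": [None, "C", None, "C"],
})

# Overview: dataset size and variable types
n_rows, n_cols = toy.shape
dtypes = toy.dtypes

# Variables / Missing Values: per-column missing and unique counts
missing_counts = toy.isna().sum()
unique_counts = toy.nunique()  # missing values are not counted as a unique value

# Correlations: pairwise Pearson correlation between the numeric columns
correlations = toy[["age", "fare"]].corr()

# Sample: a preview of the first rows
preview = toy.head(2)

print(f"{n_rows} rows x {n_cols} columns")
print(missing_counts)
print(unique_counts)
print(correlations)
```

The report does all of this (and a lot more) for every column at once, which is why I reach for it before writing any of these checks myself.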

You can also customize the report. I usually leave everything at the defaults, but here is an example of the options available.

profile = ProfileReport(df, 
                        title='Titanic Data Report', 
                        explorative=True,
                        dark_mode=True,  # Enable dark mode for the report
                        correlations={
                            "pearson": {"calculate": False},  # Disable Pearson correlation
                            "spearman": {"calculate": True},  # Enable Spearman correlation
                            "kendall": {"calculate": True}    # Enable Kendall correlation
                        },
                        duplicates={"calculate": False},  # Disable duplicate row detection
                        interactions={"continuous": True},  # Enable interactions for continuous variables
                        missing_diagrams={
                            "bar": True,  # Show bar chart for missing values
                            "matrix": True,  # Show matrix of missing values
                            "heatmap": True,  # Show heatmap of missing value correlations
                            "dendrogram": True  # Show dendrogram of missing value correlations
                        },
                        samples={"head": 10, "tail": 10},  # Show first and last 10 rows of the dataset
                        sensitive=True,  # Treat all variables as sensitive, minimizing detailed output
                        sort="ascending",  # Sort variables in ascending order
                        pool_size=2,  # Number of processes to use for parallel processing
                        variables={
                            "descriptions": {
                                "age": "Age of the passenger",  # keys must match the DataFrame's column names
                                "sex": "Gender of the passenger",
                                # Add custom descriptions for other variables here
                            }
                        },
                        minimal=True,  # Generate a minimal report for faster rendering
                        progress_bar=True,  # Display a progress bar during report generation
                        infer_dtypes=False,  # Disable automatic datatype inference
                        html={"style": {"full_width": True, "theme": "flatly"}}  # Apply full width and 'flatly' theme for HTML output
                        )

profile.to_notebook_iframe()