Dealing with Skewed Data

In this post I want to talk about some techniques for dealing with skewed data, especially left-skewed data. Left-skewed data is a bit of a rarity. It’s something you don’t see very often, kind of like a left-handed unicorn. It can also be difficult to work with if you’re not prepared.

Table of Contents

Right-skewed Data
- Transforming Right-skewed data
Left-skewed Data
- Transforming Left-skewed data

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import skewtest
from sklearn.datasets import load_breast_cancer

Let’s load some data from the breast cancer dataset.

data = load_breast_cancer()

feature_df = pd.DataFrame(data.data, columns=data.feature_names)
label_df = pd.DataFrame(data.target).rename(columns={0: "Diagnosis"})

df = pd.concat([feature_df, label_df], axis=1)

df.head()

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
0	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	0.07871	...	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	0.05999	...	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	0.2597	0.09744	...	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
4	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	0.1809	0.05883	...	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678

5 rows × 31 columns

Let’s run a skewtest on the data to see how skewed it is.

skew_results = skewtest(df)
skew_results

SkewtestResult(statistic=array([ 7.9622136 ,  5.88215201,  8.27259482, 11.74893638,  4.290775  ,
        9.46629397, 10.59171403,  9.35871813,  6.45269024, 10.09216315,
       16.58088149, 11.75210248, 17.44633464, 21.14164598, 14.32036053,
       12.82429991, 20.62412273, 10.80823381, 13.91279644, 18.49190706,
        8.96247615,  4.64939677,  9.11009444, 12.65331114,  3.93417231,
       10.9492919 ,  9.23843169,  4.60113196, 10.75517885, 11.82359913,
       -4.90193592]), pvalue=array([1.68988430e-15, 4.04966147e-09, 1.31074540e-16, 7.15132884e-32,
       1.78050645e-05, 2.89948577e-21, 3.25574965e-26, 8.07099084e-21,
       1.09881815e-10, 5.98361587e-24, 9.58130444e-62, 6.88833407e-32,
       3.67043537e-68, 3.29387877e-99, 1.63273983e-46, 1.19864358e-37,
       1.66727337e-94, 3.14670056e-27, 5.29673293e-44, 2.39917722e-76,
       3.17467671e-19, 3.32907258e-06, 8.23102934e-20, 1.07257301e-36,
       8.34838797e-05, 6.69690805e-28, 2.50136485e-20, 4.20201106e-06,
       5.60252308e-27, 2.94779180e-32, 9.48967940e-07]))

It’s mostly right skewed. Only one is left skewed (the negative value

Right-skewed Data

There are lots of ways to deal with right-skewed data, so let’s start there. Let’s work with the most skewed example.

right_skewed_index = np.argmax(skew_results.statistic)

max_skewed_name = df.iloc[:, right_skewed_index].name
max_skewed_name

'area error'

right_skewed_data = df[max_skewed_name]

plt.hist(right_skewed_data);

png

That’s a nice right-skewed distribution. Many machine learning techniques have difficulty modeling such distributions. However, certain models like mixture density networks are able to handle arbitrary distributions by using a combination of multiple probability distributions. Additionally, there are models such as tree-based models, like Random Forest, XGBoost and LightGBM, which are not affected by the distribution of the data, and can model the data directly. These models are considered as non-parametric models, they don’t assume a specific distribution of the data and they can capture complex non-linear relationship in the data.

However, the simplest thing to do is often to transform the data so that it is normally distributed.

Transforming Right-skewed data

My first approach is to take the log. This is often all you need.

plt.hist(np.log(right_skewed_data));

png

Although you can also use the square root.

plt.hist(np.sqrt(right_skewed_data));

png

If the dataset has outliers that are really far away that you don’t want to remove, you might think about taking the cubic root. This data is so highly skewed that it might work pretty well.

plt.hist(np.cbrt(right_skewed_data));

png

Nice! I would still go with np.log but this would work as well.

Left-skewed Data

OK, now let’s look at the left-skewed data. Though you might think the same techniques would apply, this isn’t the case.

Let’s look at our skewtest results again.

skew_results.statistic

array([ 7.9622136 ,  5.88215201,  8.27259482, 11.74893638,  4.290775  ,
        9.46629397, 10.59171403,  9.35871813,  6.45269024, 10.09216315,
       16.58088149, 11.75210248, 17.44633464, 21.14164598, 14.32036053,
       12.82429991, 20.62412273, 10.80823381, 13.91279644, 18.49190706,
        8.96247615,  4.64939677,  9.11009444, 12.65331114,  3.93417231,
       10.9492919 ,  9.23843169,  4.60113196, 10.75517885, 11.82359913,
       -4.90193592])

left_skewed_index = np.argmin(skew_results.statistic)

df.iloc[:, left_skewed_index].name

'Diagnosis'

Oh. The “left-skewed” data is the label. This probably isn’t what we’re looking for. Let’s plot it anyway.

df.iloc[:, left_skewed_index].plot.hist();

png

OK, you can see why it showed up as left-skewed in the test, but this just goes to show why it’s important to actually look at the histogram and not only a skewness test. So in this entire dataset, there isn’t a single example of a left-skewed feature. This goes back to what I was saying about it being less common.

I think the best thing to do is just make sure left-skewed data for our current dataset. This is a bit of a spoiler for the final answer, but, oh well.

We can flip the skew by throwing a minus sign in front of the data. I also add the max value to keep the numbers positive, and an addition 1 to keep everything at least 1 (in case I need to log transform it later). So the final flip ends up looking like this:

left_skewed_data = 1 + int(right_skewed_data.max()) - right_skewed_data

plt.hist(left_skewed_data);

png

There we go! The mythical left-skewed data! (I swear you do see it in real life sometimes.)

Transforming Left-skewed data

Let’s try the transforms as before.

plt.hist(np.log(left_skewed_data));

png

Wow, that looks awful. Let’s try another.

plt.hist(np.sqrt(left_skewed_data));

png

Still worse than the original.

OK, why don’t we try the reverse? If log and sqrt don’t work, what about exp and square?

plt.hist(np.exp(left_skewed_data));

png

It’s… somehow the opposite yet the worst of all? Note the scale is 1e232.

plt.hist(np.square(left_skewed_data));

png

At least that’s not abominable! It’s still distinctly left-skewed, but at least it still looks like something.

So what’s the solution here? As you might have guessed from how we got the data, the best solution is to throw a negative sign in front of it (and do the other steps) so that’s it’s right-skewed. Then you can treat it as normal right-skewed data. As we saw before, switching the skew looks like this:

right_skewed_data = 1 + int(left_skewed_data.max()) - left_skewed_data

plt.hist(right_skewed_data);

png

There we go. There’s our right-skewed data back. From here we can do any of the techniques described earlier.