This is Part II of two posts demonstrating how to get test metrics from a highly unbalanced dataset. In Part I, I showed how you could theoretically estimate the precision, recall, and F1 score on highly unbalanced data. In this part, I’ll do the same thing with a real dataset: the Adult Income Dataset.

import numpy as np
import pandas as pd
from hyperopt import STATUS_OK, Trials, fmin, hp, space_eval, tpe
from hyperopt.pyll.base import scope
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

Let’s download the dataset using sklearn’s fetch_openml.

# target_column=None keeps the "class" column inside dataset.data
dataset = fetch_openml("adult", version=2, target_column=None, as_frame=True)
df = dataset.data

Let’s look at what we’ve got.

df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
0 25.0 Private 226802.0 11th 7.0 Never-married Machine-op-inspct Own-child Black Male 0.0 0.0 40.0 United-States <=50K
1 38.0 Private 89814.0 HS-grad 9.0 Married-civ-spouse Farming-fishing Husband White Male 0.0 0.0 50.0 United-States <=50K
2 28.0 Local-gov 336951.0 Assoc-acdm 12.0 Married-civ-spouse Protective-serv Husband White Male 0.0 0.0 40.0 United-States >50K
3 44.0 Private 160323.0 Some-college 10.0 Married-civ-spouse Machine-op-inspct Husband Black Male 7688.0 0.0 40.0 United-States >50K
4 18.0 NaN 103497.0 Some-college 10.0 Never-married NaN Own-child White Female 0.0 0.0 30.0 United-States <=50K
df['class'].value_counts()
<=50K    37155
>50K     11687
Name: class, dtype: int64

Just under a quarter of the records are people with salaries above $50K. That’s unbalanced, but for this post we want a much more extreme example, so we’ll make the >50K class really rare.

Caveat: If I were really trying to build an algorithm to classify this data, I would go over the data in great detail. I can already see that education is listed as 11th in one case and HS-grad in another, so it needs to be cleaned up. But in this case, I’m going to ignore all of that and jump right to encoding the data so it can be used in a model. It would also be good practice to split off test data before doing label encoding, but we’ll skip that as well.

df = df.apply(LabelEncoder().fit_transform)
df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
0 8 3 19329 1 6 4 6 3 2 1 0 0 39 38 0
1 21 3 4212 11 8 2 4 0 4 1 0 0 49 38 0
2 11 1 25340 7 11 2 10 0 4 1 0 0 39 38 1
3 27 3 11201 15 9 2 6 0 2 1 98 0 39 38 1
4 1 8 5411 15 9 4 14 3 4 0 0 0 29 38 0

Now let’s split up the data by class so we can make the imbalance more extreme.

rich_df = df[df['class'].astype(bool)]
poor_df = df[~df['class'].astype(bool)]

We’ll split off some of the rich examples so we can add them in later.

split_rich_df = rich_df.sample(50)
split_rich_df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
14269 38 5 17156 9 12 2 11 0 4 1 0 0 39 38 1
20938 36 3 6579 9 12 2 3 0 4 1 0 0 44 38 1
20702 36 3 4131 11 8 2 3 0 4 1 122 0 39 38 1
39547 49 5 27502 10 15 2 11 0 4 1 0 79 24 38 1
22756 21 3 13081 0 5 4 9 1 4 1 0 89 87 38 1

We’ll remove those from the dataset so we don’t have to worry about data leakage.

rich_df = rich_df.drop(split_rich_df.index)
len(poor_df)
37155

Now let’s create our dataset with a really high imbalance: 1,000:1 (30 positives to 30,000 negatives).

num_pos = 30
num_neg = 30000
representative_rich_df = rich_df.sample(num_pos, random_state=42)
representative_poor_df = poor_df.sample(num_neg, random_state=42)
rich_df = rich_df.drop(representative_rich_df.index)
poor_df = poor_df.drop(representative_poor_df.index)
df = pd.concat([representative_rich_df, representative_poor_df])

Now df contains all 30 positives and all 30,000 negatives.

df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
15455 13 3 14744 11 8 2 10 0 4 1 0 0 39 38 1
18855 17 3 17049 9 12 2 9 0 4 1 98 0 49 38 1
13376 26 3 10902 11 8 2 5 5 4 0 0 0 47 38 1
1945 19 5 7851 15 9 2 11 0 4 1 81 0 49 38 1
3875 7 3 16177 8 10 2 9 0 4 1 0 0 39 38 1

This looks adequate for feeding into a model.

df_train, df_test = train_test_split(df, random_state=42)
df_test.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
42029 4 8 3469 15 9 4 14 3 4 1 0 0 34 38 0
27469 33 3 8739 11 8 2 6 2 1 0 0 0 39 2 0
39161 26 3 26001 2 7 2 0 0 2 1 0 0 39 38 0
13781 7 5 5851 15 9 2 11 0 4 1 0 0 39 38 0
19096 33 3 20253 4 2 2 13 0 4 1 0 0 36 25 0

Let’s record the number of positives and negatives from our test set to reuse later.

num_pos_test = np.sum(df_test['class'] == 1)
num_neg_test = np.sum(df_test['class'] == 0)

Let’s add the rest of the rich examples to the train set. It won’t be perfectly balanced, but it will be close enough.

df_train_balanced = pd.concat([df_train, rich_df])
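If you want to sanity-check that claim, a quick look at the class counts (this check isn’t in the original notebook) should show roughly one positive for every two negatives, rather than the extreme ratio we built for the test set:

# Sanity check (not in the original post): class counts after adding back the
# remaining rich examples.
print(df_train_balanced['class'].value_counts())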

Now let’s shuffle the data.

df_train_balanced = df_train_balanced.sample(frac=1, random_state=42)
y_train = df_train_balanced.pop('class')
y_test = df_test.pop('class')
X_train = df_train_balanced
X_test = df_test

To train a model, I’m just going to copy code from the High Performance Models on Tabular Data post.

def train_xgb(params):
    """Hyperopt objective: train an XGBoost model and return the negative test accuracy."""
    clf = XGBClassifier(**params, eval_metric='logloss')
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred > 0.5)

    return {'loss': -accuracy, 'status': STATUS_OK}
num_trials = 50
xgb_space = {
    'max_depth': scope.int(hp.quniform('max_depth', 3, 18, 1)),
    'gamma': hp.uniform('gamma', 1, 9),
    'reg_alpha': hp.quniform('reg_alpha', 40, 180, 1),
    'reg_lambda': hp.uniform('reg_lambda', 0, 1),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
    'min_child_weight': hp.quniform('min_child_weight', 0, 10, 1),
    'n_estimators': 180,
    'seed': 0
}
trials = Trials()

fmin(fn = train_xgb,
    space = xgb_space,
    algo = tpe.suggest,
    max_evals = num_trials,
    trials = trials,
    rstate=np.random.default_rng(42),
    verbose=False)
{'colsample_bytree': 0.9419741621278979,
 'gamma': 5.7022639336370755,
 'max_depth': 13.0,
 'min_child_weight': 9.0,
 'reg_alpha': 129.0,
 'reg_lambda': 0.551693541360825}
best_hyperparams = space_eval(xgb_space, trials.argmin)
best_hyperparams
{'colsample_bytree': 0.9419741621278979,
 'gamma': 5.7022639336370755,
 'max_depth': 13,
 'min_child_weight': 9.0,
 'n_estimators': 180,
 'reg_alpha': 129.0,
 'reg_lambda': 0.551693541360825,
 'seed': 0}
xgb_clf = XGBClassifier(**best_hyperparams)

xgb_clf.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1,
              colsample_bytree=0.9419741621278979, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None,
              gamma=5.7022639336370755, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=13, max_leaves=0,
              min_child_weight=9.0, missing=nan, monotone_constraints='()',
              n_estimators=180, n_jobs=0, num_parallel_tree=1, predictor='auto',
              random_state=0, reg_alpha=129.0, reg_lambda=0.551693541360825, ...)

Now we’ve got a trained model. Let’s see how well it does.

xgb_preds = xgb_clf.predict(X_test)

def get_metrics(y_test, y_pred):
    """
    Print standard sklearn metrics
    """
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
    print(f"Precision: {precision_score(y_test, y_pred):.2%}")
    print(f"Recall: {recall_score(y_test, y_pred):.2%}")
    print(f"F1: {f1_score(y_test, y_pred):.2%}")
get_metrics(y_test, xgb_preds)
Accuracy: 89.88%
Precision: 0.65%
Recall: 83.33%
F1: 1.30%

We have some metrics now, but how much can we trust them? Let’s look at the number of positive examples in the test set.

num_pos_test = np.sum(y_test)
print(f"{num_pos_test=}")
num_pos_test=6

Only six! So all of our metrics could be pretty far off their true values; there simply aren’t enough positives to get a good measure of model quality.
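To put a rough number on that, here’s a quick back-of-the-envelope sketch (not part of the original analysis) treating recall as a binomial proportion. Five of the six positives were caught, so the 83.33% recall estimate carries an enormous standard error:

# Rough illustration (assumption: recall modeled as a binomial proportion).
recall_estimate = 5 / 6          # 83.33%, i.e. 5 of the 6 positives were caught
n_positives = 6
std_err = np.sqrt(recall_estimate * (1 - recall_estimate) / n_positives)
print(f"Recall standard error: {std_err:.1%}")  # roughly 15%

A single flipped prediction would move recall by almost 17 percentage points, so the exact value shouldn’t be taken too seriously.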

Just like we did in Part I, we’ll add additional test cases. Here, they’ll be the 50 rich examples we split off earlier. This will reduce the error on our metrics because we’ll have more samples.

split_labels = split_rich_df.pop('class')
X_test_added = pd.concat([X_test, split_rich_df])
y_test_added = pd.concat([y_test, split_labels])
xgb_preds_added = xgb_clf.predict(X_test_added)

get_metrics(y_test_added, xgb_preds_added)

Accuracy: 89.71%
Precision: 4.65%
Recall: 66.07%
F1: 8.69%

We got metrics, but they’re not apples-to-apples with the earlier ones. In particular, the precision (and therefore the F1 score) is far higher, because we’ve changed the ratio of positives to negatives in the test set. To get back estimates for the original distribution, we’ll have to use the tricks from Part I.
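To see why, note that precision can be written directly in terms of the per-class rates and the class counts, so changing the positive-to-negative ratio changes precision even when the model itself hasn’t changed. A minimal sketch (not from the original post):

# Precision as a function of the per-class rates and the test-set composition:
#   TP = num_pos * tpr        (positives correctly flagged)
#   FP = num_neg * (1 - tnr)  (negatives incorrectly flagged)
#   precision = TP / (TP + FP)
def expected_precision(num_pos, num_neg, tpr, tnr):
    tp = num_pos * tpr
    fp = num_neg * (1 - tnr)
    return tp / (tp + fp)

# With rates like the ones we measure below (tpr ≈ 0.66, tnr ≈ 0.90), a 1,000:1
# imbalance pushes precision well under 1%:
print(f"{expected_precision(1, 1000, 0.66, 0.90):.2%}")  # 0.66%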

def get_model_stats(y_true, y_pred):
    """Return the fraction of positives and the fraction of negatives predicted correctly."""
    pos_indices = [i for i, x in enumerate(y_true) if x == 1]
    neg_indices = [i for i, x in enumerate(y_true) if x == 0]
    preds_for_pos_labels = [y_pred[i] for i in pos_indices]
    preds_for_neg_labels = [y_pred[i] for i in neg_indices]
    # Fraction of true positives predicted 1 (recall) and true negatives predicted 0.
    percent_pos_correct = sum(preds_for_pos_labels) / len(preds_for_pos_labels)
    percent_neg_correct = np.sum(np.array(preds_for_neg_labels) == 0) / len(preds_for_neg_labels)
    return percent_pos_correct, percent_neg_correct
percent_pos_correct, percent_neg_correct = get_model_stats(y_test_added, xgb_preds_added)
print(f"{percent_pos_correct=}")
print(f"{percent_neg_correct=}")
percent_pos_correct=0.6607142857142857
percent_neg_correct=0.8988269794721407

These are the per-class stats for the model: the fraction of positives it gets right and the fraction of negatives it gets right. If we want to convert them into precision, recall, and F1 for a specific class distribution, we can recreate predictions at that distribution and score them.

def np_float_to_int(x):
    """Round a float to the nearest integer."""
    return np.rint(x).astype(int)

def get_y_pred(num_pos, num_neg, percent_pos_correct, percent_neg_correct):
    """Recreate a prediction list for num_pos positives followed by num_neg negatives."""
    return (
        [1] * np_float_to_int(num_pos * percent_pos_correct)              # true positives
        + [0] * np_float_to_int(num_pos - num_pos * percent_pos_correct)  # false negatives
        + [0] * np_float_to_int(num_neg * percent_neg_correct)            # true negatives
        + [1] * np_float_to_int(num_neg - num_neg * percent_neg_correct)  # false positives
    )
y_pred_recreated = get_y_pred(num_pos_test, num_neg_test, percent_pos_correct, percent_neg_correct)
y_test_recreated = [1] * num_pos_test + [0] * num_neg_test
get_metrics(y_test_recreated, y_pred_recreated)
Accuracy: 89.86%
Precision: 0.52%
Recall: 66.67%
F1: 1.04%

These are better estimates of the model’s performance on the original distribution, and they’re more precise than what we’d get from the true unbalanced test set alone.
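To put a rough number on “more precise”: the augmented test set has 56 positives instead of 6, so the binomial error bar on the positive-class rate shrinks by about a factor of sqrt(56/6) ≈ 3. A quick sketch, using the same approximation as before (and again not part of the original notebook):

# Illustrative only: the standard error on the positive-class rate scales as 1/sqrt(n_pos).
for n_pos in (6, 56):
    se = np.sqrt(percent_pos_correct * (1 - percent_pos_correct) / n_pos)
    print(f"n_pos={n_pos}: std err {se:.1%}")  # about 19% with 6 positives, about 6% with 56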