This is Part II of two posts demonstrating how to get test metrics from a highly unbalanced dataset. In Part I, I showed how you could theoretically estimate the precision, recall, and F1 score on highly unbalanced data. In this part, I’ll do the same thing with a real dataset: the Adult Income Dataset.

import numpy as np
import pandas as pd
from hyperopt import STATUS_OK, Trials, fmin, hp, space_eval, tpe
from hyperopt.pyll.base import scope
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

Let’s download the dataset using sklearn’s fetch_openml.

# target_column=None keeps the "class" column inside dataset.data
dataset = fetch_openml("adult", version=2, target_column=None, as_frame=True)
df = dataset.data

Let’s look at what we’ve got.

df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
0 25.0 Private 226802.0 11th 7.0 Never-married Machine-op-inspct Own-child Black Male 0.0 0.0 40.0 United-States <=50K
1 38.0 Private 89814.0 HS-grad 9.0 Married-civ-spouse Farming-fishing Husband White Male 0.0 0.0 50.0 United-States <=50K
2 28.0 Local-gov 336951.0 Assoc-acdm 12.0 Married-civ-spouse Protective-serv Husband White Male 0.0 0.0 40.0 United-States >50K
3 44.0 Private 160323.0 Some-college 10.0 Married-civ-spouse Machine-op-inspct Husband Black Male 7688.0 0.0 40.0 United-States >50K
4 18.0 NaN 103497.0 Some-college 10.0 Never-married NaN Own-child White Female 0.0 0.0 30.0 United-States <=50K
df['class'].value_counts()
<=50K    37155
>50K     11687
Name: class, dtype: int64

Just under a quarter of the records are people with salaries above $50K. That’s unbalanced, but for this post we want a much more extreme example, so we’ll make the >50K class really rare.

Caveat: If I were really trying to build an algorithm to classify this data, I would go over the data in great detail. I can already see that education is listed as 11th in one case and HS-grad in another, so it needs to be cleaned up. But in this case, I’m going to ignore all of that and jump right to encoding the data so it can be used in a model. It would also be good practice to split off test data before doing label encoding, but we’ll skip that as well.

df = df.apply(LabelEncoder().fit_transform)
df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
0 8 3 19329 1 6 4 6 3 2 1 0 0 39 38 0
1 21 3 4212 11 8 2 4 0 4 1 0 0 49 38 0
2 11 1 25340 7 11 2 10 0 4 1 0 0 39 38 1
3 27 3 11201 15 9 2 6 0 2 1 98 0 39 38 1
4 1 8 5411 15 9 4 14 3 4 0 0 0 29 38 0

Now let’s split up the data by class so we can make the imbalance more extreme.

rich_df = df[df['class'].astype(bool)]
poor_df = df[~df['class'].astype(bool)]

We’ll split off some of the rich examples so we can add them in later.

split_rich_df = rich_df.sample(50)
split_rich_df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
14269 38 5 17156 9 12 2 11 0 4 1 0 0 39 38 1
20938 36 3 6579 9 12 2 3 0 4 1 0 0 44 38 1
20702 36 3 4131 11 8 2 3 0 4 1 122 0 39 38 1
39547 49 5 27502 10 15 2 11 0 4 1 0 79 24 38 1
22756 21 3 13081 0 5 4 9 1 4 1 0 89 87 38 1

We’ll remove those from the dataset so we don’t have to worry about data leakage.

rich_df = rich_df.drop(split_rich_df.index)
len(poor_df)
37155

Now let’s create our dataset with a really high imbalance: 1,000:1 (30 positives to 30,000 negatives).

num_pos = 30
num_neg = 30000
representative_rich_df = rich_df.sample(num_pos, random_state=42)
representative_poor_df = poor_df.sample(num_neg, random_state=42)
rich_df = rich_df.drop(representative_rich_df.index)
poor_df = poor_df.drop(representative_poor_df.index)
df = pd.concat([representative_rich_df, representative_poor_df])

Now df contains all 30 positives and all 30,000 negatives.

df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
15455 13 3 14744 11 8 2 10 0 4 1 0 0 39 38 1
18855 17 3 17049 9 12 2 9 0 4 1 98 0 49 38 1
13376 26 3 10902 11 8 2 5 5 4 0 0 0 47 38 1
1945 19 5 7851 15 9 2 11 0 4 1 81 0 49 38 1
3875 7 3 16177 8 10 2 9 0 4 1 0 0 39 38 1

This looks adequate for feeding into a model.

df_train, df_test = train_test_split(df, random_state=42)
df_test.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
42029 4 8 3469 15 9 4 14 3 4 1 0 0 34 38 0
27469 33 3 8739 11 8 2 6 2 1 0 0 0 39 2 0
39161 26 3 26001 2 7 2 0 0 2 1 0 0 39 38 0
13781 7 5 5851 15 9 2 11 0 4 1 0 0 39 38 0
19096 33 3 20253 4 2 2 13 0 4 1 0 0 36 25 0

Let’s record the number of positives and negatives from our test set to reuse later.

num_pos_test = np.sum(df_test['class'] == 1)
num_neg_test = np.sum(df_test['class'] == 0)

Let’s add the rest of the rich examples to the train set. It won’t be perfectly balanced, but it will be close enough.

df_train_balanced = pd.concat([df_train, rich_df])
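If you want to sanity-check that claim, a quick look at the class counts (this check isn’t in the original notebook) should show roughly one positive for every two negatives, rather than the extreme ratio we built for the test set:

# Sanity check (not in the original post): class counts after adding back the
# remaining rich examples.
print(df_train_balanced['class'].value_counts())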

Now let’s shuffle the data.

df_train_balanced = df_train_balanced.sample(frac=1, random_state=42)
y_train = df_train_balanced.pop('class')
y_test = df_test.pop('class')
X_train = df_train_balanced
X_test = df_test

To train a model, I’m just going to copy code from the High Performance Models on Tabular Data post.

def train_xgb(params):
    """Hyperopt objective: train an XGBoost model and return the negative test accuracy."""
    clf = XGBClassifier(**params, eval_metric='logloss')
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred > 0.5)

    return {'loss': -accuracy, 'status': STATUS_OK}
num_trials = 50
xgb_space = {
    'max_depth': scope.int(hp.quniform('max_depth', 3, 18, 1)),
    'gamma': hp.uniform('gamma', 1, 9),
    'reg_alpha': hp.quniform('reg_alpha', 40, 180, 1),
    'reg_lambda': hp.uniform('reg_lambda', 0, 1),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
    'min_child_weight': hp.quniform('min_child_weight', 0, 10, 1),
    'n_estimators': 180,
    'seed': 0
}
trials = Trials()

fmin(fn = train_xgb,
    space = xgb_space,
    algo = tpe.suggest,
    max_evals = num_trials,
    trials = trials,
    rstate=np.random.default_rng(42),
    verbose=False)
{'colsample_bytree': 0.9419741621278979,
 'gamma': 5.7022639336370755,
 'max_depth': 13.0,
 'min_child_weight': 9.0,
 'reg_alpha': 129.0,
 'reg_lambda': 0.551693541360825}
best_hyperparams = space_eval(xgb_space, trials.argmin)
best_hyperparams
{'colsample_bytree': 0.9419741621278979,
 'gamma': 5.7022639336370755,
 'max_depth': 13,
 'min_child_weight': 9.0,
 'n_estimators': 180,
 'reg_alpha': 129.0,
 'reg_lambda': 0.551693541360825,
 'seed': 0}
xgb_clf = XGBClassifier(**best_hyperparams)

xgb_clf.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1,
              colsample_bytree=0.9419741621278979, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None,
              gamma=5.7022639336370755, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=13, max_leaves=0,
              min_child_weight=9.0, missing=nan, monotone_constraints='()',
              n_estimators=180, n_jobs=0, num_parallel_tree=1, predictor='auto',
              random_state=0, reg_alpha=129.0, reg_lambda=0.551693541360825, ...)

Now we’ve got a trained model. Let’s see how well it does.

xgb_preds = xgb_clf.predict(X_test)

def get_metrics(y_test, y_pred):
    """
    Print standard sklearn metrics
    """
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
    print(f"Precision: {precision_score(y_test, y_pred):.2%}")
    print(f"Recall: {recall_score(y_test, y_pred):.2%}")
    print(f"F1: {f1_score(y_test, y_pred):.2%}")
get_metrics(y_test, xgb_preds)
Accuracy: 89.88%
Precision: 0.65%
Recall: 83.33%
F1: 1.30%

We have some metrics now, but how much can we trust them? Let’s look at the number of positive examples in the test set.

num_pos_test = np.sum(y_test)
print(f"{num_pos_test=}")
num_pos_test=6

Only six! So all of our metrics could be pretty far off their true values; there simply aren’t enough positives to get a good measure of model quality.
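To put a rough number on that, here’s a quick back-of-the-envelope sketch (not part of the original analysis) treating recall as a binomial proportion. Five of the six positives were caught, so the 83.33% recall estimate carries an enormous standard error:

# Rough illustration (assumption: recall modeled as a binomial proportion).
recall_estimate = 5 / 6          # 83.33%, i.e. 5 of the 6 positives were caught
n_positives = 6
std_err = np.sqrt(recall_estimate * (1 - recall_estimate) / n_positives)
print(f"Recall standard error: {std_err:.1%}")  # roughly 15%

A single flipped prediction would move recall by almost 17 percentage points, so the exact value shouldn’t be taken too seriously.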

Just like we did in Part I, we’ll add additional test cases. Here, they’ll be the 50 rich examples we split off earlier. This will reduce the error on our metrics because we’ll have more samples.

split_labels = split_rich_df.pop('class')
X_test_added = pd.concat([X_test, split_rich_df])
y_test_added = pd.concat([y_test, split_labels])
xgb_preds_added = xgb_clf.predict(X_test_added)

get_metrics(y_test_added, xgb_preds_added)

Accuracy: 89.71%
Precision: 4.65%
Recall: 66.07%
F1: 8.69%

We got metrics, but they’re not apples-to-apples with the earlier ones. In particular, the precision (and therefore the F1 score) is far higher, because we’ve changed the ratio of positives to negatives in the test set. To get back estimates for the original distribution, we’ll have to use the tricks from Part I.
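To see why, note that precision can be written directly in terms of the per-class rates and the class counts, so changing the positive-to-negative ratio changes precision even when the model itself hasn’t changed. A minimal sketch (not from the original post):

# Precision as a function of the per-class rates and the test-set composition:
#   TP = num_pos * tpr        (positives correctly flagged)
#   FP = num_neg * (1 - tnr)  (negatives incorrectly flagged)
#   precision = TP / (TP + FP)
def expected_precision(num_pos, num_neg, tpr, tnr):
    tp = num_pos * tpr
    fp = num_neg * (1 - tnr)
    return tp / (tp + fp)

# With rates like the ones we measure below (tpr ≈ 0.66, tnr ≈ 0.90), a 1,000:1
# imbalance pushes precision well under 1%:
print(f"{expected_precision(1, 1000, 0.66, 0.90):.2%}")  # 0.66%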

def get_model_stats(y_true, y_pred):
    """Return the fraction of positives and the fraction of negatives predicted correctly."""
    pos_indices = [i for i, x in enumerate(y_true) if x == 1]
    neg_indices = [i for i, x in enumerate(y_true) if x == 0]
    preds_for_pos_labels = [y_pred[i] for i in pos_indices]
    preds_for_neg_labels = [y_pred[i] for i in neg_indices]
    # Fraction of true positives predicted 1 (recall) and true negatives predicted 0.
    percent_pos_correct = sum(preds_for_pos_labels) / len(preds_for_pos_labels)
    percent_neg_correct = np.sum(np.array(preds_for_neg_labels) == 0) / len(preds_for_neg_labels)
    return percent_pos_correct, percent_neg_correct
percent_pos_correct, percent_neg_correct = get_model_stats(y_test_added, xgb_preds_added)
print(f"{percent_pos_correct=}")
print(f"{percent_neg_correct=}")
percent_pos_correct=0.6607142857142857
percent_neg_correct=0.8988269794721407

These are the per-class stats for the model: the fraction of positives it gets right and the fraction of negatives it gets right. If we want to convert them into precision, recall, and F1 for a specific class distribution, we can recreate predictions at that distribution and score them.

def np_float_to_int(x):
    """Round a float to the nearest integer."""
    return np.rint(x).astype(int)

def get_y_pred(num_pos, num_neg, percent_pos_correct, percent_neg_correct):
    """Recreate a prediction list for num_pos positives followed by num_neg negatives."""
    return (
        [1] * np_float_to_int(num_pos * percent_pos_correct)              # true positives
        + [0] * np_float_to_int(num_pos - num_pos * percent_pos_correct)  # false negatives
        + [0] * np_float_to_int(num_neg * percent_neg_correct)            # true negatives
        + [1] * np_float_to_int(num_neg - num_neg * percent_neg_correct)  # false positives
    )
y_pred_recreated = get_y_pred(num_pos_test, num_neg_test, percent_pos_correct, percent_neg_correct)
y_test_recreated = [1] * num_pos_test + [0] * num_neg_test
get_metrics(y_test_recreated, y_pred_recreated)
Accuracy: 89.86%
Precision: 0.52%
Recall: 66.67%
F1: 1.04%

These are better estimates of the model’s performance on the original distribution, and they’re more precise than what we’d get from the true unbalanced test set alone.
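To put a rough number on “more precise”: the augmented test set has 56 positives instead of 6, so the binomial error bar on the positive-class rate shrinks by about a factor of sqrt(56/6) ≈ 3. A quick sketch, using the same approximation as before (and again not part of the original notebook):

# Illustrative only: the standard error on the positive-class rate scales as 1/sqrt(n_pos).
for n_pos in (6, 56):
    se = np.sqrt(percent_pos_correct * (1 - percent_pos_correct) / n_pos)
    print(f"n_pos={n_pos}: std err {se:.1%}")  # about 19% with 6 positives, about 6% with 56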