This is Part II of two posts demonstrating how to get test metrics from a highly unbalanced dataset. In Part I, I showed how you could theoretically estimate the precision, recall, and F1 score on highly unbalanced data. In this part, I'll do the same thing with a real dataset: the Adult Income Dataset.
import numpy as np
import pandas as pd
from hyperopt import STATUS_OK, Trials, fmin, hp, space_eval, tpe
from hyperopt.pyll.base import scope
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
Let's download the dataset using sklearn's fetch_openml.
dataset = fetch_openml("adult", version=2, target_column=None, as_frame=True)
df = dataset.data
Let’s look at what we’ve got.
df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25.0 | Private | 226802.0 | 11th | 7.0 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
1 | 38.0 | Private | 89814.0 | HS-grad | 9.0 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0.0 | 0.0 | 50.0 | United-States | <=50K |
2 | 28.0 | Local-gov | 336951.0 | Assoc-acdm | 12.0 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0.0 | 0.0 | 40.0 | United-States | >50K |
3 | 44.0 | Private | 160323.0 | Some-college | 10.0 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688.0 | 0.0 | 40.0 | United-States | >50K |
4 | 18.0 | NaN | 103497.0 | Some-college | 10.0 | Never-married | NaN | Own-child | White | Female | 0.0 | 0.0 | 30.0 | United-States | <=50K |
df['class'].value_counts()
<=50K 37155
>50K 11687
Name: class, dtype: int64
Just under a quarter of the rows are people with incomes above $50K. That's unbalanced, but for this post we want to look at a much more extreme example, so we'll make it so that people in the >50K class are really rare.
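Before we do, here's the same class split expressed as fractions, just to confirm the "under a quarter" figure:
# Normalized class counts: <=50K is about 76% of the rows, >50K about 24%.
df['class'].value_counts(normalize=True)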
Caveat: If I were really trying to build an algorithm to classify this data, I would go over the data in great detail. I can already see that education is listed as 11th in one case and HS-grad in another, so it needs to be cleaned up. But in this case, I'm going to ignore all of that and jump right to encoding the data so it can be used in a model. It would also be good practice to split off test data before doing label encoding, but we'll skip that as well.
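For reference, here's a minimal sketch of what leakage-free encoding could look like, using sklearn's OrdinalEncoder (my substitution for illustration; it's not what the rest of this post uses):
from sklearn.preprocessing import OrdinalEncoder

# Sketch only: split first, then fit the encoder on the training rows alone,
# so nothing about the test rows leaks into the encoding.
raw = dataset.data.astype(str)  # treat every column, NaNs included, as categorical strings
raw_train, raw_test = train_test_split(raw, random_state=42)
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = pd.DataFrame(enc.fit_transform(raw_train), columns=raw.columns, index=raw_train.index)
test_enc = pd.DataFrame(enc.transform(raw_test), columns=raw.columns, index=raw_test.index)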
df = df.apply(LabelEncoder().fit_transform)
df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8 | 3 | 19329 | 1 | 6 | 4 | 6 | 3 | 2 | 1 | 0 | 0 | 39 | 38 | 0 |
1 | 21 | 3 | 4212 | 11 | 8 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 49 | 38 | 0 |
2 | 11 | 1 | 25340 | 7 | 11 | 2 | 10 | 0 | 4 | 1 | 0 | 0 | 39 | 38 | 1 |
3 | 27 | 3 | 11201 | 15 | 9 | 2 | 6 | 0 | 2 | 1 | 98 | 0 | 39 | 38 | 1 |
4 | 1 | 8 | 5411 | 15 | 9 | 4 | 14 | 3 | 4 | 0 | 0 | 0 | 29 | 38 | 0 |
Now let's split up the data by class so we can make the imbalance more extreme.
rich_df = df[df['class'].astype(bool)]
poor_df = df[~df['class'].astype(bool)]
We’ll split off some of the rich examples so we can add them in later.
split_rich_df = rich_df.sample(50)
split_rich_df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14269 | 38 | 5 | 17156 | 9 | 12 | 2 | 11 | 0 | 4 | 1 | 0 | 0 | 39 | 38 | 1 |
20938 | 36 | 3 | 6579 | 9 | 12 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 44 | 38 | 1 |
20702 | 36 | 3 | 4131 | 11 | 8 | 2 | 3 | 0 | 4 | 1 | 122 | 0 | 39 | 38 | 1 |
39547 | 49 | 5 | 27502 | 10 | 15 | 2 | 11 | 0 | 4 | 1 | 0 | 79 | 24 | 38 | 1 |
22756 | 21 | 3 | 13081 | 0 | 5 | 4 | 9 | 1 | 4 | 1 | 0 | 89 | 87 | 38 | 1 |
We’ll remove those from the dataset so we don’t have to worry about data leakage.
rich_df = rich_df.drop(split_rich_df.index)
len(poor_df)
37155
Now let’s create our dataset. We’ll create a really high imbalance, like 1,000 : 1.
num_pos = 30
num_neg = 30000
representative_rich_df = rich_df.sample(num_pos, random_state=42)
representative_poor_df = poor_df.sample(num_neg, random_state=42)
rich_df = rich_df.drop(representative_rich_df.index)
poor_df = poor_df.drop(representative_poor_df.index)
df = pd.concat([representative_rich_df, representative_poor_df])
df now contains all the positives and negatives.
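A quick sanity check of the class balance (by construction it should be 30 positives and 30,000 negatives):
# Expect 30 rows of class 1 (>50K) and 30000 rows of class 0 (<=50K).
df['class'].value_counts()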
df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15455 | 13 | 3 | 14744 | 11 | 8 | 2 | 10 | 0 | 4 | 1 | 0 | 0 | 39 | 38 | 1 |
18855 | 17 | 3 | 17049 | 9 | 12 | 2 | 9 | 0 | 4 | 1 | 98 | 0 | 49 | 38 | 1 |
13376 | 26 | 3 | 10902 | 11 | 8 | 2 | 5 | 5 | 4 | 0 | 0 | 0 | 47 | 38 | 1 |
1945 | 19 | 5 | 7851 | 15 | 9 | 2 | 11 | 0 | 4 | 1 | 81 | 0 | 49 | 38 | 1 |
3875 | 7 | 3 | 16177 | 8 | 10 | 2 | 9 | 0 | 4 | 1 | 0 | 0 | 39 | 38 | 1 |
This looks adequate for feeding into a model.
df_train, df_test = train_test_split(df, random_state=42)
df_test.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
42029 | 4 | 8 | 3469 | 15 | 9 | 4 | 14 | 3 | 4 | 1 | 0 | 0 | 34 | 38 | 0 |
27469 | 33 | 3 | 8739 | 11 | 8 | 2 | 6 | 2 | 1 | 0 | 0 | 0 | 39 | 2 | 0 |
39161 | 26 | 3 | 26001 | 2 | 7 | 2 | 0 | 0 | 2 | 1 | 0 | 0 | 39 | 38 | 0 |
13781 | 7 | 5 | 5851 | 15 | 9 | 2 | 11 | 0 | 4 | 1 | 0 | 0 | 39 | 38 | 0 |
19096 | 33 | 3 | 20253 | 4 | 2 | 2 | 13 | 0 | 4 | 1 | 0 | 0 | 36 | 25 | 0 |
Let’s record the number of positives and negatives from our test set to reuse later.
num_pos_test = np.sum(df_test['class'] == 1)
num_neg_test = np.sum(df_test['class'] == 0)
Let's add the rest of the rich examples to the training set. It won't be completely balanced, but it will be close enough.
df_train_balanced = pd.concat([df_train, rich_df])
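If you want to verify the "close enough" claim, a quick check of the class counts in the combined training set (output not shown here) would be:
# Not 50/50, but roughly one positive for every two negatives --
# far better than the ~1000:1 imbalance we built into the test set.
df_train_balanced['class'].value_counts()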
Now let’s shuffle the data.
df_train_balanced = df_train_balanced.sample(frac=1, random_state=42)
y_train = df_train_balanced.pop('class')
y_test = df_test.pop('class')
X_train = df_train_balanced
X_test = df_test
To train a model I’m just going to copy code from the High Performance Models on Tabular Data post.
def train_xgb(params):
    # Fit on the (roughly balanced) training set and score on the held-out test set.
    clf = XGBClassifier(**params, eval_metric='logloss')
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    # hyperopt minimizes the returned loss, so use the negative accuracy
    return {'loss': -accuracy, 'status': STATUS_OK}
num_trials = 50
xgb_space = {
    'max_depth': scope.int(hp.quniform('max_depth', 3, 18, 1)),
    'gamma': hp.uniform('gamma', 1, 9),
    'reg_alpha': hp.quniform('reg_alpha', 40, 180, 1),
    'reg_lambda': hp.uniform('reg_lambda', 0, 1),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
    'min_child_weight': hp.quniform('min_child_weight', 0, 10, 1),
    'n_estimators': 180,
    'seed': 0,
}
trials = Trials()
fmin(fn=train_xgb,
     space=xgb_space,
     algo=tpe.suggest,
     max_evals=num_trials,
     trials=trials,
     rstate=np.random.default_rng(42),
     verbose=False)
{'colsample_bytree': 0.9419741621278979,
'gamma': 5.7022639336370755,
'max_depth': 13.0,
'min_child_weight': 9.0,
'reg_alpha': 129.0,
'reg_lambda': 0.551693541360825}
best_hyperparams = space_eval(xgb_space, trials.argmin)
best_hyperparams
{'colsample_bytree': 0.9419741621278979,
'gamma': 5.7022639336370755,
'max_depth': 13,
'min_child_weight': 9.0,
'n_estimators': 180,
'reg_alpha': 129.0,
'reg_lambda': 0.551693541360825,
'seed': 0}
xgb_clf = XGBClassifier(**best_hyperparams)
xgb_clf.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None, colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.9419741621278979, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, gamma=5.7022639336370755, gpu_id=-1, grow_policy='depthwise', importance_type=None, interaction_constraints='', learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=13, max_leaves=0, min_child_weight=9.0, missing=nan, monotone_constraints='()', n_estimators=180, n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=129.0, reg_lambda=0.551693541360825, ...)
Now we’ve got a trained model. Let’s see how well it does.
xgb_preds = xgb_clf.predict(X_test)
def get_metrics(y_test, y_pred):
    """
    Print standard sklearn metrics
    """
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
    print(f"Precision: {precision_score(y_test, y_pred):.2%}")
    print(f"Recall: {recall_score(y_test, y_pred):.2%}")
    print(f"F1: {f1_score(y_test, y_pred):.2%}")
get_metrics(y_test, xgb_preds)
Accuracy: 89.88%
Precision: 0.65%
Recall: 83.33%
F1: 1.30%
We have some metrics now, but how much can we trust them? Let's look at the number of positive examples in the test set.
num_pos_test = np.sum(y_test)
print(f"{num_pos_test=}")
num_pos_test=6
Only six! So all of our metrics could be pretty far off their true values; there simply aren't enough positives to get a good measure of model quality.
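To put a rough number on that uncertainty, here's a quick illustration (my addition, assuming SciPy >= 1.7 is available for binomtest) of how wide an exact binomial confidence interval on the recall is when it's based on only six positives:
from scipy.stats import binomtest

# 5 of the 6 positives were predicted correctly (recall = 83.33%).
ci = binomtest(k=5, n=6).proportion_ci(confidence_level=0.95)
print(f"95% CI for recall: {ci.low:.1%} to {ci.high:.1%}")  # roughly 36% to 99.6%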
Just like we did in Part I, we'll add additional test cases. Here, they'll be the rich examples that we split off earlier. This will reduce the error on our metrics because we have more positive samples.
split_labels = split_rich_df.pop('class')
X_test_added = pd.concat([X_test, split_rich_df])
y_test_added = pd.concat([y_test, split_labels])
xgb_preds_added = xgb_clf.predict(X_test_added)
get_metrics(y_test_added, xgb_preds_added)
Accuracy: 89.71%
Precision: 4.65%
Recall: 66.07%
F1: 8.69%
We got metrics, but they are not an apples-to-apples comparison with the earlier ones. In particular, the precision (and therefore the F1 score) is far higher. That's because we've changed the ratio of positives to negatives in the test set. To get back the values for the original distribution, we'll have to use the tricks from Part I.
def get_model_stats(y_true, y_pred):
    # True positive rate (recall) and true negative rate (specificity);
    # these per-class rates don't depend on the class balance of the test set.
    pos_indices = [i for i, x in enumerate(y_true) if x == 1]
    neg_indices = [i for i, x in enumerate(y_true) if x == 0]
    preds_for_pos_labels = [y_pred[i] for i in pos_indices]
    preds_for_neg_labels = [y_pred[i] for i in neg_indices]
    percent_pos_correct = sum(preds_for_pos_labels) / len(preds_for_pos_labels)
    percent_neg_correct = np.sum(np.array(preds_for_neg_labels) == 0) / len(preds_for_neg_labels)
    return percent_pos_correct, percent_neg_correct
percent_pos_correct, percent_neg_correct = get_model_stats(y_test_added, xgb_preds_added)
print(f"{percent_pos_correct=}")
print(f"{percent_neg_correct=}")
percent_pos_correct=0.6607142857142857
percent_neg_correct=0.8988269794721407
These are the per-class stats for the model. If we wanted to convert them to an F1 score for a specific class distribution, we could do that.
def np_float_to_int(x):
    return np.rint(x).astype(int)

def get_y_pred(num_pos, num_neg, percent_pos_correct, percent_neg_correct):
    # Recreate a prediction list for num_pos positives and num_neg negatives,
    # assuming the model keeps the same per-class accuracy rates.
    return (
        [1] * np_float_to_int(num_pos * percent_pos_correct)
        + [0] * np_float_to_int(num_pos - num_pos * percent_pos_correct)
        + [0] * np_float_to_int(num_neg * percent_neg_correct)
        + [1] * np_float_to_int(num_neg - num_neg * percent_neg_correct)
    )
y_pred_recreated = get_y_pred(num_pos_test, num_neg_test, percent_pos_correct, percent_neg_correct)
y_test_recreated = [1] * num_pos_test + [0] * num_neg_test
get_metrics(y_test_recreated, y_pred_recreated)
Accuracy: 89.86%
Precision: 0.52%
Recall: 66.67%
F1: 1.04%
These are better estimates of the model's performance, and they're more precise than what we'd get from the handful of positives in the original unbalanced test set.
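As a footnote (my own addition, not code from Part I), the same estimates can be computed directly from the per-class rates without recreating a prediction list, since precision depends only on the expected true-positive and false-positive counts at the target class distribution:
# Expected confusion-matrix counts at the original test distribution.
tp = num_pos_test * percent_pos_correct
fp = num_neg_test * (1 - percent_neg_correct)

precision = tp / (tp + fp)
recall = percent_pos_correct
f1 = 2 * precision * recall / (precision + recall)
print(f"Precision: {precision:.2%}, Recall: {recall:.2%}, F1: {f1:.2%}")
# Small differences from the numbers above come from rounding counts to whole examples.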