This post is a tutorial on working with tabular data using FastAI. One of FastAI’s biggest contributions to working with tabular data is how easy it makes using embeddings for categorical variables. In my experience, embeddings for categorical variables produce significantly better models than alternatives such as one-hot encoding, and the combination of embeddings and neural networks reaches very high performance on tabular data.
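To make that concrete, here is a minimal sketch (my own illustration, not part of the original workflow) of what an embedding for a categorical variable looks like in raw PyTorch: each category index maps to a small learned dense vector rather than a long, sparse one-hot vector. The sizes here are arbitrary.

import torch
import torch.nn as nn

# Hypothetical categorical variable with 9 possible values
# (similar in spirit to workclass), embedded into 5 dimensions.
num_categories = 9
embedding_dim = 5
embedding = nn.Embedding(num_categories, embedding_dim)

# A batch of three category indices...
categories = torch.tensor([0, 3, 8])

# ...becomes a dense 3 x 5 matrix of learned values,
# instead of a sparse 3 x 9 one-hot matrix.
dense_vectors = embedding(categories)
print(dense_vectors.shape)  # torch.Size([3, 5])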
from fastai.tabular.all import *
from pyxtend import struct
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
We’ll use the UCI Adult Data Set where the task is to predict whether a person makes over 50k a year. FastAI makes downloading the dataset easy.
path = untar_data(URLs.ADULT_SAMPLE)
Once it’s downloaded we can load it into a DataFrame.
df = pd.read_csv(path/'adult.csv')
Machine learning practitioners often work with datasets that have already been split into train and test sets. Here we have all of the data, so I’ll split it into train and test sets myself to simulate a pre-defined split.
Part I
train_df, test_df = train_test_split(df, random_state=42)
Let’s take a look at the data.
train_df.head(10)
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | 42 | Private | 70055 | 11th | 7.0 | Married-civ-spouse | NaN | Husband | White | Male | 0 | 0 | 45 | United-States | <50k |
12181 | 25 | Private | 253267 | Some-college | 10.0 | Married-civ-spouse | Adm-clerical | Husband | Black | Male | 0 | 1902 | 36 | United-States | >=50k |
18114 | 53 | Self-emp-not-inc | 145419 | 1st-4th | 2.0 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 7688 | 0 | 67 | Italy | >=50k |
4278 | 37 | State-gov | 354929 | Assoc-acdm | 12.0 | Divorced | Protective-serv | Not-in-family | Black | Male | 0 | 0 | 38 | United-States | <50k |
12050 | 25 | Private | 404616 | Masters | 14.0 | Married-civ-spouse | Farming-fishing | Not-in-family | White | Male | 0 | 0 | 99 | United-States | >=50k |
14371 | 20 | Private | 303565 | Some-college | 10.0 | Never-married | Handlers-cleaners | Own-child | Black | Male | 0 | 0 | 40 | Germany | <50k |
32541 | 24 | Private | 241857 | Some-college | 10.0 | Never-married | Adm-clerical | Not-in-family | Black | Female | 0 | 0 | 35 | United-States | <50k |
3362 | 48 | Private | 398843 | Some-college | 10.0 | Separated | Sales | Unmarried | Black | Female | 0 | 0 | 35 | United-States | <50k |
19009 | 46 | Private | 109227 | Some-college | 10.0 | Divorced | Exec-managerial | Unmarried | White | Female | 0 | 0 | 70 | United-States | <50k |
16041 | 26 | Private | 171114 | Bachelors | 13.0 | Never-married | Exec-managerial | Own-child | White | Female | 0 | 0 | 40 | United-States | <50k |
train_df.describe()
| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
---|---|---|---|---|---|---|
count | 24420.000000 | 2.442000e+04 | 24057.000000 | 24420.000000 | 24420.000000 | 24420.000000 |
mean | 38.578911 | 1.895367e+05 | 10.058361 | 1066.490254 | 86.502457 | 40.393366 |
std | 13.696620 | 1.043135e+05 | 2.580948 | 7243.366967 | 400.848415 | 12.380526 |
min | 17.000000 | 1.228500e+04 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 28.000000 | 1.183052e+05 | 9.000000 | 0.000000 | 0.000000 | 40.000000 |
50% | 37.000000 | 1.784825e+05 | 10.000000 | 0.000000 | 0.000000 | 40.000000 |
75% | 48.000000 | 2.366420e+05 | 12.000000 | 0.000000 | 0.000000 | 45.000000 |
max | 90.000000 | 1.455435e+06 | 16.000000 | 99999.000000 | 4356.000000 | 99.000000 |
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24420 entries, 29 to 23654
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 24420 non-null int64
1 workclass 24420 non-null object
2 fnlwgt 24420 non-null int64
3 education 24420 non-null object
4 education-num 24057 non-null float64
5 marital-status 24420 non-null object
6 occupation 24031 non-null object
7 relationship 24420 non-null object
8 race 24420 non-null object
9 sex 24420 non-null object
10 capital-gain 24420 non-null int64
11 capital-loss 24420 non-null int64
12 hours-per-week 24420 non-null int64
13 native-country 24420 non-null object
14 salary 24420 non-null object
dtypes: float64(1), int64(5), object(9)
memory usage: 3.0+ MB
train_df['salary'].value_counts()
<50k 18537
>=50k 5883
Name: salary, dtype: int64
The first thing to note is that there is missing data; we’ll have to deal with that, and fortunately FastAI has tools that make it easy. We also have a mix of continuous and categorical variables, which we’ll split apart so the categorical data can go through embeddings. Finally, the data is highly imbalanced. We could correct for this, but I’ll skip that for now; the imbalance isn’t so severe that it would stop the network from learning.
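As an aside, if we did want to correct for the imbalance, one common option is to weight the loss by inverse class frequency. Here is a rough sketch, assuming the class counts shown above; the weighted loss could then be passed to the learner via its loss_func argument (none of this is used in the rest of the post).

import torch
from fastai.losses import CrossEntropyLossFlat

# Inverse-frequency class weights based on the value_counts above
# (18537 "<50k" rows vs 5883 ">=50k" rows in the train split).
counts = torch.tensor([18537.0, 5883.0])
class_weights = counts.sum() / (2 * counts)

# A loss that penalizes mistakes on the rarer class more heavily.
# Hypothetical usage: tabular_learner(dls, layers=[20, 10], loss_func=weighted_loss)
weighted_loss = CrossEntropyLossFlat(weight=class_weights)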
Note that the variable we’re trying to predict, salary, is in the DataFrame. That’s fine; we just need to tell cont_cat_split what the dependent variable is so it isn’t included in the training variables.
dep_var = 'salary'
continuous_vars, categorical_vars = cont_cat_split(train_df, dep_var=dep_var)
The cont_cat_split function usually works well, but I always double-check the results to see that they make sense.
train_df[continuous_vars].nunique()
age 72
fnlwgt 17545
education-num 16
capital-gain 116
capital-loss 90
hours-per-week 93
dtype: int64
train_df[categorical_vars].nunique()
workclass 9
education 16
marital-status 7
occupation 15
relationship 6
race 5
sex 2
native-country 41
dtype: int64
Let’s think about the data. One thing that sticks out to me is that native-country has 41 unique values in the train set. This means there’s a good chance a new native-country value will show up in the test set (or after we deploy the model!), which would be a problem if we use embeddings. There are ways to handle unknown categories with embeddings, but it’s easiest to simply remove the column.
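To see whether that is actually a risk here, a quick check (my addition, assuming the train/test split made earlier) is to look for values that appear in the test set but not in the train set:

# Native countries present in the test split but never seen in training.
unseen_countries = set(test_df['native-country'].dropna()) - set(train_df['native-country'].dropna())
print(unseen_countries)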
categorical_vars.remove('native-country')
categorical_vars
['workclass',
'education',
'marital-status',
'occupation',
'relationship',
'race',
'sex']
Now we need to decide what preprocessing to do. We noted that there is missing data, so we’ll need FillMissing to clean that up. We should also Normalize the data. Finally, we’ll use Categorify to transform the categorical variables into something similar to pd.Categorical.
preprocessing = [Categorify, FillMissing, Normalize]
We’ve already split our data because we’re simulating that it was split for us, but we still need to pass a splitter to TabularPandas, so we’ll make one that puts everything in the train set and nothing in the validation set.
def no_split(obj):
    """
    Put everything in the train set
    """
    return list(range(len(obj))), []
splits = no_split(range_of(train_df))
struct(splits)
{tuple: [{list: [int, int, int, '...24420 total']}, {list: []}]}
There are a lot of things in FastAI that don’t work as well without a validation set, like get_preds and the output from training, so I’m going to add one here. This is simple to do.
full_df = pd.concat([train_df, test_df])
val_indices = list(range(len(train_df),len(train_df) + len(test_df)))
ind_splitter = IndexSplitter(val_indices)
splits = ind_splitter(full_df)
Now we need to create a TabularPandas for our data. A TabularPandas is a wrapper for a pandas DataFrame where the continuous, categorical, and dependent variables are known. FastAI uses lots of inheritance, and the inheritance isn’t always intuitive to me, so it’s good to look at the method resolution order to get a sense of what the class is supposed to do. You can do so like this:
TabularPandas.__mro__
(fastai.tabular.core.TabularPandas,
fastai.tabular.core.Tabular,
fastcore.foundation.CollBase,
fastcore.basics.GetAttr,
fastai.data.core.FilteredBase,
object)
If we just wanted to pass the train set, we would use train_df and no_split(range_of(train_df)). But we’re going to pass the validation set as well, so we’ll use full_df and ind_splitter(full_df).
df_wrapper = TabularPandas(full_df, procs=preprocessing, cat_names=categorical_vars, cont_names=continuous_vars,
y_names=dep_var, splits=splits)
Let’s look at some examples to make sure they look right. All the data should be ready for deep learning.
If we wanted to get the data in the familiar X_train, y_train, X_test, y_test format for a scikit-learn model, all we have to do is this:
X_train, y_train = df_wrapper.train.xs, df_wrapper.train.ys.values.ravel()
X_test, y_test = df_wrapper.valid.xs, df_wrapper.valid.ys.values.ravel()
Now the features are in DataFrames and the labels in NumPy arrays, fully ready to be used with a scikit-learn or XGBoost model. We can explore the data to see this.
X_train.head()
| | workclass | education | marital-status | occupation | relationship | race | sex | education-num_na | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | 5 | 2 | 3 | 0 | 1 | 5 | 2 | 1 | 0.249781 | -1.145433 | -1.193588 | -0.147240 | -0.215803 | 0.372095 |
12181 | 5 | 16 | 3 | 2 | 1 | 3 | 2 | 1 | -0.991426 | 0.610962 | -0.022445 | -0.147240 | 4.529230 | -0.354868 |
18114 | 7 | 4 | 3 | 5 | 1 | 5 | 2 | 1 | 1.052916 | -0.422942 | -3.145492 | 0.914167 | -0.215803 | 2.149115 |
4278 | 8 | 8 | 1 | 12 | 2 | 3 | 2 | 1 | -0.115280 | 1.585564 | 0.758317 | -0.147240 | -0.215803 | -0.193321 |
12050 | 5 | 13 | 3 | 6 | 2 | 5 | 2 | 1 | -0.991426 | 2.061897 | 1.539079 | -0.147240 | -0.215803 | 4.733873 |
We can see that the continuous variables are all normalized. This looks good!
y_train[:5]
array([0, 1, 1, 0, 1], dtype=int8)
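Since everything is numeric now, we could sanity-check these arrays by fitting a quick scikit-learn model on them. This is an extra step of my own, and a random forest is just one convenient choice:

from sklearn.ensemble import RandomForestClassifier

# Quick non-neural baseline on the preprocessed features.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(accuracy_score(y_test, rf.predict(X_test)))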
Continuing with FastAI
If we wanted to use the data with a FastAI model, we’d need to create DataLoaders.
batch_size = 128
dls = df_wrapper.dataloaders(bs=batch_size)
Let’s look at our data to make sure it looks right.
batch = next(iter(dls.train))
We are expecting three objects in each batch: the categorical variables, the continuous variables, and the labels. Let’s take a look.
len(batch)
3
cat_vars, cont_vars, labels = batch
cat_vars[:5]
tensor([[ 5, 12, 5, 9, 3, 3, 1, 1],
[ 2, 10, 3, 11, 1, 5, 2, 1],
[ 5, 12, 5, 14, 2, 5, 2, 1],
[ 5, 2, 7, 9, 5, 5, 1, 1],
[ 5, 12, 3, 5, 6, 5, 1, 1]])
cont_vars[:5]
tensor([[-0.5534, 0.0047, -0.4128, -0.1472, -0.2158, -0.0318],
[ 0.3228, 1.7249, 1.1487, -0.1472, -0.2158, -0.0318],
[-0.1883, 1.5283, -0.4128, -0.1472, -0.2158, 0.7760],
[ 0.1768, 1.4803, -1.1936, -0.1472, -0.2158, -0.0318],
[-0.0423, -0.0218, -0.4128, -0.1472, -0.2158, 1.1798]])
labels[:5]
tensor([[0],
[1],
[0],
[0],
[0]], dtype=torch.int8)
Looks good!
Now we make a learner. This data isn’t very complex so we’ll use a relatively small model for it.
learn = tabular_learner(dls, layers=[20,10])
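As a side note, tabular_learner also accepts a metrics argument, so if we wanted accuracy reported alongside the losses each epoch we could build the learner like this instead (a variant mentioned for completeness; the rest of the post uses the plain learner above).

from fastai.metrics import accuracy

# Same small architecture, but with accuracy reported during training.
learn_with_metrics = tabular_learner(dls, layers=[20, 10], metrics=accuracy)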
Let’s fit the model.
learn.fit(4, 1e-2)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.331239 | 0.322867 | 00:02 |
1 | 0.323588 | 0.318893 | 00:01 |
2 | 0.320338 | 0.325158 | 00:01 |
3 | 0.324844 | 0.321952 | 00:01 |
If we hadn’t passed a validation set, we wouldn’t have gotten any valid_loss.
Now we can save the model.
save_path = Path(os.environ['MODELS']) / 'adult_dataset'
os.makedirs(save_path, exist_ok=True)
learn.save(save_path / 'baseline_neural_network')
Path('I:/Models/adult_dataset/baseline_neural_network.pth')
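Note that learn.save stores the model weights (and optimizer state), which is why Part II needs an existing learn object before loading them. An alternative I’m not using here is learn.export, which pickles the entire Learner, including the preprocessing, so it can be restored with load_learner:

# Alternative (not used in this post): export the whole Learner to one file...
learn.export(save_path / 'baseline_neural_network.pkl')
# ...and later restore it without rebuilding the Learner by hand:
# restored_learn = load_learner(save_path / 'baseline_neural_network.pkl')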
Part II
To fully simulate this being a separate test, I’m going to reload the model from the saved weights. Note that we have to create a learn object before we can load the weights into it; in this case we’ll use the same learn as before.
learn.load(save_path / 'baseline_neural_network')
<fastai.tabular.learner.TabularLearner at 0x1e79f48c730>
Let’s look at the model and make sure it loaded correctly.
learn.summary()
TabularModel (Input shape: 128 x 8)
============================================================================
Layer (type) Output Shape Param # Trainable
============================================================================
128 x 6
Embedding 60 True
____________________________________________________________________________
128 x 8
Embedding 136 True
____________________________________________________________________________
128 x 5
Embedding 40 True
____________________________________________________________________________
128 x 8
Embedding 128 True
____________________________________________________________________________
128 x 5
Embedding 35 True
____________________________________________________________________________
128 x 4
Embedding 24 True
____________________________________________________________________________
128 x 3
Embedding 9 True
Embedding 9 True
Dropout
BatchNorm1d 12 True
____________________________________________________________________________
128 x 20
Linear 960 True
ReLU
BatchNorm1d 40 True
____________________________________________________________________________
128 x 10
Linear 200 True
ReLU
BatchNorm1d 20 True
____________________________________________________________________________
128 x 2
Linear 22 True
____________________________________________________________________________
Total params: 1,695
Total trainable params: 1,695
Total non-trainable params: 0
Optimizer used: <function Adam at 0x000001E7A3A0D670>
Loss function: FlattenedLoss of CrossEntropyLoss()
Model unfrozen
Callbacks:
- TrainEvalCallback
- Recorder
- ProgressCallback
Looks good. Let’s look at the test data.
test_df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14160 | 30 | Private | 81282 | HS-grad | 9.0 | Never-married | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States | <50k |
27048 | 38 | Federal-gov | 172571 | Some-college | 10.0 | Divorced | Adm-clerical | Not-in-family | White | Male | 0 | 0 | 40 | United-States | >=50k |
28868 | 40 | Private | 223548 | HS-grad | 9.0 | Married-civ-spouse | Adm-clerical | Husband | White | Male | 0 | 0 | 40 | Mexico | <50k |
5667 | 28 | Local-gov | 191177 | Masters | 14.0 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 0 | 0 | 20 | United-States | >=50k |
7827 | 31 | Private | 210562 | HS-grad | 9.0 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 65 | United-States | <50k |
Because the data is imbalanced we’ll have to adjust our baseline. A completely “dumb” classifier that only guesses the most common class will be right more than 50% of the time. Let’s see what that percentage is.
test_df['salary'].value_counts()
<50k 6183
>=50k 1958
Name: salary, dtype: int64
test_df['salary'].value_counts()[0] / np.sum(test_df['salary'].value_counts())
0.7594890062645867
OK, so about 76% is the baseline we have to beat.
The data looks as expected. Now we follow a similar process to the one we used before.
test_splits = no_split(range_of(test_df))
test_df_wrapper = TabularPandas(test_df, preprocessing, categorical_vars, continuous_vars, splits=test_splits, y_names=dep_var)
Now we can turn that into a DataLoaders object.
Note: if your test set size isn’t divisible by your batch size, you’ll need to pass drop_last=False. If I don’t, I get an error, although I’ve only noticed this happening with the test set.
test_dls = test_df_wrapper.dataloaders(batch_size, drop_last=False)
Now we’ve got everything in place to make predictions.
preds, ground_truth = learn.get_preds(dl=test_dls.train)
Let’s see what they look like.
preds[:5]
tensor([[0.9943, 0.0057],
[0.9559, 0.0441],
[0.6239, 0.3761],
[0.4550, 0.5450],
[0.7262, 0.2738]])
ground_truth[:5]
tensor([[0],
[1],
[0],
[1],
[0]], dtype=torch.int8)
Depending on your last layer, how you convert the model output into a final prediction will differ. In this case we have a probability for each class, so we take the argmax to get the predicted label. If you instead had a single value in the last layer, you could extract the label prediction with np.rint(preds).
You can confirm that these are probabilities by checking that each row sums to 1.
preds.sum(1)
tensor([1.0000, 1.0000, 1.0000, ..., 1.0000, 1.0000, 1.0000])
torch.argmax(preds, dim=1)
tensor([0, 0, 0, ..., 1, 0, 1])
Let’s see what our final accuracy is on the test set.
accuracy_score(ground_truth, torch.argmax(preds, dim=1))
0.851001105515293
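That comfortably beats the roughly 76% majority-class baseline. Because the classes are imbalanced, it’s also worth looking beyond plain accuracy; one extra step (my addition) is scikit-learn’s classification_report, which breaks out precision and recall per class:

from sklearn.metrics import classification_report

# Per-class precision and recall; with this encoding, class 0 is "<50k"
# and class 1 is ">=50k".
print(classification_report(ground_truth.squeeze(), torch.argmax(preds, dim=1)))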