This post is a tutorial on working with tabular data using FastAI. One of FastAI’s biggest contributions to working with tabular data is the ease with which embeddings can be used for categorical variables. In my experience, embeddings for categorical variables produce significantly better models than alternatives such as one-hot encoding, and the combination of embeddings and neural networks can reach very high performance on tabular data.
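
For context, an embedding layer is essentially a lookup table that maps each category index to a learned dense vector. Here is a minimal, standalone PyTorch sketch; the sizes are made up purely for illustration and are not the ones FastAI will choose later.

import torch
from torch import nn

# A hypothetical categorical variable with 10 possible values,
# each mapped to a learned 4-dimensional vector
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
category_indices = torch.tensor([0, 3, 7])
emb(category_indices).shape  # torch.Size([3, 4])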

from fastai.tabular.all import *
from pyxtend import struct
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

We’ll use the UCI Adult Data Set, where the task is to predict whether a person makes over 50k a year. FastAI makes downloading the dataset easy.

path = untar_data(URLs.ADULT_SAMPLE)

Once it’s downloaded we can load it into a DataFrame.

df = pd.read_csv(path/'adult.csv')

Machine learning practitioners often work with datasets that have already been split into train and test sets. Here we have all of the data, so I’m going to split it into train and test sets myself to simulate a pre-defined split.

Part I

train_df, test_df = train_test_split(df, random_state=42)

Let’s take a look at the data.

train_df.head(10)
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
29 42 Private 70055 11th 7.0 Married-civ-spouse NaN Husband White Male 0 0 45 United-States <50k
12181 25 Private 253267 Some-college 10.0 Married-civ-spouse Adm-clerical Husband Black Male 0 1902 36 United-States >=50k
18114 53 Self-emp-not-inc 145419 1st-4th 2.0 Married-civ-spouse Exec-managerial Husband White Male 7688 0 67 Italy >=50k
4278 37 State-gov 354929 Assoc-acdm 12.0 Divorced Protective-serv Not-in-family Black Male 0 0 38 United-States <50k
12050 25 Private 404616 Masters 14.0 Married-civ-spouse Farming-fishing Not-in-family White Male 0 0 99 United-States >=50k
14371 20 Private 303565 Some-college 10.0 Never-married Handlers-cleaners Own-child Black Male 0 0 40 Germany <50k
32541 24 Private 241857 Some-college 10.0 Never-married Adm-clerical Not-in-family Black Female 0 0 35 United-States <50k
3362 48 Private 398843 Some-college 10.0 Separated Sales Unmarried Black Female 0 0 35 United-States <50k
19009 46 Private 109227 Some-college 10.0 Divorced Exec-managerial Unmarried White Female 0 0 70 United-States <50k
16041 26 Private 171114 Bachelors 13.0 Never-married Exec-managerial Own-child White Female 0 0 40 United-States <50k
train_df.describe()
age fnlwgt education-num capital-gain capital-loss hours-per-week
count 24420.000000 2.442000e+04 24057.000000 24420.000000 24420.000000 24420.000000
mean 38.578911 1.895367e+05 10.058361 1066.490254 86.502457 40.393366
std 13.696620 1.043135e+05 2.580948 7243.366967 400.848415 12.380526
min 17.000000 1.228500e+04 1.000000 0.000000 0.000000 1.000000
25% 28.000000 1.183052e+05 9.000000 0.000000 0.000000 40.000000
50% 37.000000 1.784825e+05 10.000000 0.000000 0.000000 40.000000
75% 48.000000 2.366420e+05 12.000000 0.000000 0.000000 45.000000
max 90.000000 1.455435e+06 16.000000 99999.000000 4356.000000 99.000000
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24420 entries, 29 to 23654
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             24420 non-null  int64  
 1   workclass       24420 non-null  object 
 2   fnlwgt          24420 non-null  int64  
 3   education       24420 non-null  object 
 4   education-num   24057 non-null  float64
 5   marital-status  24420 non-null  object 
 6   occupation      24031 non-null  object 
 7   relationship    24420 non-null  object 
 8   race            24420 non-null  object 
 9   sex             24420 non-null  object 
 10  capital-gain    24420 non-null  int64  
 11  capital-loss    24420 non-null  int64  
 12  hours-per-week  24420 non-null  int64  
 13  native-country  24420 non-null  object 
 14  salary          24420 non-null  object 
dtypes: float64(1), int64(5), object(9)
memory usage: 3.0+ MB
train_df['salary'].value_counts()
<50k     18537
>=50k     5883
Name: salary, dtype: int64

The first thing to note is that there is missing data. We’ll have to deal with that; fortunately, FastAI has tools that make this easy. It also looks like we have both continuous and categorical data; we’ll split those apart so we can put the categorical data through embeddings. Finally, the data is highly imbalanced. We could correct for this, but I’ll skip over that for now; the imbalance isn’t so bad that it would completely stop the network from learning.
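
To see exactly how much is missing per column, a quick check (not required for the pipeline, just a sanity check) could look like this:

train_df.isna().sum()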

Note that the variable we’re trying to predict, salary, is still in the DataFrame. That’s fine; we just need to tell cont_cat_split what the dependent variable is so it isn’t included in the training variables.

dep_var = 'salary'
continuous_vars, categorical_vars = cont_cat_split(train_df, dep_var=dep_var)

The cont_cat_split function usually works well, but I always double check the results to see that they make sense.

train_df[continuous_vars].nunique()
age                  72
fnlwgt            17545
education-num        16
capital-gain        116
capital-loss         90
hours-per-week       93
dtype: int64
train_df[categorical_vars].nunique()
workclass          9
education         16
marital-status     7
occupation        15
relationship       6
race               5
sex                2
native-country    41
dtype: int64

Let’s think about the data. One thing that stands out is that native-country has 41 unique values in the train set. This means there’s a good chance a new native-country will appear in the test set (or after we deploy the model!), which would be a problem if we use embeddings. There are ways to handle unseen categories with embeddings, but it’s easiest to simply remove the column.
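
To get a feel for how likely that is, you could look at how rare the least common countries are in the train set (purely an illustrative check):

train_df['native-country'].value_counts().tail()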

categorical_vars.remove('native-country')
categorical_vars
['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex']

Now we need to decide what preprocessing to do. We noted there is missing data, so we’ll use FillMissing to clean that up. We should also Normalize the continuous data. Finally, we’ll use Categorify to transform the categorical variables into something similar to pd.Categorical.

preprocessing = [Categorify, FillMissing, Normalize]

We’ve already split our data because we’re simulating that it’s already been split for us. But we will still need to pass a splitter to TabularPandas, so we’ll make one that puts everything in the train set and nothing in the validation set.

def no_split(obj):
    """
    Put everything in the train set
    """
    return list(range(len(obj))), []
splits = no_split(range_of(train_df))
struct(splits)
{tuple: [{list: [int, int, int, '...24420 total']}, {list: []}]}

A lot of things in FastAI, such as get_preds and the loss reporting during training, don’t work as well without a validation set, so I’m going to add one here. This is simple to do.

full_df = pd.concat([train_df, test_df])
val_indices = list(range(len(train_df),len(train_df) + len(test_df)))
ind_splitter = IndexSplitter(val_indices)
splits = ind_splitter(full_df) 

Now we need to create a TabularPandas for our data. A TabularPandas is a wrapper around a pandas DataFrame that knows which variables are continuous, which are categorical, and which is the dependent variable. FastAI uses lots of inheritance, and the inheritance isn’t always intuitive to me, so it’s good to look at the method resolution order to get a sense of what the class is supposed to do. You can do so like this:

TabularPandas.__mro__
(fastai.tabular.core.TabularPandas,
 fastai.tabular.core.Tabular,
 fastcore.foundation.CollBase,
 fastcore.basics.GetAttr,
 fastai.data.core.FilteredBase,
 object)

If we just wanted to pass the train set, we would use train_df and no_split(range_of(train_df)). But we’re going to pass the validation set as well, so we’ll use full_df and ind_splitter(full_df).

df_wrapper = TabularPandas(full_df, procs=preprocessing, cat_names=categorical_vars, cont_names=continuous_vars,
                   y_names=dep_var, splits=splits)

Let’s look at some examples to make sure they look right. All the data should be ready for deep learning.

If we wanted to get the data in the familiar X_train, y_train, X_test, y_test format for a scikit-learn model, all we have to do is this:

X_train, y_train = df_wrapper.train.xs, df_wrapper.train.ys.values.ravel()
X_test, y_test = df_wrapper.valid.xs, df_wrapper.valid.ys.values.ravel()

Now the data are in a DataFrame fully ready to be used in a scikit-learn or xgboost model. We can explore the data to see this.

X_train.head()
workclass education marital-status occupation relationship race sex education-num_na age fnlwgt education-num capital-gain capital-loss hours-per-week
29 5 2 3 0 1 5 2 1 0.249781 -1.145433 -1.193588 -0.147240 -0.215803 0.372095
12181 5 16 3 2 1 3 2 1 -0.991426 0.610962 -0.022445 -0.147240 4.529230 -0.354868
18114 7 4 3 5 1 5 2 1 1.052916 -0.422942 -3.145492 0.914167 -0.215803 2.149115
4278 8 8 1 12 2 3 2 1 -0.115280 1.585564 0.758317 -0.147240 -0.215803 -0.193321
12050 5 13 3 6 2 5 2 1 -0.991426 2.061897 1.539079 -0.147240 -0.215803 4.733873

We can see that the continuous variables have all been normalized, and FillMissing has added an education-num_na column flagging the rows where education-num was missing. This looks good!
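
If you want to double-check the normalization, the transformed continuous columns should have roughly zero mean and unit standard deviation (a quick sanity check, not part of the original workflow):

X_train[continuous_vars].describe().loc[['mean', 'std']]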

y_train[:5]
array([0, 1, 1, 0, 1], dtype=int8)
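
Since the data is now in plain numeric form, it could be fed straight into a scikit-learn model. As a sketch (RandomForestClassifier is just an arbitrary choice here, not part of the FastAI workflow):

from sklearn.ensemble import RandomForestClassifier

# Fit a simple model on the preprocessed features as a comparison point
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)  # mean accuracy on the held-out set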

Continuing with FastAI

If we wanted to use the data on a FastAI model, we’d need to create DataLoaders.

batch_size = 128
dls = df_wrapper.dataloaders(bs=batch_size)

Let’s look at our data to make sure it looks right.

batch = next(iter(dls.train))

We are expecting three objects in each batch: the categorical variables, the continuous variables, and the labels. Let’s take a look.

len(batch)
3
cat_vars, cont_vars, labels = batch
cat_vars[:5]
tensor([[ 5, 12,  5,  9,  3,  3,  1,  1],
        [ 2, 10,  3, 11,  1,  5,  2,  1],
        [ 5, 12,  5, 14,  2,  5,  2,  1],
        [ 5,  2,  7,  9,  5,  5,  1,  1],
        [ 5, 12,  3,  5,  6,  5,  1,  1]])
cont_vars[:5]
tensor([[-0.5534,  0.0047, -0.4128, -0.1472, -0.2158, -0.0318],
        [ 0.3228,  1.7249,  1.1487, -0.1472, -0.2158, -0.0318],
        [-0.1883,  1.5283, -0.4128, -0.1472, -0.2158,  0.7760],
        [ 0.1768,  1.4803, -1.1936, -0.1472, -0.2158, -0.0318],
        [-0.0423, -0.0218, -0.4128, -0.1472, -0.2158,  1.1798]])
labels[:5]
tensor([[0],
        [1],
        [0],
        [0],
        [0]], dtype=torch.int8)

Looks good!

Now we make a learner. This data isn’t very complex, so we’ll use a relatively small model for it.

learn = tabular_learner(dls, layers=[20,10])

Let’s fit the model.

learn.fit(4, 1e-2)
epoch train_loss valid_loss time
0 0.331239 0.322867 00:02
1 0.323588 0.318893 00:01
2 0.320338 0.325158 00:01
3 0.324844 0.321952 00:01

If we hadn’t passed a validation set, we wouldn’t have gotten any valid_loss values.
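
Relatedly, if we wanted a metric such as accuracy reported alongside the losses each epoch, we could have built the learner with a metrics argument. A sketch (using a different variable name, learn_with_metrics, so it doesn’t replace the model we just trained):

# hypothetical variant of the learner above that also reports accuracy each epoch
learn_with_metrics = tabular_learner(dls, layers=[20, 10], metrics=accuracy)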

Now we can save the model.

save_path = Path(os.environ['MODELS']) / 'adult_dataset'
os.makedirs(save_path, exist_ok=True)
learn.save(save_path / 'baseline_neural_network')
Path('I:/Models/adult_dataset/baseline_neural_network.pth')

Part II

To fully simulate this being a separate test run, I’m going to reload the model from the saved weights. Note that a learn object has to exist before we can load weights into it; in this case we’ll reuse the same learn as before.
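
If we were starting from a completely fresh session, that would look something like the following sketch. It assumes the DataLoaders have been rebuilt exactly as before so the architecture matches the saved weights (fresh_learn is a hypothetical name):

# hypothetical: rebuild an identically configured learner, then load the saved weights
fresh_learn = tabular_learner(dls, layers=[20, 10])
fresh_learn.load(save_path / 'baseline_neural_network')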

learn.load(save_path / 'baseline_neural_network')
<fastai.tabular.learner.TabularLearner at 0x1e79f48c730>

Let’s look at the model and make sure it loaded correctly.

learn.summary()
TabularModel (Input shape: 128 x 8)
============================================================================
Layer (type)         Output Shape         Param #    Trainable 
============================================================================
                     128 x 6             
Embedding                                 60         True      
____________________________________________________________________________
                     128 x 8             
Embedding                                 136        True      
____________________________________________________________________________
                     128 x 5             
Embedding                                 40         True      
____________________________________________________________________________
                     128 x 8             
Embedding                                 128        True      
____________________________________________________________________________
                     128 x 5             
Embedding                                 35         True      
____________________________________________________________________________
                     128 x 4             
Embedding                                 24         True      
____________________________________________________________________________
                     128 x 3             
Embedding                                 9          True      
Embedding                                 9          True      
Dropout                                                        
BatchNorm1d                               12         True      
____________________________________________________________________________
                     128 x 20            
Linear                                    960        True      
ReLU                                                           
BatchNorm1d                               40         True      
____________________________________________________________________________
                     128 x 10            
Linear                                    200        True      
ReLU                                                           
BatchNorm1d                               20         True      
____________________________________________________________________________
                     128 x 2             
Linear                                    22         True      
____________________________________________________________________________

Total params: 1,695
Total trainable params: 1,695
Total non-trainable params: 0

Optimizer used: <function Adam at 0x000001E7A3A0D670>
Loss function: FlattenedLoss of CrossEntropyLoss()

Model unfrozen

Callbacks:
  - TrainEvalCallback
  - Recorder
  - ProgressCallback

Looks good. Let’s look at the test data.

test_df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
14160 30 Private 81282 HS-grad 9.0 Never-married Other-service Unmarried White Female 0 0 40 United-States <50k
27048 38 Federal-gov 172571 Some-college 10.0 Divorced Adm-clerical Not-in-family White Male 0 0 40 United-States >=50k
28868 40 Private 223548 HS-grad 9.0 Married-civ-spouse Adm-clerical Husband White Male 0 0 40 Mexico <50k
5667 28 Local-gov 191177 Masters 14.0 Married-civ-spouse Prof-specialty Wife White Female 0 0 20 United-States >=50k
7827 31 Private 210562 HS-grad 9.0 Married-civ-spouse Transport-moving Husband White Male 0 0 65 United-States <50k

Because the data is imbalanced, we’ll have to adjust our baseline: a completely “dumb” classifier that always guesses the most common class will be right more than 50% of the time. Let’s see what that percentage is.

test_df['salary'].value_counts()
<50k     6183
>=50k    1958
Name: salary, dtype: int64
test_df['salary'].value_counts()[0] / np.sum(test_df['salary'].value_counts())
0.7594890062645867

OK, so roughly 76% is the baseline we have to beat.
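
An equivalent way to compute this majority-class baseline is with scikit-learn’s DummyClassifier (purely a cross-check; the features are ignored, so a dummy array of zeros works fine as X):

from sklearn.dummy import DummyClassifier

# Always predicts the most common class; should match the value_counts calculation above
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(np.zeros((len(test_df), 1)), test_df['salary'])
dummy.score(np.zeros((len(test_df), 1)), test_df['salary'])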

The data looks like we expected. Now we follow a similar process to the one we used before.

test_splits = no_split(range_of(test_df))
test_df_wrapper = TabularPandas(test_df, preprocessing, categorical_vars, continuous_vars, splits=test_splits, y_names=dep_var)

Now we can turn that into a DataLoaders object.

Note: if your test set size isn’t divisible by your batch size, you’ll need to pass drop_last=False. If I don’t, I get an error, although I’ve only noticed this happening with the test set.

test_dls = test_df_wrapper.dataloaders(batch_size, drop_last=False)

Now we’ve got everything in place to make predictions.

preds, ground_truth = learn.get_preds(dl=test_dls.train)

Let’s see what they look like.

preds[:5]
tensor([[0.9943, 0.0057],
        [0.9559, 0.0441],
        [0.6239, 0.3761],
        [0.4550, 0.5450],
        [0.7262, 0.2738]])
ground_truth[:5]
tensor([[0],
        [1],
        [0],
        [1],
        [0]], dtype=torch.int8)

Depending on your last layer, converting the model output into an actual prediction will look different. In this case we have a probability associated with each class, so to get the final prediction we need to take an argmax. If the last layer had produced just a single value, you could extract the label prediction with np.rint(preds).

You can verify that these are probabilities by checking that each prediction sums to 1.

preds.sum(1)
tensor([1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000])
torch.argmax(preds, dim=1)
tensor([0, 0, 0,  ..., 1, 0, 1])

Let’s see what our final accuracy is on the test set.

accuracy_score(ground_truth, torch.argmax(preds, dim=1))
0.851001105515293