This post is a tutorial on working with tabular data using FastAI. One of FastAI’s biggest contributions to working with tabular data is how easy it makes using embeddings for categorical variables. In my experience, embeddings for categorical variables produce significantly better models than alternatives such as one-hot encoding, and the combination of embeddings and neural networks reaches very high performance on tabular data.
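To make that concrete, here is a minimal sketch (my own illustration, not part of the original workflow) of what an embedding for a categorical variable looks like in raw PyTorch: each category index maps to a small learned dense vector rather than a long, sparse one-hot vector. The sizes here are arbitrary.

import torch
import torch.nn as nn

# Hypothetical categorical variable with 9 possible values
# (similar in spirit to workclass), embedded into 5 dimensions.
num_categories = 9
embedding_dim = 5
embedding = nn.Embedding(num_categories, embedding_dim)

# A batch of three category indices...
categories = torch.tensor([0, 3, 8])

# ...becomes a dense 3 x 5 matrix of learned values,
# instead of a sparse 3 x 9 one-hot matrix.
dense_vectors = embedding(categories)
print(dense_vectors.shape)  # torch.Size([3, 5])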
from fastai.tabular.all import *
from pyxtend import struct
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
We’ll use the UCI Adult Data Set where the task is to predict whether a person makes over 50k a year. FastAI makes downloading the dataset easy.
path = untar_data(URLs.ADULT_SAMPLE)
Once it’s downloaded we can load it into a DataFrame.
df = pd.read_csv(path/'adult.csv')
Machine learning practitioners often work with datasets that have already been split into train and test sets. Here we have all of the data, so I’ll split it into train and test sets myself to simulate a pre-defined split.
Part I
train_df, test_df = train_test_split(df, random_state=42)
Let’s take a look at the data.
train_df.head(10)
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | 42 | Private | 70055 | 11th | 7.0 | Married-civ-spouse | NaN | Husband | White | Male | 0 | 0 | 45 | United-States | <50k |
12181 | 25 | Private | 253267 | Some-college | 10.0 | Married-civ-spouse | Adm-clerical | Husband | Black | Male | 0 | 1902 | 36 | United-States | >=50k |
18114 | 53 | Self-emp-not-inc | 145419 | 1st-4th | 2.0 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 7688 | 0 | 67 | Italy | >=50k |
4278 | 37 | State-gov | 354929 | Assoc-acdm | 12.0 | Divorced | Protective-serv | Not-in-family | Black | Male | 0 | 0 | 38 | United-States | <50k |
12050 | 25 | Private | 404616 | Masters | 14.0 | Married-civ-spouse | Farming-fishing | Not-in-family | White | Male | 0 | 0 | 99 | United-States | >=50k |
14371 | 20 | Private | 303565 | Some-college | 10.0 | Never-married | Handlers-cleaners | Own-child | Black | Male | 0 | 0 | 40 | Germany | <50k |
32541 | 24 | Private | 241857 | Some-college | 10.0 | Never-married | Adm-clerical | Not-in-family | Black | Female | 0 | 0 | 35 | United-States | <50k |
3362 | 48 | Private | 398843 | Some-college | 10.0 | Separated | Sales | Unmarried | Black | Female | 0 | 0 | 35 | United-States | <50k |
19009 | 46 | Private | 109227 | Some-college | 10.0 | Divorced | Exec-managerial | Unmarried | White | Female | 0 | 0 | 70 | United-States | <50k |
16041 | 26 | Private | 171114 | Bachelors | 13.0 | Never-married | Exec-managerial | Own-child | White | Female | 0 | 0 | 40 | United-States | <50k |
train_df.describe()
| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
---|---|---|---|---|---|---|
count | 24420.000000 | 2.442000e+04 | 24057.000000 | 24420.000000 | 24420.000000 | 24420.000000 |
mean | 38.578911 | 1.895367e+05 | 10.058361 | 1066.490254 | 86.502457 | 40.393366 |
std | 13.696620 | 1.043135e+05 | 2.580948 | 7243.366967 | 400.848415 | 12.380526 |
min | 17.000000 | 1.228500e+04 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 28.000000 | 1.183052e+05 | 9.000000 | 0.000000 | 0.000000 | 40.000000 |
50% | 37.000000 | 1.784825e+05 | 10.000000 | 0.000000 | 0.000000 | 40.000000 |
75% | 48.000000 | 2.366420e+05 | 12.000000 | 0.000000 | 0.000000 | 45.000000 |
max | 90.000000 | 1.455435e+06 | 16.000000 | 99999.000000 | 4356.000000 | 99.000000 |
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24420 entries, 29 to 23654
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 24420 non-null int64
1 workclass 24420 non-null object
2 fnlwgt 24420 non-null int64
3 education 24420 non-null object
4 education-num 24057 non-null float64
5 marital-status 24420 non-null object
6 occupation 24031 non-null object
7 relationship 24420 non-null object
8 race 24420 non-null object
9 sex 24420 non-null object
10 capital-gain 24420 non-null int64
11 capital-loss 24420 non-null int64
12 hours-per-week 24420 non-null int64
13 native-country 24420 non-null object
14 salary 24420 non-null object
dtypes: float64(1), int64(5), object(9)
memory usage: 3.0+ MB
train_df['salary'].value_counts()
<50k 18537
>=50k 5883
Name: salary, dtype: int64
The first thing to note is that there is missing data; we’ll have to deal with that, and fortunately FastAI has tools that make it easy. We also have a mix of continuous and categorical variables, which we’ll split apart so the categorical data can go through embeddings. Finally, the data is highly imbalanced. We could correct for this, but I’ll skip that for now; the imbalance isn’t so severe that it would stop the network from learning.
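As an aside, if we did want to correct for the imbalance, one common option is to weight the loss by inverse class frequency. Here is a rough sketch, assuming the class counts shown above; the weighted loss could then be passed to the learner via its loss_func argument (none of this is used in the rest of the post).

import torch
from fastai.losses import CrossEntropyLossFlat

# Inverse-frequency class weights based on the value_counts above
# (18537 "<50k" rows vs 5883 ">=50k" rows in the train split).
counts = torch.tensor([18537.0, 5883.0])
class_weights = counts.sum() / (2 * counts)

# A loss that penalizes mistakes on the rarer class more heavily.
# Hypothetical usage: tabular_learner(dls, layers=[20, 10], loss_func=weighted_loss)
weighted_loss = CrossEntropyLossFlat(weight=class_weights)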
Note that the variable we’re trying to predict, salary, is in the DataFrame. That’s fine; we just need to tell cont_cat_split what the dependent variable is so it isn’t included in the training variables.
dep_var = 'salary'
continuous_vars, categorical_vars = cont_cat_split(train_df, dep_var=dep_var)
The cont_cat_split function usually works well, but I always double-check the results to see that they make sense.
train_df[continuous_vars].nunique()
age 72
fnlwgt 17545
education-num 16
capital-gain 116
capital-loss 90
hours-per-week 93
dtype: int64
train_df[categorical_vars].nunique()
workclass 9
education 16
marital-status 7
occupation 15
relationship 6
race 5
sex 2
native-country 41
dtype: int64
Let’s think about the data. One thing that sticks out to me is that native-country has 41 unique values in the train set. This means there’s a good chance a new native-country value will show up in the test set (or after we deploy the model!), which would be a problem if we use embeddings. There are ways to handle unknown categories with embeddings, but it’s easiest to simply remove the column.
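To see whether that is actually a risk here, a quick check (my addition, assuming the train/test split made earlier) is to look for values that appear in the test set but not in the train set:

# Native countries present in the test split but never seen in training.
unseen_countries = set(test_df['native-country'].dropna()) - set(train_df['native-country'].dropna())
print(unseen_countries)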
categorical_vars.remove('native-country')
categorical_vars
['workclass',
'education',
'marital-status',
'occupation',
'relationship',
'race',
'sex']
Now we need to decide what preprocessing to do. We noted that there is missing data, so we’ll need FillMissing to clean that up. We should also Normalize the data. Finally, we’ll use Categorify to transform the categorical variables into something similar to pd.Categorical.
preprocessing = [Categorify, FillMissing, Normalize]
We’ve already split our data because we’re simulating that it was split for us, but we still need to pass a splitter to TabularPandas, so we’ll make one that puts everything in the train set and nothing in the validation set.
def no_split(obj):
    """
    Put everything in the train set
    """
    return list(range(len(obj))), []
splits = no_split(range_of(train_df))
struct(splits)
{tuple: [{list: [int, int, int, '...24420 total']}, {list: []}]}
There are a lot of things in FastAI that don’t work as well without a validation set, like get_preds and the output from training, so I’m going to add one here. This is simple to do.
full_df = pd.concat([train_df, test_df])
val_indices = list(range(len(train_df),len(train_df) + len(test_df)))
ind_splitter = IndexSplitter(val_indices)
splits = ind_splitter(full_df)
Now we need to create a TabularPandas for our data. A TabularPandas is a wrapper for a pandas DataFrame where the continuous, categorical, and dependent variables are known. FastAI uses lots of inheritance, and the inheritance isn’t always intuitive to me, so it’s good to look at the method resolution order to get a sense of what the class is supposed to do. You can do so like this:
TabularPandas.__mro__
(fastai.tabular.core.TabularPandas,
fastai.tabular.core.Tabular,
fastcore.foundation.CollBase,
fastcore.basics.GetAttr,
fastai.data.core.FilteredBase,
object)
If we just wanted to pass the train set, we would use train_df and no_split(range_of(train_df)). But we’re going to pass the validation set as well, so we’ll use full_df and ind_splitter(full_df).
df_wrapper = TabularPandas(full_df, procs=preprocessing, cat_names=categorical_vars, cont_names=continuous_vars,
y_names=dep_var, splits=splits)
Let’s look at some examples to make sure they look right. All the data should be ready for deep learning.
If we wanted to get the data in the familiar X_train, y_train, X_test, y_test format for a scikit-learn model, all we have to do is this:
X_train, y_train = df_wrapper.train.xs, df_wrapper.train.ys.values.ravel()
X_test, y_test = df_wrapper.valid.xs, df_wrapper.valid.ys.values.ravel()
Now the features are in DataFrames and the labels in NumPy arrays, fully ready to be used with a scikit-learn or XGBoost model. We can explore the data to see this.
X_train.head()
| | workclass | education | marital-status | occupation | relationship | race | sex | education-num_na | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | 5 | 2 | 3 | 0 | 1 | 5 | 2 | 1 | 0.249781 | -1.145433 | -1.193588 | -0.147240 | -0.215803 | 0.372095 |
12181 | 5 | 16 | 3 | 2 | 1 | 3 | 2 | 1 | -0.991426 | 0.610962 | -0.022445 | -0.147240 | 4.529230 | -0.354868 |
18114 | 7 | 4 | 3 | 5 | 1 | 5 | 2 | 1 | 1.052916 | -0.422942 | -3.145492 | 0.914167 | -0.215803 | 2.149115 |
4278 | 8 | 8 | 1 | 12 | 2 | 3 | 2 | 1 | -0.115280 | 1.585564 | 0.758317 | -0.147240 | -0.215803 | -0.193321 |
12050 | 5 | 13 | 3 | 6 | 2 | 5 | 2 | 1 | -0.991426 | 2.061897 | 1.539079 | -0.147240 | -0.215803 | 4.733873 |
We can see that the continuous variables are all normalized. This looks good!
y_train[:5]
array([0, 1, 1, 0, 1], dtype=int8)
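Since everything is numeric now, we could sanity-check these arrays by fitting a quick scikit-learn model on them. This is an extra step of my own, and a random forest is just one convenient choice:

from sklearn.ensemble import RandomForestClassifier

# Quick non-neural baseline on the preprocessed features.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(accuracy_score(y_test, rf.predict(X_test)))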
Continuing with FastAI
If we wanted to use the data with a FastAI model, we’d need to create DataLoaders.
batch_size = 128
dls = df_wrapper.dataloaders(bs=batch_size)
Let’s look at our data to make sure it looks right.
batch = next(iter(dls.train))
We are expecting three objects in each batch: the categorical variables, the continuous variables, and the labels. Let’s take a look.
len(batch)
3
cat_vars, cont_vars, labels = batch
cat_vars[:5]
tensor([[ 5, 12, 5, 9, 3, 3, 1, 1],
[ 2, 10, 3, 11, 1, 5, 2, 1],
[ 5, 12, 5, 14, 2, 5, 2, 1],
[ 5, 2, 7, 9, 5, 5, 1, 1],
[ 5, 12, 3, 5, 6, 5, 1, 1]])
cont_vars[:5]
tensor([[-0.5534, 0.0047, -0.4128, -0.1472, -0.2158, -0.0318],
[ 0.3228, 1.7249, 1.1487, -0.1472, -0.2158, -0.0318],
[-0.1883, 1.5283, -0.4128, -0.1472, -0.2158, 0.7760],
[ 0.1768, 1.4803, -1.1936, -0.1472, -0.2158, -0.0318],
[-0.0423, -0.0218, -0.4128, -0.1472, -0.2158, 1.1798]])
labels[:5]
tensor([[0],
[1],
[0],
[0],
[0]], dtype=torch.int8)
Looks good!
Now we make a learner. This data isn’t very complex so we’ll use a relatively small model for it.
learn = tabular_learner(dls, layers=[20,10])
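As a side note, tabular_learner also accepts a metrics argument, so if we wanted accuracy reported alongside the losses each epoch we could build the learner like this instead (a variant mentioned for completeness; the rest of the post uses the plain learner above).

from fastai.metrics import accuracy

# Same small architecture, but with accuracy reported during training.
learn_with_metrics = tabular_learner(dls, layers=[20, 10], metrics=accuracy)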
Let’s fit the model.
learn.fit(4, 1e-2)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.331239 | 0.322867 | 00:02 |
1 | 0.323588 | 0.318893 | 00:01 |
2 | 0.320338 | 0.325158 | 00:01 |
3 | 0.324844 | 0.321952 | 00:01 |
If we hadn’t passed a validation set, we wouldn’t have gotten any valid_loss.
Now we can save the model.
save_path = Path(os.environ['MODELS']) / 'adult_dataset'
os.makedirs(save_path, exist_ok=True)
learn.save(save_path / 'baseline_neural_network')
Path('I:/Models/adult_dataset/baseline_neural_network.pth')
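Note that learn.save stores the model weights (and optimizer state), which is why Part II needs an existing learn object before loading them. An alternative I’m not using here is learn.export, which pickles the entire Learner, including the preprocessing, so it can be restored with load_learner:

# Alternative (not used in this post): export the whole Learner to one file...
learn.export(save_path / 'baseline_neural_network.pkl')
# ...and later restore it without rebuilding the Learner by hand:
# restored_learn = load_learner(save_path / 'baseline_neural_network.pkl')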
Part II
To fully simulate this being a separate test, I’m going to reload the model from the saved weights. Note that we have to create a learn object before we can load the weights into it; in this case we’ll use the same learn as before.
learn.load(save_path / 'baseline_neural_network')
<fastai.tabular.learner.TabularLearner at 0x1e79f48c730>
Let’s look at the model and make sure it loaded correctly.
learn.summary()
TabularModel (Input shape: 128 x 8)
============================================================================
Layer (type) Output Shape Param # Trainable
============================================================================
128 x 6
Embedding 60 True
____________________________________________________________________________
128 x 8
Embedding 136 True
____________________________________________________________________________
128 x 5
Embedding 40 True
____________________________________________________________________________
128 x 8
Embedding 128 True
____________________________________________________________________________
128 x 5
Embedding 35 True
____________________________________________________________________________
128 x 4
Embedding 24 True
____________________________________________________________________________
128 x 3
Embedding 9 True
Embedding 9 True
Dropout
BatchNorm1d 12 True
____________________________________________________________________________
128 x 20
Linear 960 True
ReLU
BatchNorm1d 40 True
____________________________________________________________________________
128 x 10
Linear 200 True
ReLU
BatchNorm1d 20 True
____________________________________________________________________________
128 x 2
Linear 22 True
____________________________________________________________________________
Total params: 1,695
Total trainable params: 1,695
Total non-trainable params: 0
Optimizer used: <function Adam at 0x000001E7A3A0D670>
Loss function: FlattenedLoss of CrossEntropyLoss()
Model unfrozen
Callbacks:
- TrainEvalCallback
- Recorder
- ProgressCallback
Looks good. Let’s look at the test data.
test_df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14160 | 30 | Private | 81282 | HS-grad | 9.0 | Never-married | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States | <50k |
27048 | 38 | Federal-gov | 172571 | Some-college | 10.0 | Divorced | Adm-clerical | Not-in-family | White | Male | 0 | 0 | 40 | United-States | >=50k |
28868 | 40 | Private | 223548 | HS-grad | 9.0 | Married-civ-spouse | Adm-clerical | Husband | White | Male | 0 | 0 | 40 | Mexico | <50k |
5667 | 28 | Local-gov | 191177 | Masters | 14.0 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 0 | 0 | 20 | United-States | >=50k |
7827 | 31 | Private | 210562 | HS-grad | 9.0 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 65 | United-States | <50k |
Because the data is imbalanced we’ll have to adjust our baseline. A completely “dumb” classifier that only guesses the most common class will be right more than 50% of the time. Let’s see what that percentage is.
test_df['salary'].value_counts()
<50k 6183
>=50k 1958
Name: salary, dtype: int64
test_df['salary'].value_counts()[0] / np.sum(test_df['salary'].value_counts())
0.7594890062645867
OK, so about 76% is the baseline we have to beat.
The data looks as expected. Now we follow a similar process to the one we used before.
test_splits = no_split(range_of(test_df))
test_df_wrapper = TabularPandas(test_df, preprocessing, categorical_vars, continuous_vars, splits=test_splits, y_names=dep_var)
Now we can turn that into a DataLoaders object.
Note: if your test set size isn’t divisible by your batch size, you’ll need to pass drop_last=False. If I don’t, I get an error, although I’ve only noticed this happening with the test set.
test_dls = test_df_wrapper.dataloaders(batch_size, drop_last=False)
Now we’ve got everything in place to make predictions.
preds, ground_truth = learn.get_preds(dl=test_dls.train)
Let’s see what they look like.
preds[:5]
tensor([[0.9943, 0.0057],
[0.9559, 0.0441],
[0.6239, 0.3761],
[0.4550, 0.5450],
[0.7262, 0.2738]])
ground_truth[:5]
tensor([[0],
[1],
[0],
[1],
[0]], dtype=torch.int8)
Depending on your last layer, how you convert the model output into a final prediction will differ. In this case we have a probability for each class, so we take the argmax to get the predicted label. If you instead had a single value in the last layer, you could extract the label prediction with np.rint(preds).
You can confirm that these are probabilities by checking that each row sums to 1.
preds.sum(1)
tensor([1.0000, 1.0000, 1.0000, ..., 1.0000, 1.0000, 1.0000])
torch.argmax(preds, dim=1)
tensor([0, 0, 0, ..., 1, 0, 1])
Let’s see what our final accuracy is on the test set.
accuracy_score(ground_truth, torch.argmax(preds, dim=1))
0.851001105515293
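That comfortably beats the roughly 76% majority-class baseline. Because the classes are imbalanced, it’s also worth looking beyond plain accuracy; one extra step (my addition) is scikit-learn’s classification_report, which breaks out precision and recall per class:

from sklearn.metrics import classification_report

# Per-class precision and recall; with this encoding, class 0 is "<50k"
# and class 1 is ">=50k".
print(classification_report(ground_truth.squeeze(), torch.argmax(preds, dim=1)))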