Working with US Census Bureau Data

I found that the US Census API is difficult to work with and even LLMs don’t provide working code for it. So I thought it might be helpful to share some techniques that did work. In this post, I’m going to focus on both raw API calls and the Python wrapper.

Table of Contents

API key
Understanding Tables
Direct API Calls
- Verifying Results
Getting More Data
Getting Population Groups
Using the Census Wrapper
Using the Website
Population Counts Mixed with Dollar Amounts
Note on Dates

API key

import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from census import Census

You need to get an API key. Fortunately, this part is really easy—all you need to do is sign up on their website.

API_KEY = os.getenv('CENSUS_API_KEY')
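One thing to watch: os.getenv returns None when the variable isn't set, and the f-string URLs used later in this post would then send the literal text "None" as the key. Here is a minimal sketch of failing early instead. The helper name require_api_key is my own invention, not part of any Census library.

```python
import os


def require_api_key(env=None, name="CENSUS_API_KEY"):
    """Return the Census API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the examples.")
    return key
```

With this in place, API_KEY = require_api_key() can stand in for the os.getenv call above.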

There are different ways to access the census data, including through their website, through direct API calls, and through a Python wrapper.

Understanding Tables

The Census Bureau organizes its data into tables. You can see the tables available in the 2021 ACS 5-year dataset here: https://api.census.gov/data/2021/acs/acs5/groups/

In the census data, you’ll also see IDs like B19013A_001E. Let’s take a moment to understand what it means. It’s in the following format:

[TABLE_ID][SUBGROUP]_[LINE][SUFFIX]

So, in this case:

B19013 is the table ID and A is the subgroup. Table B19013 is titled “Median Household Income in the Past 12 Months (in 2021 Inflation-Adjusted Dollars)”, and the A subgroup restricts it to households with a White alone householder.
001 refers to the line number within that table, corresponding to a specific row (e.g., “Median household income”).
E stands for Estimate — as opposed to M, which would be the Margin of Error for that estimate.
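The format above can be sketched as a small parser. The regular expression is my reading of the pattern from the two IDs used in this post (B19013A_001E and S0201_214E), so treat it as an assumption. Other Census products may name variables differently, and the function names are hypothetical.

```python
import re

# Assumed pattern: letters+digits (table), optional letters (subgroup),
# underscore, digits (line), letters (suffix such as E or M).
VARIABLE_ID = re.compile(
    r"^(?P<table>[A-Z]+\d+)(?P<subgroup>[A-Z]*)_(?P<line>\d+)(?P<suffix>[A-Z]+)$"
)


def parse_variable_id(variable_id):
    """Split an ID like 'B19013A_001E' into table, subgroup, line and suffix."""
    match = VARIABLE_ID.match(variable_id)
    if match is None:
        raise ValueError(f"Unrecognized variable ID: {variable_id!r}")
    return match.groupdict()


def margin_of_error_id(variable_id):
    """Swap the E (estimate) suffix for M (margin of error)."""
    parts = parse_variable_id(variable_id)
    if parts["suffix"] != "E":
        raise ValueError(f"Expected an estimate (E) variable, got {variable_id!r}")
    return f"{parts['table']}{parts['subgroup']}_{parts['line']}M"
```

For example, parse_variable_id("B19013A_001E") gives table B19013, subgroup A, line 001, suffix E. Whether the matching M variable actually exists for a given table is something you'd still need to check against the API.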

One table with a lot of data is S0201. S0201 refers to the Selected Population Profile (SPP) table series. This is used for detailed demographic, social, economic, and housing data by race, Hispanic origin, tribal group, or ancestry.

Direct API Calls

We’re going to look at the American Community Survey (ACS) Selected Population Profiles (SPP) data.

You need to provide fields and iteration codes. You can find which population is associated with which iteration code here: https://www2.census.gov/programs-surveys/decennial/datasets/summary-file-2/attachment.pdf. For example, it tells you that 013 is the iteration code for Asian Indians.

For fields, we’re using S0201_214E. S0201 is the table and 214 is the line number within the S0201 table, which corresponds to a specific data item. In this case, that item is median household income.

YEAR     = 2022  # latest year that I could find that had everything I was looking for
DATASET  = f"https://api.census.gov/data/{YEAR}/acs/acs1/spp"
FIELDS   = "NAME,S0201_214E"                             # median household income
POP_CODE = "013"                                         # <-- Asian Indian alone
URL      = (f"{DATASET}?get={FIELDS}"
            f"&for=us:1&POPGROUP={POP_CODE}&key={API_KEY}")

resp = requests.get(URL, timeout=30)
rows = resp.json()

print(resp.status_code)
print(resp.text[:500])

200
[["NAME","S0201_214E","POPGROUP","us"],
["United States","152341","013","1"]]

df = pd.DataFrame(rows[1:], columns=rows[0])
df["S0201_214E"] = pd.to_numeric(df["S0201_214E"])
print("Median HH income (Asian-Indian-American, 2022):",
      f"${int(df.at[0,'S0201_214E']):,}")

Median HH income (Asian-Indian-American, 2022): $152,341

Verifying Results

It’s good to have a way to verify the data as well. For example, you can verify some of the results simply by Googling the number and making sure that’s what other people got. By Googling $152,341 you can see other news sites that use the same value and describe it as Indian annual median household income.

Getting More Data

OK, so we can get a single data point from a query, but it’s inefficient to do that for lots of data. Let’s grab data for multiple groups in a single request.

Here we also need a field. We’re going to use S0201_214E. You can see on the SPP variables table that S0201_214E corresponds to “Median household income (dollars)”.

YEAR     = 2022
DATASET  = f"https://api.census.gov/data/{YEAR}/acs/acs1/spp"
FIELDS   = "NAME,S0201_214E,POPGROUP"  # median household income + population group
URL      = (f"{DATASET}?get={FIELDS}"
            f"&for=us:1&key={API_KEY}")
resp = requests.get(URL, timeout=30)
rows = resp.json()

# Convert to pandas DataFrame
income_df = pd.DataFrame(rows[1:], columns=rows[0])
income_df['S0201_214E'] = pd.to_numeric(income_df['S0201_214E'], errors='coerce')

len(income_df)

income_df.head()

	NAME	S0201_214E	POPGROUP	us
0	United States	74755	001	1
1	United States	79933	002	1
2	United States	78636	003	1
3	United States	51374	004	1
4	United States	52238	005	1

The codes are not that helpful directly and need to be converted using the link above. You can find the full dictionary here: code_to_population.py (Gist). We’ll use curl to download it:

!curl -o census_popgroup_dict.py https://gist.githubusercontent.com/jss367/44e041c913f87a11b2830e01e295c241/raw/c54c8ffaf838791c1a1c42fc03d493bbb3fe3b84/gistfile1.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6932  100  6932    0     0  28768      0 --:--:-- --:--:-- --:--:-- 28883

from census_popgroup_dict import code_to_population

income_df['POPGROUP_DESC'] = income_df['POPGROUP'].map(code_to_population)

income_df.head()

	NAME	S0201_214E	POPGROUP	us	POPGROUP_DESC
0	United States	74755	001	1	Total Population
1	United States	79933	002	1	White alone
2	United States	78636	003	1	White alone or in combination with one or more...
3	United States	51374	004	1	Black or African American alone
4	United States	52238	005	1	Black or African American alone or in combinat...

income_df.tail()

	NAME	S0201_214E	POPGROUP	us	POPGROUP_DESC
342	United States	126414	931	1	NaN
343	United States	135643	932	1	NaN
344	United States	55352	946	1	NaN
345	United States	80191	9Z8	1	NaN
346	United States	78411	9Z9	1	NaN

Unfortunately, many codes are missing from the dictionary, so their descriptions come back as NaN. I’m not sure what the issue is at the moment.
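Before dropping those rows, it can help to see exactly which codes have no description. This is a minimal sketch in plain Python; unmapped_codes is a hypothetical helper name.

```python
def unmapped_codes(codes, mapping):
    """Return the sorted, de-duplicated codes that have no entry in `mapping`."""
    return sorted({code for code in codes if code not in mapping})
```

With the objects from this post, that would be unmapped_codes(income_df['POPGROUP'], code_to_population). If every code shows up as unmapped, a likely culprit is a type mismatch, for example the dictionary using integer keys while the API returns strings like "013".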

income_df.dropna(inplace=True)

len(income_df)

income_df.sample(10)

	NAME	S0201_214E	POPGROUP	us	POPGROUP_DESC
129	United States	77024	420	1	Peruvian (237)
146	United States	78234	462	1	Some Other Race alone or in combination with o...
147	United States	72601	463	1	Two or More Races, not Hispanic or Latino
0	United States	74755	001	1	Total Population
132	United States	82993	423	1	Spaniard (200-209)
74	United States	85527	117	1	Asian; Native Hawaiian and Other Pacific Islander
145	United States	75631	461	1	Some Other Race alone, not Hispanic or Latino
113	United States	66241	403	1	Cuban (270-274)
46	United States	76421	060	1	Native Hawaiian and Other Pacific Islander alo...
66	United States	95428	107	1	White; Asian

We can see the Asian Indian data again.

income_df[income_df['POPGROUP'] == '013']

	NAME	S0201_214E	POPGROUP	us	POPGROUP_DESC
8	United States	152341	013	1	Asian Indian alone (400-401)
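Since the single-group query and the bulk query should agree, this is a cheap consistency check. Here's a rough sketch that works on the raw list-of-lists JSON. The helper name value_for_group is hypothetical, and the sample rows are abbreviated copies of the responses shown above.

```python
def value_for_group(rows, popgroup, field):
    """Look up one field for one POPGROUP in a Census-style list of rows.

    The first row is the header; the remaining rows are records.
    """
    header, *records = rows
    code_idx = header.index("POPGROUP")
    field_idx = header.index(field)
    for record in records:
        if record[code_idx] == popgroup:
            return record[field_idx]
    raise KeyError(popgroup)


# Abbreviated copies of the two responses from earlier in the post.
single = [["NAME", "S0201_214E", "POPGROUP", "us"],
          ["United States", "152341", "013", "1"]]
bulk = [["NAME", "S0201_214E", "POPGROUP", "us"],
        ["United States", "74755", "001", "1"],
        ["United States", "152341", "013", "1"]]

assert (value_for_group(single, "013", "S0201_214E")
        == value_for_group(bulk, "013", "S0201_214E"))
```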

Getting Population Groups

The API can also return the population group codes together with their labels.

params = {
    "get": "POPGROUP,POPGROUP_LABEL",  # Request codes and labels
    "for": "us:1",                     # National level
    "key": API_KEY
}

year = 2023

base_url = f"https://api.census.gov/data/{year}/acs/acs1/spp"

# Make the request
response = requests.get(base_url, params=params, timeout=30)

# Check if request was successful
response.raise_for_status()

# Parse JSON response
data = response.json()

# Create DataFrame from the response (skip header row)
popgroups_df = pd.DataFrame(data[1:], columns=data[0])

# Convert to appropriate data types
popgroups_df = popgroups_df.convert_dtypes()

len(popgroups_df)

popgroups_df.sample(10)

	POPGROUP	POPGROUP_LABEL	us
1066	1462	Native Village of Buckland alone	1
1935	21H	Tlingit alone	1
1233	2124	Skull Valley Band of Goshute Indians of Utah a...	1
1262	2193	Upper Chinook alone	1
1523	3885	Rotuman alone	1
3451	2822	Village of Solomon alone or in any combination	1
4873	2907	Cherokee Alabama alone or in any combination	1
1566	563	African	1
1080	095	Mariana Islander alone	1
3867	2590	Central Council of the Tlingit and Haida India...	1
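These labels could also fill in descriptions for codes that a hand-made dictionary doesn't cover. Here's a minimal sketch that turns the raw response into a lookup dict. labels_from_rows is a hypothetical helper, and I'm assuming the labels from one year's endpoint line up with the codes from another year, which I haven't verified.

```python
def labels_from_rows(rows):
    """Build a {POPGROUP code: label} dict from a POPGROUP/POPGROUP_LABEL response.

    The first row is the header; the remaining rows are records.
    """
    header, *records = rows
    code_idx = header.index("POPGROUP")
    label_idx = header.index("POPGROUP_LABEL")
    return {record[code_idx]: record[label_idx] for record in records}
```

With the objects from this post, api_labels = labels_from_rows(data) gives a dict that can be passed to income_df['POPGROUP'].map(api_labels). Requesting the labels for the same year as the income data would be the safer choice.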

Using the Census Wrapper

There is also a census wrapper you can use that’s available for download at https://pypi.org/project/census/. Let’s use it to get some income data.

Let’s look at the B19013 table. You’ll note that not all subgroups are available here. If you want to dig deeper into, say, Asian subgroups, you need to look at a different table.

{
  "name": "B19013A",
  "description": "MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER)",
  "variables": "http://api.census.gov/data/2021/acs/acs5/groups/B19013A.json",
  "universe ": "Households with a householder who is White alone"
}
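Entries shaped like the one above can be searched by keyword, which is easier than scrolling through the full list. This is a rough sketch. find_groups is a hypothetical helper, and I'm assuming the full list comes from the "groups" key of https://api.census.gov/data/2021/acs/acs5/groups.json.

```python
def find_groups(groups, keyword):
    """Return the names of table groups whose description mentions `keyword`."""
    keyword = keyword.lower()
    return [g["name"] for g in groups if keyword in g["description"].lower()]


# Assumed way to obtain `groups` (not run here):
#   groups = requests.get(
#       "https://api.census.gov/data/2021/acs/acs5/groups.json", timeout=30
#   ).json()["groups"]
```

Calling find_groups(groups, "median household income") should then list B19013 and its lettered subgroups, along with any other tables whose descriptions contain that phrase.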

Now we call the census wrapper.

c = Census(API_KEY)

race_data = c.acs5.get(
    ('NAME', 
     'B19013_001E',  # All households
     'B19013A_001E', # White alone
     'B19013B_001E', # Black alone
     'B19013C_001E', # American Indian/Alaska Native alone
     'B19013D_001E', # Asian alone
     'B19013E_001E', # Native Hawaiian/Pacific Islander alone
     'B19013F_001E', # Some other race alone
     'B19013G_001E', # Two or more races
     'B19013H_001E', # White alone, not Hispanic
     'B19013I_001E', # Hispanic/Latino origin (any race)
    ),
    {'for': 'us:*'}
)

race_data

df = pd.DataFrame(race_data)
df

It’s… a little ugly. So we can rename the columns.

df = df.rename(columns={
    'B19013_001E': 'Median_Income_Total',
    'B19013A_001E': 'Median_Income_White_Alone',
    'B19013B_001E': 'Median_Income_Black_Alone',
    'B19013C_001E': 'Median_Income_AmIndian_Alone',
    'B19013D_001E': 'Median_Income_Asian_Alone',
    'B19013E_001E': 'Median_Income_Hawaiian_Alone',
    'B19013F_001E': 'Median_Income_Other_Alone',
    'B19013G_001E': 'Median_Income_TwoOrMore',
    'B19013H_001E': 'Median_Income_White_NonHispanic',
    'B19013I_001E': 'Median_Income_Hispanic',
})

df
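The variable codes appear twice above, once in the request and once in the rename, so the two lists can drift apart. One way around that is to keep a single mapping and derive both from it. This is a sketch of that idea; request_fields and rename_record are hypothetical helper names.

```python
# Single source of truth: Census variable code -> readable column name.
COLUMN_NAMES = {
    "B19013_001E": "Median_Income_Total",
    "B19013A_001E": "Median_Income_White_Alone",
    "B19013B_001E": "Median_Income_Black_Alone",
    "B19013C_001E": "Median_Income_AmIndian_Alone",
    "B19013D_001E": "Median_Income_Asian_Alone",
    "B19013E_001E": "Median_Income_Hawaiian_Alone",
    "B19013F_001E": "Median_Income_Other_Alone",
    "B19013G_001E": "Median_Income_TwoOrMore",
    "B19013H_001E": "Median_Income_White_NonHispanic",
    "B19013I_001E": "Median_Income_Hispanic",
}


def request_fields(column_names):
    """Build the tuple of fields to request: NAME plus every mapped code."""
    return ("NAME", *column_names)


def rename_record(record, column_names):
    """Rename the keys of one result dict; unmapped keys are kept unchanged."""
    return {column_names.get(key, key): value for key, value in record.items()}
```

The request then becomes c.acs5.get(request_fields(COLUMN_NAMES), {'for': 'us:*'}), and the renamed table is pd.DataFrame([rename_record(r, COLUMN_NAMES) for r in race_data]).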

Using the Website

I don’t find the website particularly easy to use, either. You can see some of the same information there, though. Here is the page for the American Community Survey, which contains a lot of their data:

https://data.census.gov/table?q=American+Community+Survey&t=-A0

Here is the ACS data on Asian Indians:

https://data.census.gov/table?t=013:Income+and+Poverty&g=010XX00US

Here’s the same for total population:

https://data.census.gov/table?t=001:Income+and+Poverty&g=010XX00US

You can see in the URL how the iteration codes work. You can either change that value directly or use the filters on the left sidebar.
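The URL pattern can be captured in a tiny helper. I inferred the pattern from the two URLs above only, so the parameter meanings (t for iteration code plus topic, g for geography) are an assumption, and table_url is a hypothetical name.

```python
def table_url(iteration_code, topic="Income and Poverty", geography="010XX00US"):
    """Build a data.census.gov table URL for one iteration code.

    Spaces in the topic become '+', matching the URLs shown above.
    """
    topic_part = topic.replace(" ", "+")
    return f"https://data.census.gov/table?t={iteration_code}:{topic_part}&g={geography}"
```

Here table_url("013") reproduces the Asian Indian link and table_url("001") the total population link.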

Population Counts Mixed with Dollar Amounts

You will also find population counts mixed in with dollar amounts, which can be confusing. In the image below, the per capita income is in dollars, the “With earnings for full-time, year-round workers” (Male and Female) is in number of people, and the “Mean earnings (dollars) for full-time, year-round workers” (Male and Female) is in dollars again.
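One heuristic for telling the two apart in code is to look for "(dollars)" in the variable label, since the labels quoted in this post follow that convention. This is only a sketch based on those labels, and I haven't checked it against every row of the table. The function names are hypothetical.

```python
def is_dollar_amount(label):
    """Guess whether a variable is a dollar amount from its label text."""
    return "(dollars)" in label


def split_by_unit(labels):
    """Split labels into (dollar_amounts, everything_else), preserving order."""
    dollars = [label for label in labels if is_dollar_amount(label)]
    other = [label for label in labels if not is_dollar_amount(label)]
    return dollars, other
```

Anything that lands in the second list still needs a look, since it could be a count, a percentage, or a dollar amount whose label doesn't follow the convention.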

Note on Dates

You might have noticed that I used 2022 in most of the examples above. That’s because the 2023 (and later) data doesn’t seem to be there for every table. Sometimes it is available, though, so I think you just have to check.

YEAR     = 2023 
DATASET  = f"https://api.census.gov/data/{YEAR}/acs/acs1/spp"
FIELDS   = "NAME,S0201_214E"
POP_CODE = "013"
URL      = (f"{DATASET}?get={FIELDS}"
            f"&for=us:1&POPGROUP={POP_CODE}&key={API_KEY}")

empty_resp = requests.get(URL, timeout=30)

print(empty_resp.status_code)
print(empty_resp.text[:500])

You can see that I got a 204 back, indicating that there was no content returned.
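A 204 has an empty body, so calling .json() on it raises an error. That makes it worth checking the status first. Here's a minimal sketch of handling that and falling back to an earlier year. Both function names are hypothetical, and fetch is a stand-in for whatever makes the HTTP request.

```python
import json


def parse_census_response(status_code, text):
    """Return rows as dicts, or an empty list when the API sends no content."""
    if status_code == 204:
        return []
    if status_code != 200:
        raise RuntimeError(f"Census API returned HTTP {status_code}: {text[:200]}")
    if not text.strip():
        return []
    header, *records = json.loads(text)
    return [dict(zip(header, record)) for record in records]


def latest_year_with_data(years, fetch):
    """Try years newest-first and return (year, rows) for the first with data.

    `fetch(year)` must return a (status_code, text) pair.
    Returns (None, []) if no year has data.
    """
    for year in sorted(years, reverse=True):
        rows = parse_census_response(*fetch(year))
        if rows:
            return year, rows
    return None, []
```

In practice, fetch could wrap requests.get with the URL pattern used above and return (resp.status_code, resp.text).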