I found that the US Census API is difficult to work with and even LLMs don’t provide working code for it. So I thought it might be helpful to share some techniques that did work. In this post, I’m going to focus on both raw API calls and the Python wrapper.
Table of Contents
- API key
- Understanding Tables
- Direct API Calls
- Getting More Data
- Getting Population Groups
- Using the Census Wrapper
- Using the Website
- Errors in the Data
- Note on Dates
API key
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from census import Census
You need to get an API key. Fortunately, this part is really easy—all you need to do is sign up on their website.
API_KEY = os.getenv('CENSUS_API_KEY')
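One thing to watch: os.getenv returns None when the variable isn't set, and an f-string will then put the literal text "None" into the request URL. Here is a minimal sketch of a guard that fails early instead. The helper name require_api_key is my own, and it assumes the key lives in CENSUS_API_KEY as above.

```python
import os


def require_api_key(env=None, name="CENSUS_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable to your Census API key"
        )
    return key
```

With this in place, API_KEY = require_api_key() either gives you a non-empty key or stops right away with a readable message.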
There are different ways to access the census data, including through their website, through direct API calls, and through a Python wrapper.
Understanding Tables
Census data is organized into tables. You can see the different tables here: https://api.census.gov/data/2021/acs/acs5/groups/
In the census data, you’ll also see IDs like B19013A_001E. Let’s take a moment to understand what that means. It’s in the following format:
[TABLE_ID][SUBGROUP]_[LINE][SUFFIX]
So, in this case:
- B19013 is the table ID. This table is titled “Median Household Income in the Past 12 Months (in 2021 Inflation-Adjusted Dollars)”.
- A is the subgroup. For B19013, it restricts the table to households with a householder who is White alone.
- 001 refers to the line number within that table, corresponding to a specific row (e.g., “Median household income”).
- E stands for Estimate, as opposed to M, which would be the Margin of Error for that estimate.
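To make the format concrete, here is a small sketch that splits a variable ID into its parts with a regular expression. The pattern and the names (VariableId, parse_variable_id) are my own and only cover the kinds of IDs used in this post; other Census products may use shapes it doesn't handle.

```python
import re
from typing import NamedTuple


class VariableId(NamedTuple):
    table: str     # e.g. "B19013" or "S0201"
    subgroup: str  # e.g. "A", or "" when there is no subgroup letter
    line: str      # e.g. "001"
    suffix: str    # e.g. "E" (estimate) or "M" (margin of error)


# [TABLE_ID][SUBGROUP]_[LINE][SUFFIX]
_VARIABLE_RE = re.compile(r"^([A-Z]+\d+)([A-Z]*)_(\d+)([A-Z]+)$")


def parse_variable_id(variable):
    """Split an ID such as 'B19013A_001E' into table, subgroup, line, suffix."""
    match = _VARIABLE_RE.match(variable)
    if match is None:
        raise ValueError(f"Unrecognized variable ID: {variable!r}")
    return VariableId(*match.groups())


print(parse_variable_id("B19013A_001E"))
print(parse_variable_id("S0201_214E"))
```

The second call shows that the subgroup is optional: S0201_214E has a table, a line, and a suffix, but no subgroup letter.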
One table with a lot of data is S0201. S0201 refers to the Selected Population Profile (SPP) table series. This is used for detailed demographic, social, economic, and housing data by race, Hispanic origin, tribal group, or ancestry.
Direct API Calls
We’re going to look at the American Community Survey (ACS) Selected Population Profiles (SPP) data.
You need to provide fields and iteration codes. You can find which population is associated with which iteration code here: https://www2.census.gov/programs-surveys/decennial/datasets/summary-file-2/attachment.pdf
For example, it tells you that 013 is the iteration code for Asian Indians.
For fields, we’re using S0201_214E. S0201 is the table and _214 is the line number within the S0201 table, which corresponds to a specific data item. This is how we can get median household income.
YEAR = 2022 # latest year that I could find that had everything I was looking for
DATASET = f"https://api.census.gov/data/{YEAR}/acs/acs1/spp"
FIELDS = "NAME,S0201_214E" # median household income
POP_CODE = "013" # <-- Asian Indian alone
URL = (f"{DATASET}?get={FIELDS}"
f"&for=us:1&POPGROUP={POP_CODE}&key={API_KEY}")
resp = requests.get(URL, timeout=30)
rows = resp.json()
print(resp.status_code)
print(resp.text[:500])
200
[["NAME","S0201_214E","POPGROUP","us"],
["United States","152341","013","1"]]
df = pd.DataFrame(rows[1:], columns=rows[0])
df["S0201_214E"] = pd.to_numeric(df["S0201_214E"])
print("Median HH income (Asian-Indian-American, 2022):",
f"${int(df.at[0,'S0201_214E']):,}")
Median HH income (Asian-Indian-American, 2022): $152,341
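The URL above is assembled by hand with f-strings. If you'd rather have the query string built and escaped for you, here is a sketch using the standard library's urlencode. The function name build_spp_url is my own, and "YOUR_KEY" is a placeholder rather than a real key.

```python
from urllib.parse import urlencode


def build_spp_url(year, fields, pop_code, api_key):
    """Build an ACS 1-year SPP query URL for the whole US and one population group."""
    base = f"https://api.census.gov/data/{year}/acs/acs1/spp"
    query = urlencode(
        {
            "get": ",".join(fields),
            "for": "us:1",
            "POPGROUP": pop_code,
            "key": api_key,
        },
        safe=",:",  # keep commas and colons readable instead of percent-encoding them
    )
    return f"{base}?{query}"


print(build_spp_url(2022, ["NAME", "S0201_214E"], "013", "YOUR_KEY"))
```

Passing a params dictionary to requests.get, as the Getting Population Groups section does, achieves the same thing without building the URL yourself.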
Verifying Results
It’s good to have a way to verify the data as well. For example, you can verify some of the results simply by Googling the number and checking that other people got the same value. Googling $152,341 turns up news sites that use the same value and describe it as the annual median household income of Indian Americans.
Getting More Data
OK, so we can get a single data point from a query, but it’s inefficient to do that for lots of data. Let’s grab data for multiple groups in a single request.
Here we also need a field. We’re going to use S0201_214E again. You can see on the SPP variables table that S0201_214E corresponds to “Median household income (dollars)”.
YEAR = 2022
DATASET = f"https://api.census.gov/data/{YEAR}/acs/acs1/spp"
FIELDS = "NAME,S0201_214E,POPGROUP" # median household income + population group
URL = (f"{DATASET}?get={FIELDS}"
f"&for=us:1&key={API_KEY}")
resp = requests.get(URL, timeout=30)
rows = resp.json()
# Convert to pandas DataFrame
income_df = pd.DataFrame(rows[1:], columns=rows[0])
income_df['S0201_214E'] = pd.to_numeric(income_df['S0201_214E'], errors='coerce')
len(income_df)
347
income_df.head()
| | NAME | S0201_214E | POPGROUP | us |
---|---|---|---|---|
0 | United States | 74755 | 001 | 1 |
1 | United States | 79933 | 002 | 1 |
2 | United States | 78636 | 003 | 1 |
3 | United States | 51374 | 004 | 1 |
4 | United States | 52238 | 005 | 1 |
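A note on the numeric conversion: errors='coerce' turns unparseable strings into NaN, but it keeps any numeric placeholder the API sends. As far as I know, the ACS uses large negative values such as -666666666 to flag estimates that couldn't be computed, so treat that as an assumption to check against the ACS documentation. Since a median income can't be negative, this sketch simply treats every negative value as missing. The helper name clean_estimate is my own.

```python
def clean_estimate(raw):
    """Convert a raw API value to int, or None if it is not a usable estimate."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return None  # not a number at all, e.g. None or "N/A"
    # Assumption: negative values are placeholders, never real incomes.
    if value < 0:
        return None
    return value


print([clean_estimate(v) for v in ["152341", "-666666666", None, "N/A"]])
```

In pandas, the equivalent is to mask negative values after pd.to_numeric, for example with income_df['S0201_214E'].where(income_df['S0201_214E'] >= 0).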
The codes are not that helpful directly and need to be converted using the link above. You can find the full dictionary here: code_to_population.py (Gist). We’ll use curl to download it:
!curl -o census_popgroup_dict.py https://gist.githubusercontent.com/jss367/44e041c913f87a11b2830e01e295c241/raw/c54c8ffaf838791c1a1c42fc03d493bbb3fe3b84/gistfile1.txt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 6932 100 6932 0 0 28768 0 --:--:-- --:--:-- --:--:-- 28883
from census_popgroup_dict import code_to_population
income_df['POPGROUP_DESC'] = income_df['POPGROUP'].map(code_to_population)
income_df.head()
| | NAME | S0201_214E | POPGROUP | us | POPGROUP_DESC |
---|---|---|---|---|---|
0 | United States | 74755 | 001 | 1 | Total Population |
1 | United States | 79933 | 002 | 1 | White alone |
2 | United States | 78636 | 003 | 1 | White alone or in combination with one or more... |
3 | United States | 51374 | 004 | 1 | Black or African American alone |
4 | United States | 52238 | 005 | 1 | Black or African American alone or in combinat... |
income_df.tail()
| | NAME | S0201_214E | POPGROUP | us | POPGROUP_DESC |
---|---|---|---|---|---|
342 | United States | 126414 | 931 | 1 | NaN |
343 | United States | 135643 | 932 | 1 | NaN |
344 | United States | 55352 | 946 | 1 | NaN |
345 | United States | 80191 | 9Z8 | 1 | NaN |
346 | United States | 78411 | 9Z9 | 1 | NaN |
Unfortunately, many codes have no entry in the dictionary, so their descriptions come back as NaN. I’m not sure what the issue is at the moment.
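Before dropping those rows, it can help to see exactly which codes have no description, and to fall back on a second source of labels where one exists. Here is a minimal sketch; label_popgroups is my own name, and the api_labels entry is placeholder text standing in for a POPGROUP_LABEL value fetched from the API (see the Getting Population Groups section).

```python
def label_popgroups(codes, primary, fallback=None):
    """Map each code to a label, trying `primary` first and then `fallback`.

    Returns (labels, missing), where `missing` lists codes found in neither.
    """
    fallback = fallback or {}
    labels, missing = {}, []
    for code in codes:
        label = primary.get(code) or fallback.get(code)
        labels[code] = label
        if label is None:
            missing.append(code)
    return labels, missing


# Two entries from the downloaded dictionary, plus a hypothetical API label.
gist_labels = {"001": "Total Population", "013": "Asian Indian alone (400-401)"}
api_labels = {"931": "<POPGROUP_LABEL from the API>"}  # placeholder, not a real label

labels, missing = label_popgroups(["001", "013", "931", "9Z8"], gist_labels, api_labels)
print(missing)
```

Whatever ends up in missing is the list of codes that neither source can explain, which is a much smaller problem to investigate.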
income_df.dropna(inplace=True)
len(income_df)
86
income_df.sample(10)
| | NAME | S0201_214E | POPGROUP | us | POPGROUP_DESC |
---|---|---|---|---|---|
129 | United States | 77024 | 420 | 1 | Peruvian (237) |
146 | United States | 78234 | 462 | 1 | Some Other Race alone or in combination with o... |
147 | United States | 72601 | 463 | 1 | Two or More Races, not Hispanic or Latino |
0 | United States | 74755 | 001 | 1 | Total Population |
132 | United States | 82993 | 423 | 1 | Spaniard (200-209) |
74 | United States | 85527 | 117 | 1 | Asian; Native Hawaiian and Other Pacific Islander |
145 | United States | 75631 | 461 | 1 | Some Other Race alone, not Hispanic or Latino |
113 | United States | 66241 | 403 | 1 | Cuban (270-274) |
46 | United States | 76421 | 060 | 1 | Native Hawaiian and Other Pacific Islander alo... |
66 | United States | 95428 | 107 | 1 | White; Asian |
We can see the Asian Indian data again.
income_df[income_df['POPGROUP'] == '013']
| | NAME | S0201_214E | POPGROUP | us | POPGROUP_DESC |
---|---|---|---|---|---|
8 | United States | 152341 | 013 | 1 | Asian Indian alone (400-401) |
Getting Population Groups
params = {
"get": "POPGROUP,POPGROUP_LABEL", # Request codes and labels
"for": "us:1", # National level
"key": API_KEY
}
year = 2023
base_url = f"https://api.census.gov/data/{year}/acs/acs1/spp"
# Make the request
response = requests.get(base_url, params=params, timeout=30)
# Check if request was successful
response.raise_for_status()
# Parse JSON response
data = response.json()
# Create DataFrame from the response (skip header row)
popgroups_df = pd.DataFrame(data[1:], columns=data[0])
# Convert to appropriate data types
popgroups_df = popgroups_df.convert_dtypes()
len(popgroups_df)
5545
popgroups_df.sample(10)
| | POPGROUP | POPGROUP_LABEL | us |
---|---|---|---|
1066 | 1462 | Native Village of Buckland alone | 1 |
1935 | 21H | Tlingit alone | 1 |
1233 | 2124 | Skull Valley Band of Goshute Indians of Utah a... | 1 |
1262 | 2193 | Upper Chinook alone | 1 |
1523 | 3885 | Rotuman alone | 1 |
3451 | 2822 | Village of Solomon alone or in any combination | 1 |
4873 | 2907 | Cherokee Alabama alone or in any combination | 1 |
1566 | 563 | African | 1 |
1080 | 095 | Mariana Islander alone | 1 |
3867 | 2590 | Central Council of the Tlingit and Haida India... | 1 |
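With more than five thousand groups, you'll usually want to search the labels rather than scroll through them. Here is a small sketch that works directly on the raw JSON rows (a header row followed by data rows). The function name find_popgroups is my own, and the sample rows are copied from the table above.

```python
def find_popgroups(rows, keyword):
    """Return {code: label} for every label containing `keyword` (case-insensitive)."""
    header, *records = rows
    code_idx = header.index("POPGROUP")
    label_idx = header.index("POPGROUP_LABEL")
    keyword = keyword.lower()
    return {
        record[code_idx]: record[label_idx]
        for record in records
        if keyword in (record[label_idx] or "").lower()  # tolerate null labels
    }


sample = [
    ["POPGROUP", "POPGROUP_LABEL", "us"],
    ["21H", "Tlingit alone", "1"],
    ["2193", "Upper Chinook alone", "1"],
    ["3885", "Rotuman alone", "1"],
    ["563", "African", "1"],
]
print(find_popgroups(sample, "tlingit"))
```

On the real response you would call find_popgroups(data, "Indian") or similar, then plug the code you find into the POPGROUP parameter.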
Using the Census Wrapper
There is also a census wrapper you can use that’s available for download at https://pypi.org/project/census/. Let’s use it to get some income data.
Let’s look at the B19013 table. You’ll note that not all subgroups are available here. If you want to dig deeper into, say, Asian subgroups, you need to look at a different table.
{
"name": "B19013A",
"description": "MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER)",
"variables": "http://api.census.gov/data/2021/acs/acs5/groups/B19013A.json",
"universe ": "Households with a householder who is White alone"
}
Now we call the census wrapper.
c = Census(API_KEY)
race_data = c.acs5.get(
('NAME',
'B19013_001E', # Total population
'B19013A_001E', # White alone
'B19013B_001E', # Black alone
'B19013C_001E', # American Indian/Alaska Native alone
'B19013D_001E', # Asian alone
'B19013E_001E', # Native Hawaiian/Pacific Islander alone
'B19013F_001E', # Some other race alone
'B19013G_001E', # Two or more races
'B19013H_001E', # White alone, not Hispanic
'B19013I_001E', # Hispanic/Latino origin (any race)
),
{'for': 'us:*'}
)
race_data
df = pd.DataFrame(race_data)
df
It’s… a little ugly. So we can rename the columns.
df = df.rename(columns={
'B19013_001E': 'Median_Income_Total',
'B19013A_001E': 'Median_Income_White_Alone',
'B19013B_001E': 'Median_Income_Black_Alone',
'B19013C_001E': 'Median_Income_AmIndian_Alone',
'B19013D_001E': 'Median_Income_Asian_Alone',
'B19013E_001E': 'Median_Income_Hawaiian_Alone',
'B19013F_001E': 'Median_Income_Other_Alone',
'B19013G_001E': 'Median_Income_TwoOrMore',
'B19013H_001E': 'Median_Income_White_NonHispanic',
'B19013I_001E': 'Median_Income_Hispanic',
})
df
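The variable list and the rename mapping repeat the same ten IDs, so it's easy for them to drift apart. One option is to generate both from a single dictionary of subgroup letters. This is a sketch with names of my own choosing (SUBGROUPS, variable_for); the labels are the ones used in the rename above.

```python
# Subgroup letter -> label used in the renamed column ("" is the table without a subgroup).
SUBGROUPS = {
    "": "Total",
    "A": "White_Alone",
    "B": "Black_Alone",
    "C": "AmIndian_Alone",
    "D": "Asian_Alone",
    "E": "Hawaiian_Alone",
    "F": "Other_Alone",
    "G": "TwoOrMore",
    "H": "White_NonHispanic",
    "I": "Hispanic",
}


def variable_for(subgroup):
    """Estimate variable for line 001 of table B19013 and the given subgroup letter."""
    return f"B19013{subgroup}_001E"


# Tuple to pass to the wrapper's get() call.
variables = ("NAME",) + tuple(variable_for(s) for s in SUBGROUPS)

# Mapping to pass to df.rename(columns=...).
rename_map = {variable_for(s): f"Median_Income_{label}" for s, label in SUBGROUPS.items()}

print(variables)
print(rename_map["B19013D_001E"])
```

You would then call c.acs5.get(variables, {'for': 'us:*'}) and df.rename(columns=rename_map), and adding or removing a subgroup only touches one place.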
Using the Website
I don’t find the website particularly easy to use, either. You can see some of the same information though. Here is the page for the American Community Survey, which contains a lot of their data:
Here is the ACS data on Asian Indians:
Here’s the same for total population:
You can see in the URL how the iteration codes work. You can either change that value directly or use the filters on the left sidebar.
Errors in the Data
I was surprised to find lots of errors in the data. Here are a couple of examples. Beware, I guess!
Note on Dates
You might have noticed that I used 2022 in the examples above. That’s because the 2023 (and later) data doesn’t seem to be there for every table. Sometimes it is available, though, so I think you just have to check.
YEAR = 2023
DATASET = f"https://api.census.gov/data/{YEAR}/acs/acs1/spp"
FIELDS = "NAME,S0201_214E"
POP_CODE = "013"
URL = (f"{DATASET}?get={FIELDS}"
f"&for=us:1&POPGROUP={POP_CODE}&key={API_KEY}")
empty_resp = requests.get(URL, timeout=30)
print(empty_resp.status_code)
print(empty_resp.text[:500])
You can see that I got a 204 back, indicating that there was no content returned.
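A 204 has an empty body, so calling .json() on that response raises an error; it's worth checking the status code first. Here is a sketch of a helper that walks backward through candidate years until one returns data. The name latest_year_with_data is my own, and the status lookup is passed in as a function so the example runs without network access. The stand-in dictionary uses the two status codes seen in this post.

```python
def latest_year_with_data(years, fetch_status):
    """Return the most recent year whose request comes back 200, or None.

    `fetch_status` takes a year and returns the HTTP status code for that query.
    """
    for year in sorted(years, reverse=True):
        if fetch_status(year) == 200:
            return year
    return None


# Stand-in for real requests: 2022 returned 200 and 2023 returned 204 above.
observed = {2023: 204, 2022: 200}
print(latest_year_with_data([2022, 2023], lambda year: observed[year]))
```

In real use, fetch_status would build the URL for the given year, call requests.get(url, timeout=30), and return its status_code.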