The cdcdata package provides a modern R interface to the
CDC’s open data portal (data.cdc.gov), which runs on the Socrata
platform. Similar packages exist, such as CDCPLACES and RSocrata, but they are
either too limited in scope or too generic for everyday data.cdc.gov use. This
package aims to make programmatic access to data.cdc.gov
user-friendly.
Discovering Datasets
The CDC data portal hosts hundreds of public health datasets. You can search for datasets by keyword, browse by category, or explore by tag.
Search by Keyword
Use cdc_datasets() to search for datasets:
# Search for PLACES data
cdc_datasets("PLACES")
# Search for mortality data
cdc_datasets("mortality", limit = 10)
The function returns a data frame with dataset IDs, names,
descriptions, categories, tags, and row counts. The id
column contains the identifier you’ll use to query the data.
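As a minimal sketch of that hand-off, the snippet below uses a small mock data frame in place of a live search result. Only the id column is confirmed by the text above; the name column and the mock rows are assumptions made for illustration, and the real output of cdc_datasets() has more columns.

```r
# Mock stand-in for a cdc_datasets() result (assumed columns: id, name).
results <- data.frame(
  id = c("swc5-untb", "bi63-dtpu"),
  name = c("PLACES", "NCHS Mortality"),
  stringsAsFactors = FALSE
)

# Pick the identifier of the first dataset whose name mentions "PLACES".
places_id <- results$id[grepl("PLACES", results$name, fixed = TRUE)][1]
print(places_id)

# With cdcdata loaded, the ID would then be passed on, for example:
# cdc_query(places_id, limit = 100)
```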
Browse by Category
Datasets are organized into categories. Use
cdc_categories() to see what’s available:
# List all categories
cdc_categories()
# Find datasets in a specific category
cdc_datasets_by_category("Vaccination")
# Search within a category
cdc_datasets_by_category("COVID-19", query = "cases")
Browse by Tag
Tags provide more granular classification. Use
cdc_tags() and cdc_datasets_by_tag():
# List common tags
cdc_tags()
# Find datasets with a specific tag
cdc_datasets_by_tag("mortality")
# Exact tag matching
cdc_datasets_by_tag("brfss", exact = TRUE)
Querying Data
Once you’ve found a dataset, use cdc_query() to fetch
data. The function supports SoQL (Socrata Query Language) for flexible
filtering, sorting, and aggregation.
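For orientation, SoQL clauses are sent to a Socrata endpoint as query parameters prefixed with "$" (for example $where and $limit). The sketch below builds such a URL with base R only. The helper build_soda_url() is hypothetical and is not part of cdcdata, which may construct its requests differently.

```r
# Hypothetical helper for illustration; it is not part of cdcdata.
build_soda_url <- function(id, ..., domain = "data.cdc.gov") {
  clauses <- list(...)
  # Each SoQL clause becomes a "$"-prefixed, URL-encoded query parameter.
  query <- vapply(
    names(clauses),
    function(nm) {
      value <- utils::URLencode(as.character(clauses[[nm]]), reserved = TRUE)
      paste0("$", nm, "=", value)
    },
    character(1)
  )
  paste0(
    "https://", domain, "/resource/", id, ".json?",
    paste(query, collapse = "&")
  )
}

url <- build_soda_url(
  "swc5-untb",
  where = "stateabbr = 'MN'",
  order = "locationname",
  limit = 100
)
print(url)
```

Seeing the raw URL can help when debugging a filter, since the same address can be opened in a browser.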
Basic Queries
# Fetch first 100 rows from PLACES dataset
cdc_query("swc5-untb", limit = 100)
# Preview a dataset (shortcut for small samples)
cdc_preview("swc5-untb")
Filtering with WHERE
The where parameter accepts SoQL filter expressions:
# Filter by state
cdc_query(
  "swc5-untb",
  where = "stateabbr = 'MN'",
  limit = 100
)
# Multiple conditions
cdc_query(
  "swc5-untb",
  where = "stateabbr = 'MN' AND year = '2023'",
  limit = 100
)
# Numeric comparisons
cdc_query(
  "swc5-untb",
  where = "access2_crudeprev > 20",
  limit = 100
)
# List of values
cdc_query(
  "swc5-untb",
  where = "stateabbr IN ('MN', 'WI', 'IA')",
  limit = 100
)
# Text pattern matching
cdc_query(
  "swc5-untb",
  where = "locationname LIKE '%Minneapolis%'",
  limit = 100
)
Aggregation
Combine select and group for aggregated
queries:
# Count records by state
cdc_query(
  "swc5-untb",
  select = c("stateabbr", "count(*) as n"),
  group = "stateabbr",
  order = "n DESC"
)
# Average prevalence by state
cdc_query(
  "swc5-untb",
  select = c("stateabbr", "avg(access2_crudeprev) as mean_access"),
  where = "year = '2023'",
  group = "stateabbr",
  order = "mean_access DESC"
)
Full-Text Search
Use the q parameter to search across all text
fields:
cdc_query("swc5-untb", q = "diabetes prevention", limit = 50)
Fetching Large Datasets
For datasets larger than 50,000 rows (the API limit per request), use
cdc_fetch(), which handles pagination automatically:
# Fetch all Minnesota data (with progress indicator)
places_mn <- cdc_fetch(
  "swc5-untb",
  where = "stateabbr = 'MN'"
)
# Limit total rows
places_sample <- cdc_fetch(
  "swc5-untb",
  where = "year = '2023'",
  max_rows = 5000
)
# Adjust page size for performance
places_large <- cdc_fetch(
  "swc5-untb",
  where = "stateabbr = 'CA'",
  page_size = 10000
)
CSV Format
By default, queries return JSON, which preserves data types. For
potentially faster downloads of large flat datasets, use
as_csv = TRUE:
# JSON (default) - preserves types
data_json <- cdc_query("swc5-untb", limit = 1000)
# CSV - may be faster for large datasets
data_csv <- cdc_query("swc5-untb", limit = 1000, as_csv = TRUE)
# CSV with fetch
data_csv_large <- cdc_fetch(
  "swc5-untb",
  where = "stateabbr = 'TX'",
  as_csv = TRUE
)
Note that CSV converts all values to character strings, so you may need to convert numeric columns after retrieval.
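One way to do that conversion is sketched below with base R. The data frame csv_like is a mock with placeholder values, standing in for a result fetched with as_csv = TRUE; the column names are borrowed from the earlier examples and the values are not real data.

```r
# Mock of a CSV-style result in which every column is character.
# The values are placeholders, not real PLACES estimates.
csv_like <- data.frame(
  stateabbr = c("MN", "WI"),
  year = c("2023", "2023"),
  access2_crudeprev = c("10.5", "12.0"),
  stringsAsFactors = FALSE
)

# Let R guess column types; as.is = TRUE keeps text columns as character
# instead of turning them into factors.
converted <- type.convert(csv_like, as.is = TRUE)
str(converted)

# Alternatively, convert a single known numeric column explicitly.
csv_like$access2_crudeprev <- as.numeric(csv_like$access2_crudeprev)
```

Explicit conversion is the safer choice for identifier-like columns (for example codes with leading zeros), which automatic guessing would turn into numbers.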
Working with Metadata
Dataset Metadata
Use cdc_metadata() to get detailed information about a
dataset:
meta <- cdc_metadata("swc5-untb")
# Dataset info
meta$name
meta$description
meta$rows
meta$tags
# Column information
meta$columns
Quick Helpers
Several helper functions make common tasks easier:
# Get row count (useful before fetching large datasets)
cdc_count("swc5-untb")
cdc_count("swc5-untb", where = "stateabbr = 'MN'")
# List column names
cdc_columns("swc5-untb")
# Get distinct values for a column
cdc_distinct("swc5-untb", "stateabbr")
cdc_distinct("swc5-untb", "year")
# Open dataset in browser
cdc_browse("swc5-untb")
Authentication
Anonymous API requests are limited to approximately 1,000 requests per hour. For higher limits (~40,000/hour), register for a free app token at data.cdc.gov.
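Because large downloads are paginated, the number of requests a fetch needs is worth estimating against these limits. The sketch below is plain arithmetic using the approximate figures quoted in this document; the helper names and the row count are hypothetical and nothing here calls cdcdata.

```r
# Figures quoted in this document; treat them as approximate.
max_rows_per_request <- 50000   # API limit per request
anonymous_hourly_limit <- 1000  # approximate anonymous requests per hour

# Number of paginated requests needed to download n_rows rows.
requests_needed <- function(n_rows, page_size = max_rows_per_request) {
  stopifnot(n_rows >= 0, page_size > 0)
  ceiling(n_rows / page_size)
}

# Row offsets a paginated download would step through (0-based).
page_offsets <- function(n_rows, page_size = max_rows_per_request) {
  if (n_rows == 0) return(numeric(0))
  seq(0, n_rows - 1, by = page_size)
}

n <- 2300000  # hypothetical row count, e.g. a value reported by cdc_count()
requests_needed(n)                     # 46 requests at the maximum page size
requests_needed(n, page_size = 10000)  # 230 requests with smaller pages
requests_needed(n) <= anonymous_hourly_limit
```

If an estimate approaches the anonymous limit, an app token raises the ceiling. A token can be set in either of the following ways.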
# Set token for current session
cdc_app_token("your-app-token-here")
# Or set in .Renviron for persistence across sessions
# CDC_APP_TOKEN=your-app-token-here
SoQL Quick Reference
SoQL (Socrata Query Language) is similar to SQL. Here’s a quick reference:
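As a recap drawn only from the where examples earlier in this document (it is not the full SoQL specification), the operators used so far can be collected into a small lookup table:

```r
# Operators that appear in this document's where examples.
soql_examples <- data.frame(
  operator = c("=", "AND", ">", "IN", "LIKE"),
  example = c(
    "stateabbr = 'MN'",
    "stateabbr = 'MN' AND year = '2023'",
    "access2_crudeprev > 20",
    "stateabbr IN ('MN', 'WI', 'IA')",
    "locationname LIKE '%Minneapolis%'"
  ),
  stringsAsFactors = FALSE
)

# Left-align the text columns for easier reading.
print(soql_examples, right = FALSE)
```

Any of the example strings can be passed as the where argument shown in the filtering section.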
Common Datasets
Here are some popular CDC datasets to explore:
| ID | Name | Description |
|---|---|---|
| swc5-untb | PLACES | Local health data at county/census tract level |
| 5hvh-ygtt | BRFSS | Behavioral Risk Factor Surveillance |
| 9mfq-cb36 | COVID-19 Cases | Case surveillance public use data |
| bi63-dtpu | NCHS Mortality | Multiple cause of death data |
| n8mc-b4w4 | Vaccinations | Vaccination coverage data |
Discover more at data.cdc.gov or
use cdc_datasets().
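When keeping a personal list of dataset IDs like the ones in the table, a quick format check can catch typos before a request is sent. Socrata identifiers follow a "four-by-four" pattern (four alphanumeric characters, a hyphen, four more). The helper is_dataset_id() below is a hypothetical sketch and is not part of cdcdata; it checks only the format, not whether the dataset exists.

```r
# IDs copied from the table above, keyed by dataset name.
common_ids <- c(
  "PLACES" = "swc5-untb",
  "BRFSS" = "5hvh-ygtt",
  "COVID-19 Cases" = "9mfq-cb36",
  "NCHS Mortality" = "bi63-dtpu",
  "Vaccinations" = "n8mc-b4w4"
)

# Hypothetical helper: TRUE when x looks like a four-by-four identifier.
is_dataset_id <- function(x) {
  grepl("^[a-z0-9]{4}-[a-z0-9]{4}$", x)
}

is_dataset_id(common_ids)   # all entries match the pattern
is_dataset_id("PLACES")     # a dataset name is not an ID
```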
Example Workflow
Here’s a complete example analyzing health access data:
library(cdcdata)
# 1. Find the dataset
cdc_datasets("PLACES health")
# 2. Check the metadata
meta <- cdc_metadata("swc5-untb")
print(meta$columns)
# 3. Check how much data we're dealing with
cdc_count("swc5-untb", where = "year = '2023' AND stateabbr = 'MN'")
# 4. Fetch the data
mn_health <- cdc_fetch(
  "swc5-untb",
  select = c("locationname", "access2_crudeprev", "checkup_crudeprev"),
  where = "year = '2023' AND stateabbr = 'MN'",
  order = "access2_crudeprev DESC"
)
# 5. Analyze
summary(mn_health$access2_crudeprev)
Disclaimer
This is an independent, open-source project and is not affiliated with, endorsed by, or representative of the Centers for Disease Control and Prevention (CDC) or the U.S. Government. The author is a federal contractor supporting the CDC but developed this package independently. All data accessed through this package is publicly available via data.cdc.gov.