How we know what we know.
The Buildout is an automated pipeline that turns U.S. federal spending records into something a person can actually read. This page explains where the data comes from, what we do to it, and what we don't do.
Every federal contract, grant, rule, and opportunity, normalized into one database and synthesized into a weekly brief.
The U.S. federal government is the single largest purchaser of goods and services on Earth. In FY2024 it obligated roughly $760 billion in prime contracts alone, and another $1.2 trillion in grants, loans, and direct payments to state agencies, universities, and individuals.
All of that money flows through public databases: USAspending.gov, the Federal Register, Grants.gov, SAM.gov, and agency RSS feeds. The data is free. It is also nearly impossible to read. The interfaces are slow, the schemas are inconsistent, and the same recipient appears under twelve different name variants. The Buildout is a thin, opinionated layer on top of all of it: one ingested record per award, one canonical entity per recipient, a sector classification on every award, and a weekly editorial brief.
The pipeline runs continuously against five upstream sources:
- USAspending.gov — every prime contract, grant, loan, direct payment, and IDV the federal government publishes. We page through the API by award-type group (the API rejects mixed groups in one call) and persist the full payload.
- Federal Register — rules, proposed rules, notices, and presidential documents, with full title, abstract, citation, and effective date.
- Grants.gov — every posted and forecasted grant opportunity, including CFDA / assistance-listing numbers and close dates.
- SAM.gov — active contract opportunities and award notices (requires an API key from sam.gov).
- Federal agency RSS — press releases from a curated set of federal agencies, deduplicated by feed and GUID.
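As a rough illustration of paging by award-type group, here is a minimal sketch. The group-to-code mapping reflects our understanding of the USAspending award-type codes and should be checked against the API documentation. The request body shape, the `page_metadata.hasNext` field, and the `fetch_page` callable are assumptions made for the example, not a description of the production ingest.

```python
from typing import Callable, Dict, Iterator, List

# Assumed award-type groups; verify the codes against the USAspending API docs.
AWARD_TYPE_GROUPS: Dict[str, List[str]] = {
    "contracts": ["A", "B", "C", "D"],
    "idvs": ["IDV_A", "IDV_B", "IDV_B_A", "IDV_B_B", "IDV_B_C",
             "IDV_C", "IDV_D", "IDV_E"],
    "grants": ["02", "03", "04", "05"],
    "direct_payments": ["06", "10"],
    "loans": ["07", "08"],
}

# Hypothetical transport: takes a request body, returns the decoded JSON response.
FetchPage = Callable[[dict], dict]


def iter_awards(fetch_page: FetchPage, page_size: int = 100) -> Iterator[dict]:
    """Yield every award, one group at a time, never mixing groups in a call."""
    for group, codes in AWARD_TYPE_GROUPS.items():
        page = 1
        while True:
            body = {
                "filters": {"award_type_codes": codes},
                "page": page,
                "limit": page_size,
            }
            response = fetch_page(body)
            for record in response.get("results", []):
                # Keep the full raw payload alongside the group label.
                yield {"group": group, "payload": record}
            if not response.get("page_metadata", {}).get("hasNext", False):
                break
            page += 1
```

Passing the transport in as a callable keeps the paging logic testable without network access; a real `fetch_page` would wrap an HTTP POST.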
Every API call is cached on disk by request hash so re-running the ingest is free. Every record we write keeps its full raw payload — if a field is wrong downstream, the source of truth is one column away.
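A request-hash cache of this kind can be sketched as follows. The file layout, the function names, and the use of file modification time for expiry are assumptions made for illustration; the 12-hour lifetime is the cache window stated elsewhere on this page.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable, Optional

TWELVE_HOURS = 12 * 60 * 60  # cache lifetime in seconds


def request_hash(method: str, url: str, body: Optional[dict] = None) -> str:
    """Hash a request so that logically identical calls share one cache key."""
    canonical = json.dumps(
        {"method": method.upper(), "url": url, "body": body},
        sort_keys=True,             # key order in the body must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cached_call(
    cache_dir: Path,
    method: str,
    url: str,
    body: Optional[dict],
    fetch: Callable[[], dict],      # hypothetical: performs the real HTTP call
    max_age_seconds: float = TWELVE_HOURS,
    now: Optional[float] = None,
) -> dict:
    """Return the cached response if it is fresh, otherwise fetch and store it."""
    now = time.time() if now is None else now
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{request_hash(method, url, body)}.json"
    if path.exists() and now - path.stat().st_mtime < max_age_seconds:
        return json.loads(path.read_text(encoding="utf-8"))
    response = fetch()
    path.write_text(json.dumps(response), encoding="utf-8")
    return response
```

Because the key is derived from the canonicalized request rather than from the URL alone, two POSTs to the same endpoint with different filters get separate cache entries.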
The same recipient appears in federal data under many name variants: Lockheed Martin Corp, LOCKHEED MARTIN CORPORATION, Lockheed Martin Corp., and so on. We resolve them in two steps:
- UEI match. Every modern federal recipient has a Unique Entity Identifier — a 12-character federal-authoritative key. If two records share a UEI, they are the same entity.
- Normalized-name match. For records without a UEI we uppercase the name, strip punctuation, collapse whitespace, and canonicalize legal suffixes (CORPORATION→CORP, L.L.C.→LLC, etc.). Two records that produce the same normalized string are linked to the same entity.
We don't fuzzy-match across the UEI boundary: a record without a UEI is never merged into an entity we've already identified by UEI. The dedupe rate on our current corpus is approximately 36%.
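The two-step resolution can be sketched roughly as below. The suffix table is a small illustrative subset, the class and function names are hypothetical, and the rule that name-matched entities live in a separate index from UEI-matched ones is our reading of the UEI boundary described above.

```python
import re
from typing import Dict, Optional

# Illustrative subset of legal-suffix canonicalizations (an assumption).
SUFFIXES = {
    "CORPORATION": "CORP",
    "INCORPORATED": "INC",
    "COMPANY": "CO",
    "LIMITED": "LTD",
}


def normalize_name(name: str) -> str:
    """Uppercase, strip punctuation, collapse whitespace, canonicalize suffixes."""
    upper = name.upper()
    no_punct = re.sub(r"[^\w\s]", "", upper)   # "L.L.C." becomes "LLC"
    tokens = no_punct.split()                   # split() also collapses whitespace
    return " ".join(SUFFIXES.get(token, token) for token in tokens)


class EntityResolver:
    """Assigns entity ids: by UEI when present, by normalized name otherwise."""

    def __init__(self) -> None:
        self._by_uei: Dict[str, int] = {}
        self._by_name: Dict[str, int] = {}
        self._next_id = 1

    def _new_id(self) -> int:
        entity_id = self._next_id
        self._next_id += 1
        return entity_id

    def resolve(self, name: str, uei: Optional[str] = None) -> int:
        if uei:
            key = uei.strip().upper()
            if key not in self._by_uei:
                self._by_uei[key] = self._new_id()
            return self._by_uei[key]
        # No UEI: only the name index is consulted, so a UEI-less record
        # never attaches to an entity that was identified by UEI.
        key = normalize_name(name)
        if key not in self._by_name:
            self._by_name[key] = self._new_id()
        return self._by_name[key]
```

With this sketch, `Lockheed Martin Corp.` and `LOCKHEED MARTIN CORPORATION` without a UEI resolve to one entity, while a record carrying a UEI resolves to its own entity regardless of name.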
Every award is run through a versioned prompt that returns a structured JSON object with: a sector tag, a plain-English purpose summary, a one-sentence note on strategic significance, a supply-chain implication, and a U.S.–China-competition angle when relevant. The current sector taxonomy is aerospace, defense, energy, biotech, semiconductors, critical-minerals, ai-infrastructure, transportation, telecom, and other.
Every classification is stored with the prompt name, prompt version, and model identifier that produced it. When we bump a prompt version, prior classifications stay in the database so we can back-test changes against the old output.
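One way to picture a stored classification is a record that carries the model's structured output together with its provenance. The field names, the JSON keys, and the `parse_classification` helper below are hypothetical; only the sector list and the three provenance fields come from the description above.

```python
import json
from dataclasses import dataclass
from typing import Optional

SECTORS = frozenset({
    "aerospace", "defense", "energy", "biotech", "semiconductors",
    "critical-minerals", "ai-infrastructure", "transportation",
    "telecom", "other",
})


@dataclass(frozen=True)
class Classification:
    award_id: str
    sector: str
    purpose: str                 # plain-English purpose summary
    significance: str            # one-sentence strategic note
    supply_chain: str            # supply-chain implication
    china_angle: Optional[str]   # present only when relevant
    prompt_name: str             # provenance: which prompt produced this
    prompt_version: str
    model_id: str


def parse_classification(
    raw_json: str,
    award_id: str,
    prompt_name: str,
    prompt_version: str,
    model_id: str,
) -> Classification:
    """Validate the model's JSON and attach provenance before storage."""
    data = json.loads(raw_json)
    sector = data["sector"]
    if sector not in SECTORS:
        raise ValueError(f"unknown sector: {sector!r}")
    return Classification(
        award_id=award_id,
        sector=sector,
        purpose=data["purpose"],
        significance=data["significance"],
        supply_chain=data["supply_chain"],
        china_angle=data.get("china_angle"),
        prompt_name=prompt_name,
        prompt_version=prompt_version,
        model_id=model_id,
    )
```

Keying stored rows on the award id plus prompt version, rather than on the award id alone, is what lets old and new classifications sit side by side for back-testing.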
We use a small frontier model for the bulk of classifications (current cost: roughly one-tenth of a U.S. cent per award) and reserve larger models for synthesis tasks — the weekly newsletter intro, per-entity narratives, deep dives.
- We don't take editorial direction from recipients. Contractors, grantees, agencies, and lobbying firms cannot change how their data is presented. Every page is generated from the public record.
- We don't anonymize or aggregate to obscure. Names, dollar amounts, and dates are shown verbatim. If a contract is public, it stays public here.
- We don't editorialize past what the data supports. LLM-written summaries reference only the fields in the record. They don't speculate about motives, performance, or politics.
- We don't sell access to anyone's identity. No tracking pixels, no third-party analytics in the newsletter, no marketing attribution.
Ingestion runs continuously. The HTTP cache holds responses for 12 hours, which is the practical upper bound on data staleness on this site. The weekly newsletter is generated and published on Sundays (UTC).
If you spot a record that looks wrong, check it against the source of truth directly: USAspending.gov, federalregister.gov, or grants.gov. We don't mutate fields, so what you see is what the source said. If the source corrects a record, we re-ingest it. Persistent errors should be reported to the upstream agency; we'll mirror the correction once it lands in the public record.
Or skip the sources and start with the synthesis: read this week's brief →