If you were for some reason in search of an archetype of data as a public service, the American Community Survey (ACS) would be a good place to start. We’d venture to bet that there is no other dataset that contributes to projects in so many arenas, from journalism to non-profit aid allocation and corporate strategy—in addition to guiding $400 billion in government expenditures annually.
The volume of data released by the Census Bureau (1000+ tables for one year’s estimates of the ACS alone!) is staggering. We are impressed by the quality of the data, the pages of detailed documentation, the fact that the ACS has its very own support line, and yet...
And yet, none of it really serves the needs of users looking to access and use the data in bulk. By “in bulk” we mean we aren’t looking for a demographic factoid for one specific city, but rather, we want every demographic factoid for every city—and the data for all the states and census tracts too. This is why there are a number of third parties, including us at Enigma, who republish the data in an effort to make it easier for others to put it to use. Certainly, as employees of an operational data management company who help other companies put complex datasets like ACS to use, we don’t mind.
That said, in the hope of making the data easier for all to use, we have a few suggestions for you, Census Bureau. We've penned a full paper on the subject, but here are the key points:
1. Open Formats for Open Data: Many of our complications in handling the ACS data come from the fact that even the plain text data files appear to have been optimized for use with Excel. But why? Excel isn’t a great tool for bulk data. It’s proprietary software, and the data released on the FTP is open data. So, could we please stick with open formats for open data?
2. Descriptions and Data Living Side by Side: Despite the wealth of PDF documents detailing the nuances of data delivery, we found it unnecessarily complicated to construct even basic data descriptions such as table descriptions and column names. Metadata should live with the data, not all over the place. As for the format of the metadata... see (1).
3. Structure of the Data Made Clear: Bear with us here as we dive into the nitty-gritty. The ACS is in essence one large table, or at least could be: the question categories constitute the columns and the sampling geographies form the rows, with a related population share or count estimate at each intersection. Sure, this table would be a bit unwieldy to use, but we don’t see any reason not to release a single data file per thematic table per release.
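To make the single-table idea in (3) concrete, here is a minimal Python sketch using only the standard library. It pivots estimates into a table with one row per geography and one column per question category, then writes that table as plain CSV with the column names in the header row. The geography names, category names, and numbers are all made up for illustration, and the helper functions `to_wide` and `write_csv` are hypothetical; none of this reflects the Census Bureau's actual file layout.

```python
import csv
import io

# Hypothetical long-format estimates: (geography, category, estimate).
# Names and numbers are invented for illustration; they are not ACS values.
records = [
    ("State A", "total_population", 1000),
    ("State A", "households", 400),
    ("State B", "total_population", 2500),
    ("State B", "households", 900),
]


def to_wide(rows):
    """Pivot (geography, category, estimate) triples into one row per geography."""
    columns = []  # category names, in first-seen order
    table = {}    # geography -> {category: estimate}
    for geography, category, estimate in rows:
        if category not in columns:
            columns.append(category)
        table.setdefault(geography, {})[category] = estimate
    return columns, table


def write_csv(columns, table):
    """Serialize the wide table as plain CSV with column names in the header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["geography", *columns])
    for geography, values in table.items():
        # Leave a cell empty when a geography has no estimate for a category.
        writer.writerow([geography, *(values.get(c, "") for c in columns)])
    return buffer.getvalue()


columns, table = to_wide(records)
csv_text = write_csv(columns, table)
print(csv_text)
```

Keeping the column names in the header row is the simplest way to let a description travel with the data; richer metadata, such as table descriptions, would need a companion file in an equally open format.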
Census Bureau, if you made these simple but fundamental changes, you’d break a lot of ingestion pipelines (our own certainly included), but you would make the open data truly open. The ACS aims to representatively survey all of the US, painting a detailed picture of who we are as a country. Let’s make the open data just as comprehensive.
All the best,
Eve + Olga
P.S. If you want to read more, check out our paper on data delivery and the ACS here.
P.P.S. If you'd like to check out the work we've done on the ACS dataset, hop on over to our public data app, Enigma Public.