Public Data and the Challenges of Moving from Information to Insight

Take it from someone who spent their summer wading through public data: even the most open data portal can hide a few surprises.

As Enigma’s Data Journalism Fellow this summer, my task was quite open-ended: find something interesting in a public dataset to write about. Given the number of datasets within Enigma Public, I was never lacking for options. I pursued particular datasets based on what was happening in the news, or simply because they seemed interesting.

As I dug into those datasets, a few things stood out to me that I thought might be worth sharing with an audience interested in public data—and the possibility of making that data better.

1. Disclosure laws are only as effective as our use of the data (or documents) they provide.

The very first piece I wrote this summer focused on General Michael Flynn’s delayed registration as a foreign agent under the Foreign Agents Registration Act, or FARA. The United States has a range of laws that, like FARA, mandate disclosure related to specific kinds of activity: the Lobbying Disclosure Act and the Home Mortgage Disclosure Act, to name two. Each of these laws makes the data and documents from the disclosures available to the public, presumably for further analysis and scrutiny of the provided information.

In researching the coverage of Michael Flynn’s FARA compliance and the Department of Justice’s (DOJ) implementation of the law, I was surprised to find how much importance was given to the simple act of disclosure relative to the actual information provided. Even missing disclosures are not always met with meaningful consequences. The DOJ, by its own internal review, seldom pursued legal action when individuals were not in compliance with FARA. Instead, it sought to elicit the proper disclosure.

The purpose of FARA is not simply to have foreign agents submit additional paperwork. Presumably, it creates an opportunity for the public to understand how foreign parties are influencing U.S. government policy. The data downloads from the DOJ website make it easy to see which countries and individuals are involved, but it takes further digging to find the real value of FARA disclosures, which often seems to be buried in image-based PDFs of supplemental forms.

A portion of a supplemental form submitted in 2017 by the Daschle Group, describing lobbying efforts on behalf of the Taiwan Economic and Cultural Representative Office (TECRO).

The supplemental forms often list the meetings foreign agents have had with members of Congress and members of the press, op-eds they have written, and money they have spent.

That information, one might argue, is the most important for actually answering how foreign actors influence the policies we enact.

A description of an op-ed Michael Flynn’s lobbying firm worked on provided in a supplemental form.

2. Open data does not mean complete data.

While some public datasets/sources provide an overwhelming volume of information to process, others present not quite enough for the kinds of analysis a curious data nerd might be tempted to try.

The Consumer Financial Protection Bureau (CFPB) makes a database of consumer complaints about financial products available to the public, much to the chagrin of big banks. Each complaint is stripped of personally identifiable information (which sometimes includes the last three digits of the consumer’s zip code for sparsely populated areas). Most, but not all, complaints are published to its online database. As a matter of policy, complaints that cannot be verified (in terms of an actual transactional relationship between the complaining consumer and the related institution) or that have been redirected to another federal agency are not published in the database.

These are reasonable choices to make in terms of what should and should not be published, and they have some interesting consequences.

For my analysis, I had a very simple question: what financial product was most frequently the source of consumer complaints? Interestingly, based on the published data, I came to a different conclusion than what was reported in the CFPB’s reports to Congress.

Chart included in the CFPB’s Semi-Annual Report to Congress, noting “Debt Collection” as the most common source of complaints.

My chart, for the same reporting period, showing the most common source of complaints in the published data.

The difference is due to the CFPB’s decisions on what to publish. The published data is a subset of the complete universe of data, and notably, it is not necessarily representative of what that universe actually looks like — at least, not for the question I wanted to answer.
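To make the mechanism concrete, here is a minimal sketch of how a publication filter can change which category comes out on top. Everything in it is an assumption for illustration: the product labels, the counts, and the `published` flag are invented and do not reproduce the CFPB's actual figures or my chart.

```python
from collections import Counter

# Hypothetical complaint records as (product, published) pairs.
# Labels and counts are invented for illustration only.
complaints = (
    [("Product A", True)] * 40
    + [("Product A", False)] * 35   # e.g. unverified or referred elsewhere
    + [("Product B", True)] * 55
    + [("Product B", False)] * 5
    + [("Product C", True)] * 30
    + [("Product C", False)] * 10
)


def top_product(records):
    """Return the (product, count) pair with the most complaints."""
    counts = Counter(product for product, _ in records)
    return counts.most_common(1)[0]


# The full universe of complaints versus the published subset.
all_top = top_product(complaints)
published_top = top_product([r for r in complaints if r[1]])

print("All complaints:      ", all_top)
print("Published complaints:", published_top)
```

In this toy data, Product A leads when every complaint is counted, but Product B leads once the unpublished complaints are dropped, because Product A loses a much larger share of its records to the filter. The same question, asked of the subset, gets a different answer.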

To their credit, representatives from the CFPB were very helpful in answering my questions about this discrepancy. It is nonetheless striking that a member of the public examining the data could come to such a different conclusion than the one provided by the CFPB.

3. Data providers make their data more valuable by telling users about what they have.

One of the most impressive parts of the CFPB’s database is the robust documentation it contains about the data. I had some tools to start making sense of the discrepancy because the CFPB has provided a great deal of transparency with regard to its decision-making about the data.

This, unfortunately, is not true of all data providers. For example, Politiwoops provides a treasure trove of data: the deleted tweets of politicians. But what would make the data even more valuable is some additional information about it: in what order are the tweets returned, or how many results did a particular API query find? Answers to these relatively simple questions can make it easier to understand the universe of data and the sorts of questions the dataset is capable of answering.

Enigma attempts to solve these problems by “metabasing” the public datasets it acquires: the data in Enigma Public is accompanied by annotations about what the columns in the dataset actually mean. And, when that isn’t enough, there’s a link back to the source. The goal of open data enthusiasts should not be simply to have as much data as possible, but also to make that data usable. It’s a vision of accessibility in a more robust sense of the word.

Let's do better.

Public data matters. Even though there is always room for improvement, there are already some amazing examples of public data put to good use.

It’s a troubling moment in American history when it seems that the government is interested in making data less accessible, not more. After all, an informed citizenry is a prerequisite to meaningful, participatory democracy—and public data has proven itself to be a way of achieving that goal. We need to keep making use of the data we have and advocate for ways to make public data more usable in the future.

For those who are interested in being better stewards of public data, let us remember that more is possible.


