Data 101: Metadata

This is the first post in a series covering the fundamentals of operational data management. We’ll be walking through context, linking, liquidity and how these core concepts come together to enable enterprises to put data to work to drive more efficient workflows and generate repeatable insights at scale.

For data to be useful in informing processes or answering questions, it must be contextualized. Context that explains what the data is (how it was generated, collected, and composed) lets you identify relationships across multiple heterogeneous datasets and better understand how to apply the data to answer questions or solve problems in a scalable, repeatable way. Metadata plays a key role in this equation.

What is metadata?

Definition: A set of data [that gives information] about other data; a conceptual representation of knowledge. This information is added to key data fields, though it is not overtly visible to users, to enable machines (and humans) to understand the meaning of information.

Metadata describes the source of data (the technologies and methods) and helps establish a common understanding of the meaning of the data to ensure correct and consistent interpretation and usage of information. It can also encompass details about the usage and transformation of data.

I’m not sure we collect metadata. Is it really needed?

Yes! Metadata is fundamental to understanding data. Moreover, it is essential for leveraging data as an enterprise asset. Beyond being broadly useful for finding relevant information, discovering related resources, verifying insights, and auditing analyses, metadata enables you (or a machine) to do a number of important things:

Reuse domain knowledge: See several different perspectives from the same data
Understand relationships between entities
Leverage algorithms to de-duplicate, link, or match records
Assimilate new data faster
Mine data systematically
Understand how data has changed over time
Draw wider conclusions from data

What exactly does metadata tell you about data?

Context comes in a number of forms, each providing different color around a dataset. For now, we’ll focus on three key layers of metadata: user-contributed, derived, and provenance and lineage.

User-contributed: As the name suggests, this is metadata added by users who are familiar with the data. You can think of this as expert knowledge about the content in the form of annotations that help other users know what the data is and how to use it. This kind of information may include definitions of terms or columns, or other details that may help to identify similar datasets.

Derived: With this type of metadata (mined from each dataset), we are essentially asking, “What can the data tell us about itself?” Derived metadata can encompass information such as size, quality, and types of data, number of records, how often the data is updated, last modified date, anomalies, date ranges, and frequent keywords or tokens.
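To make the idea of derived metadata concrete, here is a minimal Python sketch that mines a few of the properties listed above (record count, data types, missing values, date ranges, and frequent tokens) from a small in-memory dataset. The sample records and the derive_metadata helper are hypothetical, invented purely for illustration; a real profiler would cover many more cases.

```python
from collections import Counter
from datetime import date

# Hypothetical sample records; a real dataset would come from a file or table.
records = [
    {"name": "Ada Lovelace", "city": "London", "signup": date(2021, 3, 1)},
    {"name": "Alan Turing", "city": "London", "signup": date(2021, 7, 15)},
    {"name": "Grace Hopper", "city": None, "signup": date(2022, 1, 9)},
]

def derive_metadata(rows):
    """Mine simple descriptive metadata from a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    meta = {"record_count": len(rows), "columns": {}}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        info = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(values) - len(present),
        }
        # Date columns get a (min, max) range.
        if present and all(isinstance(v, date) for v in present):
            info["range"] = (min(present), max(present))
        # Text columns get their most frequent lowercase tokens.
        if present and all(isinstance(v, str) for v in present):
            tokens = Counter(tok.lower() for v in present for tok in v.split())
            info["top_tokens"] = tokens.most_common(3)
        meta["columns"][col] = info
    return meta

metadata = derive_metadata(records)
print(metadata["record_count"])               # 3
print(metadata["columns"]["city"]["missing"])  # 1
```

Nothing here requires a human to annotate the data; everything is computed from the records themselves, which is what separates derived metadata from the user-contributed kind.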

Provenance and Lineage: This third layer of metadata is tied closely to the creation, transformation, and usage of the data. It encompasses change history, social details around how the data is being used and applied across an organization, and dependencies: Is it a parent dataset? Is it used to create other datasets? Is it part of other datasets? This type of information is particularly important when trying to understand how a change in one dataset could have a much wider impact.
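One way to picture the dependency questions above is as a graph in which each dataset points to the datasets built from it. The Python sketch below walks such a graph to list everything that a change to one dataset could affect. The dataset names and the impacted_by helper are hypothetical; actual lineage tooling would record this information automatically rather than in a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets built directly from it.
downstream = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["daily_revenue", "customer_ltv"],
    "daily_revenue": ["exec_dashboard"],
    "customer_ltv": [],
    "exec_dashboard": [],
}

def impacted_by(dataset, graph):
    """Return every dataset that depends, directly or indirectly, on `dataset`."""
    seen = set()
    queue = deque(graph.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        queue.extend(graph.get(current, []))
    return sorted(seen)

print(impacted_by("clean_orders", downstream))
# ['customer_ltv', 'daily_revenue', 'exec_dashboard']
```

With lineage captured this way, “Is it used to create other datasets?” becomes a simple lookup instead of an investigation.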

In subsequent posts, we will explore additional layers of information such as ontologies and the mapping of real-world objects within datasets through specific relationships.

So, metadata helps you understand what you’re looking at.

While the meaning of data may seem obvious when presented within the system that collected it, if you were to look at that same data in a different setting, it might be challenging to understand.

Metadata describes what the data is so a human or machine can a) make sense of the information and b) identify and understand relationships that exist between concepts. This is particularly important when you’re integrating data from distinct sources, which may present equivalent concepts or data points in different ways.

Let’s look at phone numbers, for example. It’s likely you’ll encounter many different formats of telephone numbers out in the world. Take: 1 (212) 222 2222, +1 212 222 2222, or 1 212 222-2222. These are written in slightly different formats, but any person who has used a phone would intuitively recognize them as the same number and would similarly recognize that 22-22 is not a valid phone number. Computers, however, don’t have the same intuition. Here, we could rely on semantic data management (the process of mapping a dataset or a type of data to a real-world object or outcome) to map all of the formats above to the idea of “phone number.”
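As a rough illustration of giving a computer that intuition, the Python sketch below reduces differently formatted strings to a single canonical form and rejects inputs that cannot be phone numbers. It assumes North American numbers only (country code 1 followed by ten digits), and the normalize_phone helper is hypothetical; production systems would typically rely on a dedicated phone-number library instead.

```python
import re

def normalize_phone(raw, default_country="1"):
    """Return a canonical '+<country><10 digits>' string, or None if invalid.

    Assumption for this sketch: North American numbers only.
    """
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, parentheses, '+'
    if len(digits) == 10:
        digits = default_country + digits  # country code was omitted
    if len(digits) == 11 and digits.startswith(default_country):
        return "+" + digits
    return None  # too short, too long, or wrong country code

for raw in ["1 (212) 222 2222", "1 212 222-2222", "22-22"]:
    print(raw, "->", normalize_phone(raw))
# 1 (212) 222 2222 -> +12122222222
# 1 212 222-2222 -> +12122222222
# 22-22 -> None
```

Once every variant maps to the same canonical value, records from different sources can be recognized as referring to the same “phone number” concept.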

Or, imagine trying to transfer data from a form into a spreadsheet. The spreadsheet asks for the same information as the form, but the header names of each column are completely different from the field names on the form. Without semantic types (labels that say which real-world concept each field holds), would you know where to put the information? Would someone else looking at the data know that the form and the spreadsheet contained the exact same information?
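The form-to-spreadsheet problem can be sketched in a few lines of Python. Here each form field and each spreadsheet column is annotated with a semantic type, and the two are paired by type rather than by name. All field names, column names, and type labels below are hypothetical, and the sketch assumes each semantic type appears at most once on each side.

```python
# Hypothetical semantic-type annotations for a form and a spreadsheet.
form_field_types = {
    "Full name": "person_name",
    "Mobile": "phone_number",
    "E-mail address": "email",
}
sheet_column_types = {
    "contact": "person_name",
    "tel": "phone_number",
    "email_addr": "email",
}

def map_fields(source_types, target_types):
    """Pair each source field with the target column sharing its semantic type."""
    target_by_type = {sem_type: col for col, sem_type in target_types.items()}
    return {
        field: target_by_type[sem_type]
        for field, sem_type in source_types.items()
        if sem_type in target_by_type
    }

def transfer(entry, mapping):
    """Re-key one form entry so it fits the spreadsheet's columns."""
    return {mapping[f]: value for f, value in entry.items() if f in mapping}

mapping = map_fields(form_field_types, sheet_column_types)
row = transfer({"Full name": "Ada Lovelace", "Mobile": "1 212 222-2222"}, mapping)
print(row)
# {'contact': 'Ada Lovelace', 'tel': '1 212 222-2222'}
```

The names “Mobile” and “tel” never have to match; the shared semantic type is what tells a person or a program that they hold the same kind of information.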

Sounds necessary for analyzing datasets from different sources.

Absolutely. Metadata enables people and software to share a common understanding of the structure of information, making it much simpler to extract, aggregate, and link information from different sources and systems. It gives a person or machine that may not be familiar with the data the context needed to interpret it. This is particularly valuable because it allows someone (or something) to use a dataset without having to refer back to the source system. It eliminates the need to integrate numerous systems or applications when analyzing information across multiple datasets.

In other words, metadata helps make data reusable for any number of purposes: a user can rely on metadata to re-contextualize data to answer a specific question without going back to the source system to get only a small subset of the data.

It offers a consistent view for information held in data siloed across teams or organizations, thus providing a greater body of knowledge from which to uncover answers or identify trends.

Big picture, what does metadata mean for operationalizing data?

Metadata enables you to apply knowledge and insights in a scalable, repeatable way. Think of it as a layer of information serving as the connective tissue for enterprise data analysis — a layer that makes it possible to create a more flexible and transparent framework.

This layer of embedded intelligence can then help power and optimize your data infrastructure, forming a key part of a feedback loop in which exposure to additional data will further enhance the metadata. In other words, your knowledge base will continuously grow (and grow more intelligent) as you assimilate new data from different sources.