An ontology is a way to describe the world. From one perspective, language is an ontology: a set of labels that give meaning to real-world things.
But if you don't speak the same language as another person, your communication will be reduced to less descriptive forms, like “talking with your hands.” You might be able to convey simple ideas, but as tasks become more complex, ambiguities become more common. Is that hand signal the number two, a rabbit, or the peace sign?
These ambiguities are a major part of why we find it amusing to play games like Pictionary or charades. We interpret the information given and fill in the gaps using context clues or our sense of humor and imagination. In a gameplay setting, it may be amusing to misinterpret that silly pose of a friend, or a poorly drawn horse. However, when collaborating to solve a complex problem, these constraints wreak havoc on efficient operations, especially when there is little coordination between parties. The path towards many failures is paved with ambiguities, misunderstandings, and inconsistent representations of data.
An ontology solves this problem by creating a shared vocabulary through which you can describe the semantics of your data and build applications.₁ By making your applications depend on an ontology as opposed to raw data columns, you create an abstraction that lets you flexibly re-use your applications and your data across different data sources and use cases.
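As a minimal sketch of that abstraction, the function below is written against an ontology term rather than a raw column name, so the same logic runs on two differently named datasets. The term "company.name", the mappings, and the sample rows are all hypothetical and only illustrate the idea.

```python
# Hypothetical per-dataset mappings: ontology term -> raw column name.
LICENSES_MAPPING = {"company.name": "org_nm"}
FILINGS_MAPPING = {"company.name": "company_name"}


def normalized_company_names(rows, mapping):
    """Application logic written against the ontology term, not a raw column."""
    column = mapping["company.name"]
    # Skip rows where the value is missing or empty, then normalize the rest.
    return sorted({row[column].strip().lower() for row in rows if row.get(column)})


# Made-up sample rows from two sources with different column names.
licenses = [{"org_nm": "Acme Corp "}, {"org_nm": None}]
filings = [{"company_name": "acme corp"}, {"company_name": "Globex"}]

print(normalized_company_names(licenses, LICENSES_MAPPING))
print(normalized_company_names(filings, FILINGS_MAPPING))
```

Supporting a new data source then means writing a new mapping, not a new application.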
We are living in a world of information overload, and it's easier than ever to create information; sometimes creating it is even mandated by law. How do we best make use of this information? If I'm searching through multiple data sources, each from different creators, how can I be sure that columns in one dataset correspond to columns in another? You could write a standards guide to ensure everyone creates data with consistent descriptive metadata, but people are still prone to typos and other errors.
The dutiful secretary or analyst—logging data in a spreadsheet—is likely to name columns in ways they find meaningful to others. However, they may also try to save themselves a few keystrokes and drop letters from column names. This could lead to something like "org_nm", which really means "organization_name" to them—or was it "organism_name"? Is that organization like a company or a chess club? How do I ensure that when one spreadsheet has a column named "org_nm" it means the same thing as another spreadsheet's "company_name"? Are those names accurate?
This matters significantly when you’re trying to make use of multiple datasets to piece together a more complete picture of the world. It may not seem like a big deal on a handful of datasets, but when it takes 50 to 100 or more datasets to get a complete picture and the datasets change drastically over time, it demands a more robust solution than solely a human in the loop.
When you are sitting atop thousands of datasets from many different sources—like Enigma is—you have to start to ask yourself questions like:
- Where are all the companies in the data?
- Of those companies, what are their addresses?
- Is that the mailing address or the headquarters address?
- How do we know this, and how can we know this automatically?
By simply labelling the columns in the datasets and their explicit relations to other columns, we can take a leap (not perfect, but still epic) toward answering these more semantically rich queries that span N datasets, and toward disambiguating references to entities of the same type.
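To make the labelling idea concrete, here is a small sketch in which every (dataset, column) pair carries an ontology label, and one lookup answers "where are all the companies?" across datasets. The dataset names, column names, and dotted term names are hypothetical, not an actual Enigma schema.

```python
from collections import defaultdict

# Hypothetical labels: (dataset, column) -> ontology term.
COLUMN_LABELS = {
    ("business_licenses", "org_nm"): "company.name",
    ("business_licenses", "mail_addr"): "company.mailing_address",
    ("sec_filings", "company_name"): "company.name",
    ("sec_filings", "hq_address"): "company.headquarters_address",
    ("field_samples", "org_nm"): "organism.name",
}


def columns_for(term, labels=COLUMN_LABELS):
    """Return {dataset: [columns]} for columns labelled with `term` or a sub-term of it."""
    found = defaultdict(list)
    for (dataset, column), label in labels.items():
        if label == term or label.startswith(term + "."):
            found[dataset].append(column)
    return dict(found)


company_names = columns_for("company.name")
company_columns = columns_for("company")
print(company_names)
print(company_columns)
```

Note that the two "org_nm" columns are no longer ambiguous: one is labelled as a company name and the other as an organism name, and the mailing and headquarters addresses are distinct terms.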
When one organization is publishing results or services that are expected to be used by others, it is important that others know precisely what is meant when a service refers to Sleep Disorders, and which disorders that includes. If an individual reports they’ve experienced a rash on their foot as a result of a new medication and another individual reports they’ve experienced excessive dry skin, how does a rash relate to dry skin? What is the classification of the drug? Is it a Foot Cream or is it a Proton Pump Inhibitor?
When a regulation is issued for a specific category of drugs, how do we know my company’s drug is actually under regulation now? As far as I know, this is only feasible through consistent classification of drugs, side effects, medical procedures and other entities. Luckily, there has been a working group maintaining a freely accessible medical ontology called MedDRA since around 1999.
Synthetic Bad Data:
When an organization needs to build software that operates on data it might not be able to view, there needs to be a way to “mock” the data. One could write out a bunch of fake datasets and run tests to ensure the application works as expected amidst this particular fake data. This works, but when you’re building many applications that each require some degree of fake data and all your applications need to handle a certain category of bad data, it becomes cumbersome to maintain.
One way to alleviate this is to use your ontology to generate datasets in a semantically consistent manner, abiding by the relationships and types defined in your ontology. You can mimic the kinds of issues you might see in the data by defining a few different categories of bad data as a kind of noise function in your data-generating process. What happens to my application when there are 10% null values? What about when there are non-alphanumeric characters in weird places? What about different representations of the same company name? When we start throwing things like “company” into the mix, or any indication of something that’s not just a transformation on a primitive data type, we need to know what we’re talking about.
Additionally, the way a company string can be bad is very different from how a location string can be bad. By using an ontology to link your applications, you are able to have multiple processes independently contributing new capabilities on a per-entity basis. As folks discover new ways that raw data can be dirty, those learnings can help make a synthetic bad data generating system more ontologically aware, further empowering any applications that would rely on that entity’s cleanliness.₂
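A rough sketch of such a generator, assuming a toy ontology with two entity types, might look like the following. The entity names, sample values, noise functions, and rate parameters are all hypothetical; the point is only that each entity type gets its own kind of dirt, while nulls are applied uniformly.

```python
import random

# Hypothetical ontology fragment: entity type -> clean example values.
CLEAN_VALUES = {
    "company.name": ["Acme Corporation", "Globex Incorporated"],
    "location.city": ["New York", "San Francisco"],
}


def company_noise(value):
    """Company-specific dirt: swap the legal suffix for a common abbreviation."""
    swaps = {"Corporation": "Corp.", "Incorporated": "Inc"}
    for full, short in swaps.items():
        if value.endswith(full):
            return value[: -len(full)] + short
    return value


def location_noise(value):
    """Location-specific dirt: change the case and add stray leading whitespace."""
    return "  " + value.upper()


# Each entity type registers its own noise function.
NOISE_BY_ENTITY = {"company.name": company_noise, "location.city": location_noise}


def generate_column(entity, n_rows, null_rate=0.1, noise_rate=0.3, seed=0):
    """Generate n_rows values for an entity, with nulls and entity-aware noise."""
    rng = random.Random(seed)  # seeded so test fixtures are reproducible
    rows = []
    for _ in range(n_rows):
        value = rng.choice(CLEAN_VALUES[entity])
        roll = rng.random()
        if roll < null_rate:
            rows.append(None)
        elif roll < null_rate + noise_rate:
            rows.append(NOISE_BY_ENTITY[entity](value))
        else:
            rows.append(value)
    return rows


print(generate_column("company.name", 5))
```

When someone discovers a new way that company names go wrong, they only need to extend the company noise function, and every application tested against synthetic company data benefits.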
“Any enterprise CEO really ought to be able to ask a question that involves connecting data across the organization, be able to run a company effectively, and especially to be able to respond to unexpected events. Most organizations are missing this ability to connect all the data together.” - Sir Tim Berners-Lee
When creating or using data in an organization, enforcing a consistent vocabulary allows for serendipitous innovations that may not have been fathomable before the data was linked by its vocabulary. You can start to ask questions of the data that were previously not answerable, and the time it takes to answer these questions drops significantly.
Creating ontologies, mapping them to datasets and building ontology-driven applications are ways to prevent miscommunications stemming from schema inconsistencies. They also allow for re-usable applications that operate on entities and their relations, instead of specific rows and columns. Your ontology mapping and definitions become queryable metadata that allow for enterprise-wide inventories of applications and which entities they depend on.
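As an illustration of treating those definitions as queryable metadata, the sketch below keeps a registry of which ontology entities each application depends on and answers a simple inventory question. The application names and entity terms are hypothetical.

```python
# Hypothetical registry: application -> ontology entities it depends on.
APP_DEPENDENCIES = {
    "address_deduper": {"company.mailing_address", "company.headquarters_address"},
    "company_matcher": {"company.name", "company.headquarters_address"},
    "drug_classifier": {"drug.name"},
}


def apps_depending_on(entity, registry=APP_DEPENDENCIES):
    """Return the sorted names of applications that depend on the given entity."""
    return sorted(app for app, entities in registry.items() if entity in entities)


print(apps_depending_on("company.headquarters_address"))
```

The same registry can be inverted to ask which entities have no applications yet, or which applications are affected when an entity's definition changes.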
Making sense of data is more important today than ever. Keep an eye on our blog for the next post in this series on how to make an ontology based application.
1. Semantic Web:
Sir Tim Berners-Lee, the inventor of the World Wide Web, mentioned Linked Open Data as the next frontier of the web. Known as the Semantic Web movement, this effort has been gaining momentum since the early days of the Web. Like an ontology, the Web was built on a consistent vocabulary (in this case, a protocol), which allows a web browser on your computer to render web pages and know what to do when you click a link.
Here are some useful links to learn more about the semantic web and how ontology-linked data is already being used on the Internet:
2. Synthetic Data:
There are also open source tools that can help with generating fake data today; however, they do not include the ability to fake bad data. You can also go a surprisingly long way with a recurrent neural network: