This is the second post in a series on determining duplicate records within data of unprecedented scale and heterogeneity. The following posts will outline Enigma’s methodology, presenting specific examples of our work with adverse events data to enable more accurate detection and evaluation of patient safety risks.
As we noted in Addressing Duplicate Adverse Drug Event Data at Scale, a number of significant challenges exist in the Adverse Events (AE) data world, making analysis difficult. In this post, we dive into detail about the data we work with and the process to clean, normalize, and standardize it in preparation for de-duplication.
Enigma applies our duplicate detection approach to AE data from multiple sources: the FDA Adverse Event Reporting System (LAERS: 2004-2011, FAERS: 2012-Present), the World Health Organization’s (WHO) VigiBase (1968-Present), and private case data.
We prepare the raw data through a series of cleaning and normalization steps, and we enrich it with fields from various dictionaries. We also standardize the schemas of these data sources into a common format, which lets us provide a holistic view of the raw data through a series of joins. These standardization efforts not only make it possible for analysts to navigate all of this data from one place, but also prepare the data for our duplicate detection pipeline.
In the case of the FDA Adverse Events data (AERS), our data preparation work allows us to identify unique records across quarters of data that are released separately. It also enables the detection of “true” duplicates or exact matches between case reports. We will dive into these issues and our solutions in greater detail throughout the blog series.
With Concourse, our proprietary data operations platform, we ingest the raw data and relevant data dictionaries. This ingestion process is automated and refreshed immediately as the source is updated. Upon ingestion, we streamline numerous data cleaning and standardization efforts that facilitate more accurate linking across cases. This work includes:
- Regularizing fields such as dates, countries, and age and weight units into a standard format and common units.
- Basic cleaning of text strings: removing unnecessary whitespace and redundant punctuation, lowercasing, etc.
- Matching terminology codes with their descriptions from different sources and documentation.
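To make the first two steps concrete, here is a minimal sketch of what such normalization helpers might look like. It is an illustration only, not the Concourse implementation: the function names, the unit codes, and the list of accepted date formats are all assumptions made for this example.

```python
import re
from datetime import datetime

# Hypothetical unit tables; a real pipeline would cover every unit code
# that appears in each source.
AGE_TO_YEARS = {"yr": 1.0, "mon": 1 / 12, "wk": 1 / 52, "dy": 1 / 365}
WEIGHT_TO_KG = {"kg": 1.0, "lbs": 0.45359237, "gms": 0.001}

# Assumed input date layouts, tried in order.
DATE_FORMATS = ("%Y%m%d", "%Y-%m-%d", "%m/%d/%Y")


def clean_text(value):
    """Lowercase, collapse whitespace, and squeeze repeated punctuation."""
    text = value.strip().lower()
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"([.,;:!?\-])\1+", r"\1", text)
    return text


def normalize_date(value):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def to_years(age, unit):
    """Convert an age to years; None if the unit code is not recognized."""
    factor = AGE_TO_YEARS.get(unit.strip().lower())
    return None if factor is None else round(age * factor, 2)


def to_kg(weight, unit):
    """Convert a weight to kilograms; None if the unit code is not recognized."""
    factor = WEIGHT_TO_KG.get(unit.strip().lower())
    return None if factor is None else round(weight * factor, 2)
```

Returning `None` for unparseable values, rather than guessing, keeps bad inputs visible so they can be reviewed instead of silently distorting later matching.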
We also reference authoritative sources for drug names and side effect categorization to standardize these fields by:
- Cleaning and standardizing side effect names according to MedDRA classifications.
- Appending full MedDRA ontologies for greater granularity.
MedDRA is the data dictionary used by clinicians to record side effects data. It is organized as a taxonomy so that side effects can be coded at different levels of specificity. MedDRA is updated twice a year, and terms can be reclassified under different branches of the tree with each new release. In AERS, the data is collected at the Preferred Term (PT) level, the second most granular level of specificity. VigiBase, on the other hand, uses its own coding standard for side effects (WHO-ART). However, we are able to obtain the corresponding MedDRA LLT (Lowest Level Term) and MedDRA PT terms using an existing crosswalk and the MedDRA_ID and Adr_ID fields provided in VigiBase.
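As a rough illustration of the crosswalk step, the sketch below attaches MedDRA LLT and PT terms to VigiBase-style reports by looking up each report's Adr_ID. The crosswalk layout, the lowercase key names, and every ID and term shown are invented placeholders for this example; the actual crosswalk format is not described in this post.

```python
# Hypothetical crosswalk: Adr_ID (WHO-ART) -> MedDRA_ID plus LLT and PT.
# All IDs and terms below are placeholders, not real dictionary entries.
CROSSWALK = {
    "A001": {"meddra_id": "M100", "llt": "feeling queasy", "pt": "nausea"},
    "A002": {"meddra_id": "M200", "llt": "head pain", "pt": "headache"},
}


def attach_meddra(reports, crosswalk):
    """Return copies of the reports with MedDRA terms added where a match exists.

    Reports whose Adr_ID is missing from the crosswalk keep None in the
    MedDRA fields so that unmapped codes remain easy to find.
    """
    enriched = []
    for report in reports:
        match = crosswalk.get(report["adr_id"], {})
        enriched.append(
            {
                **report,
                "meddra_id": match.get("meddra_id"),
                "meddra_llt": match.get("llt"),
                "meddra_pt": match.get("pt"),
            }
        )
    return enriched


reports = [
    {"case_id": 1, "adr_id": "A001"},
    {"case_id": 2, "adr_id": "A999"},  # not in the crosswalk
]
for row in attach_meddra(reports, CROSSWALK):
    print(row)
```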
We set out to achieve a complete MedDRA ontology hierarchy to append to our ultimate view of these datasets. To do so, we first normalize MedDRA dictionaries across the AERS data. This requires mapping the MedDRA PT terms found within the REAC table to the most recent MedDRA dictionary in order to extract the higher-level fields associated with these terms: MedDRA HLT (High Level Term), MedDRA HLGT (High Level Group Term), and MedDRA SOC (System Organ Class).
By naively string matching the PT terms in the full batch of FDA data against the latest version of MedDRA, we’ve achieved an adverse events ontology coverage rate of almost 95%.
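A naive string match of this kind can be sketched in a few lines. The code below is an assumption-laden illustration rather than our production pipeline: the row layout and function names are hypothetical, and the dictionary rows are a tiny illustrative sample, not an extract of an official MedDRA release.

```python
def normalize(term):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def build_index(meddra_rows):
    """Index dictionary rows by their normalized PT string."""
    return {normalize(row["pt"]): row for row in meddra_rows}


def append_hierarchy(reac_pts, index):
    """Attach HLT, HLGT, and SOC to each PT and report the matched fraction."""
    enriched, matched = [], 0
    for pt in reac_pts:
        row = index.get(normalize(pt))
        if row is not None:
            matched += 1
        enriched.append(
            {
                "pt": pt,
                "hlt": row["hlt"] if row else None,
                "hlgt": row["hlgt"] if row else None,
                "soc": row["soc"] if row else None,
            }
        )
    coverage = matched / len(reac_pts) if reac_pts else 0.0
    return enriched, coverage


# Illustrative sample rows only.
MEDDRA_ROWS = [
    {"pt": "Nausea", "hlt": "Nausea and vomiting symptoms",
     "hlgt": "Gastrointestinal signs and symptoms",
     "soc": "Gastrointestinal disorders"},
    {"pt": "Headache", "hlt": "Headaches NEC",
     "hlgt": "Headaches", "soc": "Nervous system disorders"},
]

index = build_index(MEDDRA_ROWS)
rows, coverage = append_hierarchy(
    ["Nausea", "HEADACHE ", "nausea", "some retired term"], index
)
print(f"coverage: {coverage:.0%}")
```

Terms that fail an exact match after normalization, such as PTs that were renamed or retired in a later MedDRA release, are left with empty hierarchy fields here. They make up the remainder that a naive match cannot cover.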
Nick Becker, Kelvin Chan, Olga Ianiuk, Alexis Mikaelian, Urvish Parikh, and Austin Webb contributed to the approach (and efforts) outlined in this blog series.