Addressing Duplicate Adverse Drug Event Data at Scale


This is the first post in a series on determining duplicate records within data of unprecedented scale and heterogeneity. The following posts will outline Enigma’s methodology, presenting specific examples of our work with adverse events data to enable more accurate detection and evaluation of patient safety risks.

Efficient and reliable pharmacovigilance (PV) processes are critical for allowing today’s pharmaceutical and biotechnology companies to accurately understand and respond to the adverse events associated with their drugs. As such, these processes have important implications for managing patient safety, compliance costs, and business or reputation risks. For reference, an adverse event (AE) is any untoward medical occurrence in a patient or clinical investigation subject who has been administered a pharmaceutical product; the occurrence does not necessarily have a causal relationship with the treatment.

Unfortunately, several significant challenges make analyzing AE data very difficult. These include: (1) inconsistent granularity across data, caused by incomplete data entry or transcription errors; (2) stale data dictionaries that are neither updated nor standardized across different sources; and (3) multiple reporting sources that propagate inconsistencies and produce redundant or duplicate reports.

The messiness and duplication of AE reporting today impede accurate analysis and detection of drug trends and signals. To improve these capabilities, and in turn patient safety and manufacturing quality, a cleaned, de-duplicated, and holistic view of adverse events is required.

To address the challenges in AE data and pharmacovigilance workflows, Enigma provides a single source of truth for AE data, taking a layered approach to standardizing, harmonizing, and detecting duplicates across AE data sources at scale.

In this blog series we will review our methodology, which combines a sequence of techniques that clean, format, and integrate AE data from public and private sources, and then probabilistically identify duplicate records both within and across these sources to power more accurate detection and evaluation of safety risks. At a high level, we apply successively more precise filters, both in how the data is processed to detect duplicates and in how the results are presented to the end user for verification.

Over the coming weeks we will dive into de-duplication of AE data, beginning with our approach to standardization and preparation followed by a comprehensive look into each component of our methodology. Additionally, we will share applications of our current work and initial results. Stay tuned.

A preview of what is to come

As an overview, our methodology begins with a series of data transformations and cleaning steps, within and across data sources, that map all AE data to a standardized ontology. We then seek to identify likely duplicates by first using locality-sensitive hashing (LSH) to reduce the duplicate search space. Next, we apply a TF-IDF-based algorithm, which we call Term-Pair Set Adjustment, to all pairs of records within the search spaces defined by LSH. This generates features for the classification task of determining duplicate record pairs. These features indicate similarity: they are calculated from shared and unshared terms and adjusted for the relative frequencies of those terms in the data. Finally, the features are fed to a Random Forest classifier, which outputs the probability that a given pair of records is a duplicate.
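To make the shape of this pipeline concrete, here is a minimal Python sketch of its first two stages using only the standard library. It is an illustration under stated assumptions, not our production implementation: the toy records, the MinHash banding parameters, and the IDF-weighted overlap features are all hypothetical stand-ins, and the feature function is a generic substitute for Term-Pair Set Adjustment, whose details are covered later in the series.

```python
import hashlib
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; real AE reports carry many more fields.
RECORDS = {
    "r1": "female 54 aspirin nausea headache",
    "r2": "female 54 aspirin nausea headache dizziness",
    "r3": "male 71 warfarin bleeding",
}

NUM_HASHES = 32  # MinHash signature length (assumed value)
BANDS = 16       # number of LSH bands the signature is split into (assumed value)


def tokens(text):
    """Split a record into a set of lowercase terms."""
    return set(text.lower().split())


def _hash(seed, term):
    """Deterministic 64-bit hash of a term under a given seed."""
    digest = hashlib.md5(f"{seed}:{term}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(terms):
    """MinHash signature: the minimum hash of the term set for each seed."""
    return tuple(min(_hash(seed, t) for t in terms) for seed in range(NUM_HASHES))


def candidate_pairs(records):
    """LSH banding: records that agree on any full band become a candidate pair."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash(tokens(text))
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def idf_weights(records):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(records)
    df = defaultdict(int)
    for text in records.values():
        for t in tokens(text):
            df[t] += 1
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def pair_features(a, b, idf):
    """IDF-weighted mass of shared and unshared terms for one record pair."""
    ta, tb = tokens(a), tokens(b)
    shared = sum(idf[t] for t in ta & tb)
    unshared = sum(idf[t] for t in ta ^ tb)
    total = shared + unshared
    return {
        "shared": shared,
        "unshared": unshared,
        "ratio": shared / total if total else 0.0,
    }


if __name__ == "__main__":
    idf = idf_weights(RECORDS)
    for left, right in sorted(candidate_pairs(RECORDS)):
        print(left, right, pair_features(RECORDS[left], RECORDS[right], idf))
```

Only pairs that land in a shared LSH bucket are ever scored, which is what keeps the comparison step tractable at scale. In a full pipeline, feature vectors like these, computed for labeled duplicate and non-duplicate pairs, would be used to train a classifier such as a Random Forest; that step is omitted here.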

Ready to explore some adverse events data? Visit Enigma Public to take a look at our raw FDA Adverse Event Reporting System (FAERS) data or curated AE collection. You might also enjoy A Prescription for Healthcare Data, our interactive visualization tracing the drug life cycles of more than eighty commonly prescribed drugs.


Nick Becker, Kelvin Chan, Olga Ianiuk, Alexis Mikaelian, Urvish Parikh, and Austin Webb contributed to the approach (and efforts) outlined in this blog series.