Validation is a critical part of working with data. In data science, it’s how we check our work across huge datasets, confirming that our output is accurate and high quality.
There’s no set formula for validating a dataset. The right approach depends on your company’s business model and stage of growth, which makes it tricky to pin down.
Our work in data science at Enigma drives top-line revenue. Validation isn’t just a nice-to-have process behind the scenes: it’s a core part of our data science workflow. It’s important that we build and maintain a scalable validation process we’re confident in.
Getting here has been a journey, and we continue to refine our processes.
As Enigma transitioned into the small business data company we are today, our validation process changed and matured along with it.
As the dataset grew, new customers became interested in different subsets of the data. Rather than tailoring validation to each customer, the team began experimenting with a consistent testing sample.
Here are a few key lessons we’ve picked up along our validation journey.
1. Keep a customer focus
From the beginning, keeping the customer experience front and center has been key.
Early on, when we’d release an updated dataset, the team looked at many of the indicators customers would see in the data and generated hypotheses to investigate.
Today, we validate during research. We define a sample that we’re comfortable with, making sure we’re validating against what our clients are actually seeing, not just the baseline distribution.
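As a rough illustration, here’s a minimal sketch of how such a sample might be drawn, weighting records by how often customers actually retrieve them. The DataFrame layout, the record_id column, and the query_log are hypothetical stand-ins, not our actual schema.

```python
import pandas as pd

def build_validation_sample(records: pd.DataFrame,
                            query_log: pd.DataFrame,
                            n: int = 1000,
                            seed: int = 42) -> pd.DataFrame:
    """Draw a validation sample weighted by customer usage,
    not just the baseline distribution of the dataset."""
    # How often does each record actually show up in customer queries?
    # (query_log is assumed to have a 'record_id' column.)
    usage = query_log["record_id"].value_counts()

    # Records never queried keep a small baseline weight so the
    # long tail is still represented.
    return records.sample(
        n=n,
        weights=records["record_id"].map(usage).fillna(1),
        random_state=seed,  # fixed seed keeps the sample reproducible
    )
```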
Having a customer focus also means validating as far toward the end of the pipeline as possible. Before anything moves into production, we check the end of the line to make sure we’re seeing the data through our customers’ eyes.
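In practice, checking the end of the line can be as simple as a handful of assertions run against the final, customer-facing table before release. A minimal sketch, with illustrative column names (business_id, name, state) rather than our real schema:

```python
import pandas as pd

VALID_STATES = {"NY", "CA", "TX"}  # truncated list, for illustration only

def check_final_output(df: pd.DataFrame) -> list[str]:
    """Run lightweight checks on the customer-facing output,
    not on an intermediate table upstream."""
    failures = []
    if df["business_id"].duplicated().any():
        failures.append("duplicate business_id values in final output")
    if df["name"].isna().mean() > 0.01:
        failures.append("more than 1% of records missing a name")
    if not df["state"].isin(VALID_STATES).all():
        failures.append("unexpected state codes in final output")
    return failures  # an empty list means the release can proceed
```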
In a B2B setting, expect each customer to differ in which samples of the data they care about. Polling your customers can help.
2. Make your validation repeatable
Repeatability will depend on your organization’s growth stage. If you’re a startup still working toward product-market fit, it can be hard to know what is repeatable.
Approaching validation from scratch each time leads to wasted effort and forces you to revalidate work you’ve already done. The customer market can also change, affecting which samples matter.
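One way to make a sample repeatable is to persist everything needed to regenerate it: seed, size, segment, version. A sketch under those assumptions (the SampleSpec fields are hypothetical, not our internal format):

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SampleSpec:
    """Everything needed to regenerate a validation sample exactly."""
    version: str
    seed: int
    size: int
    segment: str  # e.g. a customer-relevant slice; name is hypothetical

def save_spec(spec: SampleSpec, path: str) -> None:
    # Persisting the spec next to the results means next release we can
    # reproduce the sample instead of rebuilding validation from scratch.
    with open(path, "w") as f:
        json.dump(asdict(spec), f, indent=2)

save_spec(SampleSpec(version="2024-q1", seed=7, size=5000,
                     segment="ny_restaurants"), "sample_spec.json")
```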
A major challenge with validation is scale: we’re looking at hundreds of millions of data pairs. Working across the entire dataset wouldn’t be feasible for each validation run, so instead we had to figure out how to cluster.
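Once pairs are clustered, validation can sample a fixed number from each cluster rather than scanning everything. A minimal stratified-sampling sketch; the match_score_bucket column is a hypothetical cluster label:

```python
import pandas as pd

def stratified_sample(pairs: pd.DataFrame,
                      cluster_col: str = "match_score_bucket",
                      per_cluster: int = 200,
                      seed: int = 7) -> pd.DataFrame:
    """Take up to `per_cluster` pairs from every cluster, so rare
    clusters still show up in the validation sample."""
    return (
        pairs.groupby(cluster_col, group_keys=False)
             .apply(lambda g: g.sample(min(len(g), per_cluster),
                                       random_state=seed))
    )
```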
We’ve also introduced more automation into our validation process, such as outsourced and automated data labeling.
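Automated labeling can be as simple as rules that settle the easy cases and route everything ambiguous to human (or outsourced) reviewers. A sketch with made-up similarity features and thresholds:

```python
def auto_label(pair: dict) -> str | None:
    """Label the easy cases automatically; return None to send the
    pair to the manual labeling queue. Thresholds are illustrative."""
    if pair["name_similarity"] > 0.95 and pair["address_similarity"] > 0.95:
        return "match"
    if pair["name_similarity"] < 0.20:
        return "non_match"
    return None  # ambiguous: a human should decide

candidate_pairs = [
    {"name_similarity": 0.98, "address_similarity": 0.97},
    {"name_similarity": 0.55, "address_similarity": 0.40},
]
labels = [auto_label(p) for p in candidate_pairs]  # ['match', None]
```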
3. Document, document, document
As you begin to crystallize a customer-focused, repeatable process, it’s important to invest in clear product documentation on key data science decisions.
Without documentation, you’ll get ad hoc decisions and internal contradictions. Efficiency begins with stringent guidelines that are documented and rooted in customer investigation.
Showing your work also matters to external stakeholders. We’re very open about our data models. Customers use insights from our data to make decisions, so that transparency is important: it builds trust.