In every software engineering problem I’ve worked on, I’ve noticed a recurring tension between two highly desirable properties: flexibility and robustness. But in each situation, this tension manifests itself in different ways.
At Enigma, our goal is to build a complete set of authoritative profiles of U.S. small businesses. This requires us to integrate hundreds of different data sets and to continually create and refine machine learning–based algorithms in a highly research-driven development process.
For us, flexibility means being able to rapidly experiment with and validate the new data sources and algorithms we're continually integrating. Robustness means running a complex data pipeline at scale while spending minimal time on maintenance.
The tension between flexibility and robustness arises when we’re excited about a potential research breakthrough and want to rapidly test out the effects it has on our data asset. We want to quickly deploy the change in an isolated data pipeline and measure the results against what’s currently running in our production environment.
In this post, I’ll discuss the various approaches we tried and why we’re now using data branching to address this tension.
We initially tried to resolve this by having distinct production and dev pipelines. We deployed code to the dev pipeline from separate git branches and maintained copies of the data in distinct namespaces. This solution delivered a high degree of robustness, but at the cost of the flexibility we needed: keeping full copies of the data in separate namespaces was slow and storage-hungry, and it was hard to keep concurrent experiments isolated from one another and from production.
Earlier this year, we began to explore lakeFS for data branching as a way to resolve this tension. Data branching overlays a git-like abstraction on top of the physical data. A data branch is a logically isolated view of the data that can be modified and merged into other branches.
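To make the branching model concrete, here's a rough sketch of that lifecycle against the lakeFS REST API. The server URL, repository, and branch names are hypothetical, and the endpoint paths and payloads should be checked against the lakeFS API reference for your server version:

```python
# A minimal sketch of the branch/commit/merge lifecycle via the lakeFS REST API.
# Repository, branch names, and server URL are hypothetical placeholders.
import requests

LAKEFS = "https://lakefs.example.com/api/v1"   # hypothetical lakeFS server
AUTH = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")  # lakeFS credentials
REPO = "business-profiles"                     # hypothetical repository

# 1. Branch off production: a logically isolated view of the data in `main`.
requests.post(
    f"{LAKEFS}/repositories/{REPO}/branches",
    json={"name": "experiment-new-matching", "source": "main"},
    auth=AUTH,
).raise_for_status()

# ... run the experimental pipeline, writing its outputs to the new branch ...

# 2. Commit the results so the experiment is a named, reviewable snapshot.
requests.post(
    f"{LAKEFS}/repositories/{REPO}/branches/experiment-new-matching/commits",
    json={"message": "Test new entity-matching algorithm"},
    auth=AUTH,
).raise_for_status()

# 3. If the results hold up, merge the branch back into production.
requests.post(
    f"{LAKEFS}/repositories/{REPO}/refs/experiment-new-matching/merge/main",
    json={"message": "Promote new entity-matching algorithm"},
    auth=AUTH,
).raise_for_status()
```

The same operations are available through the lakectl CLI and the lakeFS SDKs; the raw REST calls are shown here only to keep the sketch dependency-light.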
LakeFS branches solve the isolation challenge in a straightforward way. Today, every developer and researcher creates their own data branch, which includes a complete snapshot of the prod data (at no additional storage cost). They can make a change and review its impact on the final data set without fear of interfering with someone else's work or polluting the production data.
It's also much easier for us to run parallel pipelines and to maintain stable ones for customers who want to upgrade at a slower cadence.
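Because lakeFS also exposes an S3-compatible gateway, pipeline code can read and write branch data with ordinary object-store tooling, addressing objects as repository/branch/path. The sketch below uses boto3; the endpoint, repository, object keys, and the pipeline stand-in are all hypothetical:

```python
# Illustrative sketch (not our actual pipeline code): reading production data
# and writing experimental output through the lakeFS S3 gateway with boto3.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # lakeFS S3 gateway, not AWS S3
    aws_access_key_id="ACCESS_KEY_ID",
    aws_secret_access_key="SECRET_ACCESS_KEY",
)

REPO = "business-profiles"  # in the gateway, the repository acts as the bucket


def run_experimental_pipeline(raw: bytes) -> bytes:
    """Hypothetical stand-in for the pipeline step under test."""
    return raw


# Read the current production output from the `main` branch.
prod = s3.get_object(Bucket=REPO, Key="main/profiles/output.parquet")["Body"].read()

# Write the experiment's output to an isolated branch; `main` stays untouched
# until the branch is reviewed and merged.
s3.put_object(
    Bucket=REPO,
    Key="experiment-new-matching/profiles/output.parquet",
    Body=run_experimental_pipeline(prod),
)
```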
In addition to making experiments easier, data branching brought benefits to our production pipeline as well.
After briefly considering implementing this ourselves as a metadata layer on top of git branches, we decided to partner with lakeFS for our data branching solution; chief among the reasons was that we didn't want to build and maintain that layer ourselves.
Over the past three months, we fully migrated our data pipelines to lakeFS. Overall, it's been a successful partnership. For the most part, the product has met or exceeded our expectations. Where it hasn't (mainly response time on certain endpoints), the lakeFS team has been fully engaged in turning around solutions, often in a matter of days. The attentiveness and sense of urgency are refreshing.
Moving to a data branching solution has paid off quickly for us. Just days after completing the migration, we had already reduced testing time by 80% on two different projects. And we're excited to see how data branching increases our product velocity in the coming quarter.