As a company rooted in public data, Enigma’s data engineering use case is the collection of small, heterogeneous, messy datasets rather than streaming real-time data. Instead of asking how to scale vertically to handle large data volume, our central question is one of horizontal scale: how can we acquire more public datasets, of various sizes, quickly, accurately, and stably?
We believe generalized, reliable tooling for data ingestion is the answer; however, getting to our current solution has been a process of fits and starts. Over several iterations, we have arrived at a system that is applied, collaborative, and agnostic.
Enigma has experimented with different models for data ingestion since our founding in 2011. About four years ago, our team built Parsekit, a proprietary Python toolkit developed by data engineers for fast ETL pipeline authoring. Designed to be simple to use and self-documenting, Parsekit employed a YAML configuration file that, once fed into the system, generated a fully operational pipeline that could easily be scheduled to run on a server resourced with several workers. It also offered some added flexibility, allowing users to insert ‘custom steps’ into the processor for edge cases.
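To make the configuration-driven idea concrete, here is a minimal sketch of how a declarative config can be turned into a runnable pipeline with room for custom steps. Everything in it is an assumption for illustration: the step names, the `build_pipeline` helper, and the config layout are hypothetical and are not Parsekit's actual schema or API. The config is shown as the plain dictionary a YAML file would parse into, so the sketch needs only the standard library.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]
Step = Callable[[List[Row], Dict[str, Any]], List[Row]]


def rename(rows: List[Row], params: Dict[str, Any]) -> List[Row]:
    """Standard step (hypothetical): rename columns using params['columns']."""
    mapping = params["columns"]
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]


def cast_int(rows: List[Row], params: Dict[str, Any]) -> List[Row]:
    """Standard step (hypothetical): cast one column to int."""
    column = params["column"]
    return [{**row, column: int(row[column])} for row in rows]


# Steps that ship with the (hypothetical) library.
STEP_REGISTRY: Dict[str, Step] = {"rename": rename, "cast_int": cast_int}


def build_pipeline(
    config: Dict[str, Any], custom_steps: Optional[Dict[str, Step]] = None
) -> Callable[[List[Row]], List[Row]]:
    """Resolve each configured step name to a function and chain them in order."""
    registry = {**STEP_REGISTRY, **(custom_steps or {})}
    steps = [(registry[s["step"]], s.get("params", {})) for s in config["steps"]]

    def run(rows: List[Row]) -> List[Row]:
        for fn, params in steps:
            rows = fn(rows, params)
        return rows

    return run


def drop_before(rows: List[Row], params: Dict[str, Any]) -> List[Row]:
    """A user-supplied 'custom step' for an edge case."""
    return [row for row in rows if row["year"] >= params["year"]]


# What a YAML config might look like once parsed (layout is an assumption).
CONFIG = {
    "steps": [
        {"step": "rename", "params": {"columns": {"yr": "year"}}},
        {"step": "cast_int", "params": {"column": "year"}},
        {"step": "drop_before", "params": {"year": 1900}},
    ]
}

pipeline = build_pipeline(CONFIG, custom_steps={"drop_before": drop_before})
result = pipeline([{"name": "a", "yr": "1899"}, {"name": "b", "yr": "1950"}])
```

In a design like this, the config stays readable while the standard steps cover the need. Each custom step, though, is code that lives outside the library's tested surface.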
By the time I joined Enigma a year ago, however, two problems with Parsekit had become clear.
First, authors were generally using only a handful of the standard Parsekit steps provided by the library; instead, all except the most basic users wound up shoehorning ‘custom code’ into the Parsekit pipeline at some point in the ingestion work. As the amount of custom code grew, maintenance became a nightmare.
Second, the Parsekit platform was meant to be like a tank: slow, steady, difficult to break, and highly abstracted from the well-engineered internals. But when it broke, it really broke.
Normally this would be fine, perhaps even desirable, but contributing to the codebase was often a slog, both process-wise and because of the system’s engineered complexity. It was easier to work around the problem at hand than to solve the underlying issue, particularly when operating under tight timelines and client agreements for the delivery of data. Moreover, because of the way it was engineered, the system ran slowly: past a million rows, Parsekit would take hours, sometimes even days, to process a dataset.
Parsekit, despite coming from a desire to make ETL faster and more maintainable, wound up with an over-engineered and inaccessible codebase; it was too complex and too abstract, with code development too far removed from day-to-day use. It became a bad fit for the data problems we face, yet we had locked ourselves out of using alternative methods.
We have two big needs for the tooling that will replace Parsekit:
These aren’t entirely engineering concerns; they are, essentially, process problems. We require tooling that reflects the unique process needs of our organization. It’s not enough to provide maintenance, efficiency, and stability guarantees.
From these two requirements, it became clear that the solution was to fully decouple the underlying orchestration from the code being executed. This freed us up to make our library less prescriptive and more of a garden of implementations, if you will: a user can go in and pick the functional implementations they desire, group them however they wish, and place them into (separately provided) cookware for execution. More on this in another blog post.
Thus, Kirby was born. Composed of well-tested functions, not steps or pipelines, Kirby operates as a kind of ETL buffet: users pick from small, fully orthogonal pieces with clear contracts, which also makes contribution easy.
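As a rough illustration of what "functions, not steps or pipelines" can look like, the sketch below defines a few small transforms that share one contract (rows in, rows out) and leaves the grouping entirely to the user. The function names and the `compose` helper are hypothetical and chosen for this example; they are not Kirby's actual API.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, str]
Transform = Callable[[Iterable[Row]], Iterator[Row]]


def strip_whitespace(rows: Iterable[Row]) -> Iterator[Row]:
    """Trim surrounding whitespace from every column name and value."""
    for row in rows:
        yield {key.strip(): value.strip() for key, value in row.items()}


def drop_empty_rows(rows: Iterable[Row]) -> Iterator[Row]:
    """Discard rows in which every value is an empty string."""
    for row in rows:
        if any(value for value in row.values()):
            yield row


def rename_columns(mapping: Dict[str, str]) -> Transform:
    """Build a transform that renames columns according to `mapping`."""
    def _rename(rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            yield {mapping.get(key, key): value for key, value in row.items()}
    return _rename


def compose(*transforms: Transform) -> Transform:
    """Chain transforms left to right; the library imposes no fixed order."""
    def _run(rows: Iterable[Row]) -> Iterator[Row]:
        return reduce(lambda acc, transform: transform(acc), transforms, rows)
    return _run


raw = [{" Name ": " Ada ", "yr": "1843 "}, {" Name ": "", "yr": ""}]
clean = compose(strip_whitespace, drop_empty_rows, rename_columns({"yr": "year"}))
result = list(clean(raw))
```

Because each function depends only on the shared row contract, any one of them can be tested, replaced, or contributed in isolation, and whatever executes `clean` does not need to know what is inside it.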
Kirby has three major qualities that we think will make it a long-lasting solution to our ETL use case:
It is Applied: contributions are only made on an as-required basis; implementations must be directly tied to an engineer’s needs to be accepted. No what-ifs, just what’s needed. This quickly yields a set of commonly needed functional themes.
It is also Collaborative: anyone in the organization can contribute directly to the code base if they wish to. By ‘enigma-sourcing’ the toolkit, we prevent the code from being inaccessible to new engineers while simultaneously reducing the desire to over-engineer the system. It also forces good documentation habits.
It is Agnostic: any library and technique, so long as it can be implemented via Python, is acceptable and will work. This allows us to take advantage of a variety of open source systems, from pandas to PySpark to dask and others, depending on the needs of the dataset being ingested.
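One way to picture this agnosticism, as a sketch under stated assumptions rather than a description of Kirby's internals: two interchangeable readers satisfy the same contract, one built on the standard library and one on pandas, and the caller chooses between them per dataset. The names `read_rows_stdlib`, `read_rows_pandas`, and `ingest` are hypothetical.

```python
import csv
import io
from typing import Callable, Dict, List

Row = Dict[str, str]
Reader = Callable[[str], List[Row]]


def read_rows_stdlib(text: str) -> List[Row]:
    """Parse CSV text into a list of dicts using only the standard library."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_rows_pandas(text: str) -> List[Row]:
    """Same contract, backed by pandas; all values are kept as strings."""
    import pandas as pd  # imported lazily so the stdlib path has no dependency

    frame = pd.read_csv(io.StringIO(text), dtype=str, keep_default_na=False)
    return frame.to_dict(orient="records")


def ingest(text: str, reader: Reader = read_rows_stdlib) -> List[Row]:
    """The caller picks the implementation that suits the dataset."""
    return reader(text)


SAMPLE = "name,year\nAda,1843\nGrace,1952\n"
rows = ingest(SAMPLE)
```

A small file might go through the standard-library reader, while a larger or messier one could be handed to a pandas, dask, or PySpark implementation that honors the same contract.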
Since we began development, Kirby has vastly accelerated our team’s pace. In tying the development directly to the needs of our data ingestion backlog, we’ve quickly arrived at a fairly stable set of common, highly orthogonal implementations that are used across pipelines to acquire vastly different datasets.
Nevertheless, we’ve found that by improving our tooling we can only achieve linear acceleration with our small team of data engineers. We might get three or even five times our current speed of ingestion, but there is no ‘hockey stick’ growth here. No matter how much of the ETL work we abstract out into shared implementations, our data are just too messy and too unpredictable for us to achieve the kind of horizontal scale we want.
Since Kirby’s genesis was process-driven, it is fitting that, having brought our development approach in line, our next step is to further adapt our processes. In the past few months, we’ve laid the foundations necessary to grow our data engineering organization horizontally so that we can ingest all the data we need: creating a remote-friendly work culture and exploring the possibility of tapping additional markets outside of New York for talent. Call it ‘distributed gardening’.