The decision to rebuild a product from the ground up is a high-risk, high-reward undertaking. Not only is it expensive and stressful, but most rebuilds fail. After an intense twelve-month build cycle, I'm reflecting on why we made this decision and what we've learned along the way.
Drive through any stretch of highway in America, and you'll pass dozens of establishments that represent years—sometimes generations—of labor and care. The tens of millions of private businesses in the US are the engines of our economy, creating employment, driving innovation, and enabling social mobility.
Yet in our increasingly digital world, these businesses are poorly represented in our information systems. The local auto repair shop, the family-owned restaurant, the innovative startup in a converted warehouse—these vital entities often exist as fragmented, inconsistent data points across disparate systems.
This fragmentation isn't just an abstract problem. It creates real friction in our economy: lenders struggle to assess risk accurately, digital companies can't serve these businesses effectively, and municipalities lack the insights needed to support local economies.
Eventually every system encounters fundamental limitations where incremental improvements yield increasingly marginal benefits. Our existing architecture had served us well, but to achieve our vision of creating the definitive mapping of US businesses, we needed to rethink our approach.
The old system struggled with the inherent complexity of how businesses exist in the real world—the messy relationship between brands, locations, and legal entities. Data freshness was inconsistent. And we couldn't leverage the recent advances in AI and machine learning in a thoughtful, integrated way.
After months of analysis and prototyping, we made the difficult decision to rebuild from the foundation up (we called this "burning the boats").
One of the most critical insights that drove our rebuild was recognizing that every business has two distinct identities: its brand identity (how it presents to customers) and its legal identity (how it interacts with financial and legal systems).
Traditional approaches typically conflate these identities or prioritize one over the other. But understanding a business requires comprehending both facets and how they interrelate.
For example, your favorite coffee chain might operate under a single recognizable brand, but behind that cohesive customer experience lies a complex web of franchise agreements, holding companies, and local LLCs. For compliance purposes, you need to understand the legal structure. For market intelligence, the brand relationships matter more.
Our data model explicitly separates and connects these entities, allowing flexibility in how the data can be used. You can apply KYB (Know Your Business) filters to prospecting lists to pre-qualify customers, or pull operational and market signals for a business undergoing compliance checks.
We've published our data model documentation and exposed it through an expressive GraphQL interface, allowing developers to query this complex relationship network in intuitive ways.
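To make the separation concrete, here's a minimal sketch of what a brand/legal split can look like in code. The class names, fields, and linkage below are illustrative assumptions chosen for readability, not our actual schema or GraphQL types.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch only: these classes and fields are assumptions,
# not Enigma's actual data model.

@dataclass
class Brand:
    """How a business presents to customers."""
    brand_id: str
    name: str                        # e.g. "Sunrise Coffee"
    website: Optional[str] = None

@dataclass
class LegalEntity:
    """How a business interacts with financial and legal systems."""
    legal_entity_id: str
    legal_name: str                  # e.g. "Sunrise Coffee of Ohio, LLC"
    parent_id: Optional[str] = None  # holding company or franchisor

@dataclass
class Location:
    """A physical place of business, linked to both identities."""
    location_id: str
    address: str
    brand_id: str                    # the storefront customers see
    operator_id: str                 # the legal entity that runs it

# One brand can map to many legal entities (franchisees, local LLCs),
# and one legal entity can operate locations under several brands.
brand = Brand("b-1", "Sunrise Coffee")
operator = LegalEntity("le-9", "Sunrise Coffee of Ohio, LLC", parent_id="le-1")
store = Location("loc-42", "100 Main St, Columbus, OH",
                 brand.brand_id, operator.legal_entity_id)
```

The point of the split is that a compliance query can walk the legal-entity graph while a market-intelligence query walks the brand graph, and both can join on the same locations.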
The half-life of business data is notoriously short. Locations open and close, ownership changes hands, and web presences evolve continuously.
We've architected our systems to re-evaluate every physical address in the US at least every 90 days, which allows us to promptly discover new businesses and identify locations that have closed. We inspect each US business domain with the same frequency.
This required building sophisticated orchestration systems to manage billions of data points and process terabytes of information efficiently. We've developed proprietary confidence scoring algorithms that help us prioritize where to direct our computational resources and human attention.
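To give a flavor of the prioritization problem, here's a toy sketch that ranks records for re-evaluation by staleness and confidence. The fields and weights are illustrative assumptions; the production scoring algorithms are considerably more involved.

```python
from dataclasses import dataclass
from datetime import date

# Toy prioritization sketch: fields and weights are illustrative assumptions,
# not the production confidence-scoring algorithm.

REFRESH_INTERVAL_DAYS = 90  # target cadence for re-evaluating each address

@dataclass
class AddressRecord:
    address_id: str
    last_checked: date
    confidence: float  # 0.0 (low trust in current data) .. 1.0 (high trust)

def refresh_priority(record: AddressRecord, today: date) -> float:
    """Higher score = check sooner. Staleness relative to the 90-day target
    and low confidence both push a record up the queue."""
    staleness = (today - record.last_checked).days / REFRESH_INTERVAL_DAYS
    return staleness + (1.0 - record.confidence)

records = [
    AddressRecord("a1", date(2025, 1, 2), confidence=0.9),
    AddressRecord("a2", date(2024, 10, 1), confidence=0.4),
]
queue = sorted(records, key=lambda r: refresh_priority(r, date(2025, 3, 1)),
               reverse=True)
print([r.address_id for r in queue])  # stalest, least-confident records first
```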
A central element of our development process was to re-evaluate our sources of small business data and discover new sources. During this process, we reconfirmed that (unfortunately) many of the most widely used sources of business data suffer from serious quality issues. This strengthened our commitment to provide a superior product.
We've taken a different path—one that starts with data quality as the foundation. We confirmed that traditional data cleaning techniques and classical statistics are effective for many problems. This creates a solid foundation on which to apply newer AI models in targeted areas where they’ve proved to be exceptionally powerful.
For instance, our entity resolution systems combine traditional probabilistic record linkage techniques with transformer-based models that can understand contextual relationships between entities. The hybrid approach gives us the best of both worlds: the interpretability and stability of classical methods, and the contextual understanding of modern AI.
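As a rough illustration (not our production system), here's a minimal sketch of a hybrid match score that blends a classical field-agreement score with a precomputed embedding similarity. The features, weights, and blend factor are assumptions chosen for readability.

```python
import math

# Hybrid matching sketch: the features, weights, and blend factor are
# illustrative assumptions, not Enigma's entity-resolution system.

def linkage_score(a: dict, b: dict) -> float:
    """Classical probabilistic-style score from field agreement."""
    score = 0.0
    score += 0.5 if a["zip"] == b["zip"] else 0.0
    score += 0.3 if a["phone"] == b["phone"] else 0.0
    score += 0.2 if a["name"].lower() == b["name"].lower() else 0.0
    return score

def embedding_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """Cosine similarity between name/description embeddings, assumed to be
    precomputed elsewhere by a transformer model."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    return dot / norm if norm else 0.0

def hybrid_match_score(a: dict, b: dict, alpha: float = 0.6) -> float:
    """Blend the interpretable classical score with the contextual embedding
    signal; alpha is an illustrative weighting, typically tuned on labeled pairs."""
    return (alpha * linkage_score(a, b)
            + (1 - alpha) * embedding_similarity(a["embedding"], b["embedding"]))
```

The classical component stays easy to explain and debug, while the embedding term helps with cases like "Sunrise Coffee Co." vs "Sunrise Coffee Company LLC" that exact comparisons miss.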
Similarly, we've deployed AI-focused approaches in ways that augment our human teams and processes. We've built custom AI agents that evaluate data quality, suggest improvements, and fix issues faster than would be possible with human intervention alone. These systems compound our ability to rapidly improve data quality and build new features.
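Greatly simplified, the shape of these pipelines looks something like the sketch below: automated checks route records either to a proposed fix or to a human review queue, with a stubbed function standing in for the model-backed agent. Everything here is illustrative, not our actual agent framework.

```python
# Sketch of an automated data-quality pass. The checks, the stubbed
# suggest_fix() helper, and the review queue are illustrative assumptions.

def suggest_fix(record: dict, issue: str) -> dict:
    """Stand-in for a model-backed agent that proposes a correction."""
    if issue == "missing_website" and record.get("domain"):
        return {**record, "website": f'https://{record["domain"]}'}
    return record  # no confident fix; leave unchanged

def quality_pass(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into auto-fixed and needs-human-review buckets."""
    fixed, needs_review = [], []
    for rec in records:
        if not rec.get("website"):
            proposal = suggest_fix(rec, "missing_website")
            (fixed if proposal != rec else needs_review).append(proposal)
        else:
            fixed.append(rec)
    return fixed, needs_review
```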
Even the best statistical models make mistakes when data is hard to interpret. My personal favorite example: the two completely independent Giant Supermarket chains that operate in adjacent states but have no corporate relationship whatsoever.
To address these edge cases, we've built a mechanism to establish our own definitive set of facts about businesses. At the core of Enigma's product is a database that allows us to assert facts that refine our statistical models and AI agents.
We've extended this capability to our customers, giving them the ability to suggest corrections when our data is wrong or incomplete. We review these suggestions daily and incorporate valid corrections within a seven-day window. This creates a virtuous cycle where our data accuracy continually improves, focused on the areas that matter most to our customers.
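Conceptually, the assertion layer acts as an override on top of model output. Here's a minimal sketch, with illustrative identifiers and field names, of how a human-verified fact can take precedence over a statistical estimate.

```python
# Illustrative sketch: asserted facts (from internal review or customer
# corrections) override model-derived values for specific fields.
# The identifiers and field names are assumptions, not the actual system.

asserted_facts = {
    # (business_id, field_name) -> human-verified value
    ("biz-123", "operating_status"): "closed",
    ("biz-456", "legal_name"): "Example Hardware of Texas, LLC",
}

def resolve_field(business_id: str, field_name: str, model_value: str) -> str:
    """Prefer a human-asserted fact when one exists; otherwise fall back
    to the statistical/AI model's best estimate."""
    return asserted_facts.get((business_id, field_name), model_value)

print(resolve_field("biz-123", "operating_status", "open"))  # asserted value wins
print(resolve_field("biz-999", "operating_status", "open"))  # model value used
```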
Our journey has been replete with novel technical challenges and surprising learnings. In the coming months, we'll dive deeper into several of these topics.
Our goal is to share a candid account of our journey, particularly where we made mistakes and where our initial hypotheses turned out to be incorrect. Engineering is messy, filled with false starts and unexpected revelations. So expect an unsanitized version.
If you're an engineer, data scientist, or product builder working on complex data problems, I’m optimistic that you'll find something valuable in these posts—whether it's a technical approach you can adapt or simply the reassurance that other highly talented teams struggle with challenges similar to the ones you may be facing.
As we put the finishing touches on our initial rebuild, we're transitioning to a new phase focused on expanding the reach and impact of this work. We're developing industry-specific extensions, enhancing our API capabilities, and deepening our integration with workflow tools where business decisions are made.
The ultimate measure of our success is the value this work delivers to our customers: helping lenders make better credit decisions for small businesses, enabling software companies to serve the middle market more effectively, and giving businesses themselves better insights about their competitive landscape.
If you're working on problems where an accurate understanding of US businesses is critical, I'd love to connect.
Ryan Green is the Chief Technology Officer at Enigma, where he leads the development of data products that bring clarity to the complex world of private businesses.