Correction 9:07 a.m. PT: An earlier version of this story stated that Databricks and dbt Labs were investors in Datafold. In fact, Peter Sonsini, a general partner at NEA and incoming Datafold board member, is also an investor in Databricks, and Amplify Partners, the other Datafold investor, has invested in dbt Labs.
Datafold, a startup that automates workflows and maintains data quality, today announced it has raised $20 million in a series A round of funding, led by NEA (New Enterprise Associates). The investment, which also saw participation from Amplify Partners, will be used by the company to further develop its data reliability platform and expand its team.
For any data-driven organization, ensuring the quality of data pipelines on a day-to-day basis is the key to having well-functioning dashboards, properly trained AI and ML models, and accurate analytics. However, with an explosion in the variety and volume of data as well as increasing requirements to deliver data products faster, data engineers who rely on manual methods of testing, monitoring, and quality assurance often struggle to keep up with the complexity.
Solution to ensure high-quality data pipelines
Founded in 2020, Datafold strives to solve these challenges and prevent data catastrophes with its end-to-end reliability platform. The solution automates multiple tedious workflows in the process of developing data products, starting from finding high-quality data to testing changes/fixes before deploying them into production and monitoring data pipelines already in production.
“Datafold provides pretty much a unified data catalog that enables data developers to find relevant datasets from a bunch of thousands and instantly assess how they work, meaning see distributions of data in every column, the quality metrics (whether a given column is populated or mostly nulled) and the lineage of the dataset,” Gleb Mezhanskiy, the founder and CEO of Datafold, told VentureBeat.
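To make the column-level checks Mezhanskiy describes more concrete, here is a minimal sketch of how a profile for one column (null share, distinct count, most common values) could be computed. It is an illustration only and not Datafold's implementation; the function name `profile_column` and the sample rows are hypothetical.

```python
from collections import Counter


def profile_column(rows, column):
    """Summarize one column: row count, share of nulls, distinct values, top values."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    total = len(values)
    return {
        "rows": total,
        # Fraction of rows where the column is missing or None.
        "null_fraction": (total - len(non_null)) / total if total else 0.0,
        "distinct": len(set(non_null)),
        # Up to three most frequent non-null values with their counts.
        "top_values": Counter(non_null).most_common(3),
    }


# Hypothetical sample data for illustration.
rows = [
    {"user_id": 1, "plan": "free"},
    {"user_id": 2, "plan": "pro"},
    {"user_id": 3, "plan": None},
    {"user_id": 4, "plan": "free"},
]
plan_profile = profile_column(rows, "plan")
print(plan_profile)
```

A catalog could presumably store a summary like this per column so that a developer can judge at a glance whether a dataset is populated enough to use.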
Companies like Bigeye and Monte Carlo also operate in the area of ensuring data reliability, although Mezhanskiy said that most of these and other solutions set up internally by large organizations are focused on detecting issues when the data pipeline is in production. As a result, by the time the team learns about the broken data, the damage is already done, with executives making decisions based on wrong dashboard numbers or ML models trained with bias.
Datafold, on the other hand, focuses on proactively identifying data anomalies before they go into production and do the damage. The solution’s flagship feature, Data Diff, automates data testing in the change management workflow and integrates it into the CI/CD process and code repositories. This shows data practitioners how a change in the data processing code will impact the resulting data and downstream products, such as BI dashboards, allowing them to catch issues that could stem from a hotfix/change before the code reaches production and the data is computed.
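The general idea of diffing data before a code change ships can be sketched in a few lines. The example below is a simplified illustration, not Datafold's Data Diff: it assumes both versions of a table fit in memory and share a primary key, and the names `diff_tables`, `prod`, and `branch` are hypothetical.

```python
def diff_tables(before, after, key):
    """Compare two versions of a table, matching rows on a primary key."""
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}

    added = sorted(new.keys() - old.keys())      # keys only in the new version
    removed = sorted(old.keys() - new.keys())    # keys only in the old version

    changed = {}
    for k in sorted(old.keys() & new.keys()):
        # Columns whose value differs between the two versions of this row.
        cols = sorted(
            c for c in old[k].keys() | new[k].keys()
            if old[k].get(c) != new[k].get(c)
        )
        if cols:
            changed[k] = cols

    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical output of the same query on production code and on a branch.
prod = [{"id": 1, "revenue": 100}, {"id": 2, "revenue": 250}, {"id": 3, "revenue": 75}]
branch = [{"id": 1, "revenue": 100}, {"id": 2, "revenue": 205}, {"id": 4, "revenue": 60}]

report = diff_tables(prod, branch, key="id")
print(report)
```

In a CI/CD setting, a report like this could be attached to a pull request so that a reviewer sees which rows and columns a code change would alter before it is merged.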
“Before using Datafold, our customer teams would be spending multiple hours [on] the same task. But, with our tooling, it takes them about five minutes. So it’s a massive, massive acceleration of testing,” Mezhanskiy emphasized, while noting that the company works with a “few dozen customers” and helps them ensure 100% code testing.
In addition to this, much like its competitors, the company also leverages machine learning to monitor and detect failures in old data products and pipelines that are already in production.
“We basically profile the data, compute the metrics, run them against our machine learning model, and answer the question of whether the data behaves as expected. If it doesn’t, we alert the customer over Slack or any other channel,” the CEO said.
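The profile, score, and alert loop the CEO describes can be illustrated with a much simpler stand-in. The sketch below swaps the machine learning model for a plain z-score test against recent history, which is an assumption made for illustration and not how Datafold scores metrics; `is_anomalous`, `check_metric`, and the `notify` callback are hypothetical names, and the callback stands in for a Slack or other channel integration.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it lies more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # perfectly flat history: any change is unexpected
    return abs(latest - mu) / sigma > threshold


def check_metric(name, history, latest, notify):
    """Run the check and call `notify` (hypothetical alert hook) on an anomaly."""
    if is_anomalous(history, latest):
        notify(f"{name}: latest value {latest} deviates from recent history")
        return True
    return False


# Hypothetical daily row counts for one table.
daily_row_counts = [1000, 1020, 990, 1010, 1005]

alerts = []  # collects messages in place of a real alerting channel
flagged = check_metric("orders.row_count", daily_row_counts, 400, alerts.append)
normal = check_metric("orders.row_count", daily_row_counts, 1008, alerts.append)
print(flagged, normal, alerts)
```

A fixed threshold like this ignores seasonality and trends, which is presumably one reason vendors in this space use learned models instead.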
Prominent Datafold customers include Patreon, Thumbtack, Faire, Dutchie, Amino, Truebill, and Vital.
The road ahead for data reliability
Moving forward, Datafold plans to advance its product, expanding its ability to automate more of the checks and tests data engineers do. The company believes that more than 80% of what data engineers do could be automated.
Along with this, it also plans to launch a smart-alerting feature that will prioritize data anomalies, helping teams decide what issues are the most critical and need to be addressed first. The feature is currently being tested with a select few customers.
In the near term, Datafold expects these improvements to drive fivefold growth. The company will also expand its team to 40 or more by the end of next year.