Sama, a company providing data to train machine learning systems, has raised $70 million in a series B round led by CDPQ with participation from First Ascent Ventures, Salesforce Ventures, Vistara Capital Partners, and existing investors. CEO Wendy Gonzalez says the company will use the funding to grow its platform with new products that "enable teams to manage the complete AI lifecycle."
Data scientists spend about 45% of their time on data preparation tasks, including loading and cleaning data, according to Anaconda. A separate report from Alation found that 97% of data leaders have suffered the consequences of ignoring data, whether by missing new revenue opportunities, forecasting performance poorly, or making bad investments. A third study, conducted by MIT Technology Review Insights and commissioned by Databricks, reveals that machine learning's business impact is limited largely by challenges in managing its end-to-end lifecycle.
Founded by Leila Janah, San Francisco, California-based Sama (formerly Samasource) developed its first relationships with partner delivery centers in 2008, focusing on data entry, sentiment analysis, and data transcription. In 2009, the company launched the initial version of its technology platform, SamaHub, and embarked on a slew of commercial projects, including providing images and annotations that Microsoft used to build out its Xbox Kinect.
"Janah believed that giving meaningful, living-wage work was the best way to permanently lift people out of poverty," Gonzalez told VentureBeat via email. "To date, we're the only AI training data provider with a responsible training and employment program that provides actionable career skills for underserved communities to bring us closer to a more equitable future of AI."
Today, Sama hosts a crowd-powered platform through which companies can obtain labeled data, such as videos, images, computer-generated shapes, radar, and natural language, to train AI models. Customers in industries such as transportation and navigation, retail and ecommerce, and robotics and manufacturing pay for datasets, while "crowdworkers" supply annotations in exchange for payment from Sama.
Sama competes with a host of data labeling and annotation platforms, including DefinedCrowd, CrowdFlower, Labelbox, Superb AI, and Scale AI, as well as incumbents like Amazon Mechanical Turk. But the company asserts that it delivers a superior product by tracking 160 million events per month to improve its platform and processes, such as the machine learning-assisted annotation tools it offers crowdworkers.
"Our labelers have three-year average tenure and are subject-matter experts who work with our customers to identify edge cases and recommend annotation best practices," Sama explains on its website. "Sampling provides feedback to quality managers to ensure teams are working efficiently and effectively, while 'hold' tasks and advanced scripting detect errors early in the pipeline."
When a company contracts with Sama, Sama's platform creates "micromodels" that generate prelabeled data to assist labelers with annotation. Annotators validate the machine learning-generated labels while Sama works with the company to identify edge cases and recommend annotation best practices.
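The general pattern behind model-assisted prelabeling can be sketched in a few lines. This is a hypothetical illustration, not Sama's actual system or API: the model, the confidence threshold, and the routing logic are all assumptions for the sake of the example.

```python
# Hypothetical sketch of model-assisted prelabeling with human review.
# Names and the 0.9 threshold are illustrative, not Sama's actual pipeline.
def prelabel(model, items, confidence_threshold=0.9):
    """Route each item: auto-accept high-confidence model labels,
    send everything else to a human annotator review queue."""
    auto, review = [], []
    for item in items:
        label, confidence = model(item)
        bucket = auto if confidence >= confidence_threshold else review
        bucket.append((item, label, confidence))
    return auto, review

# A stand-in "micromodel": a trivial rule with a made-up confidence score.
def toy_model(frame_description):
    if "stop sign" in frame_description:
        return "traffic_sign", 0.97
    return "unknown", 0.40

items = ["frame with stop sign", "blurry frame"]
auto, review = prelabel(toy_model, items)
print(len(auto), len(review))  # → 1 1
```

In practice the review queue feeds validated labels back into retraining, which is how a prelabeling loop improves over time.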
Post-annotation and deployment, Sama can provide ongoing feedback and monitor models in production. Beyond this, the platform can generate data on "frame-level" annotation and edge cases, producing reports designed to help get models to market faster.
Supervised learning, which requires labeled data to train, is the most common form of machine learning used in the enterprise. In a recent O'Reilly survey, 82% of respondents said their organization had adopted supervised learning, versus unsupervised learning (which doesn't require labels) or semi-supervised learning (which requires only a small number of labels). And according to Gartner, supervised learning will remain the type of machine learning that organizations leverage most through 2022.
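Why supervised learning creates demand for annotators is easy to see in a toy example: every training point must carry a human-supplied label. The nearest-centroid classifier below is a minimal sketch of this dependency, with invented data and class names; it stands in for any supervised model.

```python
# Toy illustration (not Sama's system): a supervised model cannot train
# without a label attached to every example — the work annotators supply.
from collections import defaultdict

def train_nearest_centroid(examples):
    """examples: list of ((x, y), label) pairs — labels are required."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (x, y), label in examples:
        sums[label][0] += x
        sums[label][1] += y
        counts[label] += 1
    # One centroid per class, averaged from its labeled examples.
    return {label: (s[0] / counts[label], s[1] / counts[label])
            for label, s in sums.items()}

def predict(centroids, point):
    px, py = point
    return min(centroids, key=lambda lb:
               (centroids[lb][0] - px) ** 2 + (centroids[lb][1] - py) ** 2)

# Every training point carries a human-supplied label ("cat"/"dog").
labeled = [((0.0, 0.0), "cat"), ((0.2, 0.1), "cat"),
           ((1.0, 1.0), "dog"), ((0.9, 1.1), "dog")]
model = train_nearest_centroid(labeled)
print(predict(model, (0.1, 0.0)))  # → cat
```

An unsupervised method would cluster the same points without the label column, which is why it sidesteps the annotation cost but cannot name its clusters.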
Labels can bear the hallmarks of inequality, however. For example, fewer than an estimated 2% of Mechanical Turk workers come from Global South countries, with the vast majority originating from the U.S. and India. ImageNet, a dataset that's been essential to recent progress in computer vision, wouldn't have been possible without the work of data labelers. But the ImageNet workers themselves made a median wage of $2 per hour, with only 4% making more than the U.S. federal minimum wage of $7.25 per hour, itself a far cry from a living wage.
Sama claims to pay annotators a higher rate than its competitors (about $8 a day), with the mission of providing opportunities to communities in underserved regions. In a three-year randomized trial conducted by MIT and Innovations for Poverty Action, crowdworkers in Nairobi, Kenya, who received both training and inclusion in Sama's hiring pool had lower unemployment rates and higher average monthly earnings than crowdworkers who received training alone.
The study didn't compare the outcomes of Sama's crowdworkers with those of workers employed by other data labeling startups. But Gonzalez says the results "point to the indisputable facts" and "demonstrate the value of [Sama's] impact-model on communities globally."
Sama, which employs 120 full-time workers and 3,500 annotators, counts Google, Nvidia, GM, Walmart, Getty, and over 25% of the Fortune 50 among its customers. Its crowdworkers annotated 1.5 billion data points in 2020 alone, and with the latest funding round, Sama's total capital raised stands at nearly $85 million.
"Our customers include Fortune 2000 companies," Gonzalez said. "Notably, Sama's … training data was recently tapped by Google to power its AI algorithm for Project Guideline, which helps those with visual impairments run independently. With our high-quality, accurate training data, the application is able to accurately approximate the runner's position and provide audio feedback so the runner can self-correct. Now, we're working to scale Project Guideline with a goal of making the solution an accessible option for the blind [and] visually impaired community."