Covered:
Challenges with AI/ML models
What is synthetic data?
How is synthetic data generated?
Tools for generating synthetic data
Why synthetic data?
Popular use cases for synthetic data
Industry use cases
Synthetic data startups
Current Challenges With Generating AI/ML Models:
Developing successful AI models requires access to large amounts of high-integrity data. However, collecting data at that scale and quality is rarely a frictionless process, and it raises several challenges.
Sensitive User Data
Many of the use cases involving business problems that AI could solve require access to customer data that is considered sensitive. Collecting highly sensitive user data can raise privacy concerns. Collecting sensitive user data can also make the business vulnerable to being targeted for data breaches. As a result, privacy regulations have been put into effect, such as GDPR (General Data Protection Regulation) and CCPA (California Consumer Privacy Act), that restrict the collection and use of personal data.
Data is Expensive
Data can be expensive to collect. Data is one of the most valuable commodities in the world, so purchased data sets carry significant price premiums. If you’re not buying the data, the labor and resources needed to collect robust data can also be costly and time-consuming.
Scarce Data Sets
Some data sets that are needed to build out AI and ML (machine learning) models can be scarce. For example, bank fraud is considered a rare event. Therefore, collecting enough data to develop ML models to forecast fraudulent transactions is challenging because fraudulent transactions are irregular.
If the A.I. industry is going to succeed, there needs to be a solution to these issues. The solution is synthetic data.
What Is Synthetic Data?
Synthetic data is data that is artificially generated rather than produced by actual events. It’s a type of data augmentation (a set of techniques that artificially increase the amount of data by generating new data points from existing data). Synthetic data is developed through a variety of algorithms and has been around since the 90s. However, because of the growth in computing power and storage capacity over the last 20 years, its use has become more widespread.
Gartner projects that synthetic data will completely overshadow real data in AI models by 2030.
How is Synthetic Data Generated:
As mentioned above, synthetic data is artificially generated and developed through various algorithms, depending on the situation.
Below is an example of J.P. Morgan’s process for generating synthetic financial datasets:
Step 1 – Compute metrics for the real data.
Step 2 – Develop a Generator (this may be statistical methods or an agent-based simulation).
Step 3 – Calibrate the Generator using the real data (optional).
Step 4 – Run the Generator to generate synthetic data.
Step 5 – Compute the metrics for the synthetic data.
Step 6 – Compare the metrics of the real data and synthetic data.
Step 7 – Refine the Generator to improve against comparison metrics.
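The seven steps above can be sketched as a small loop. This is a minimal illustration, not J.P. Morgan's actual implementation: the Gaussian generator, the choice of mean and standard deviation as metrics, the 5% tolerance, and the made-up "transaction amounts" are all assumptions chosen to keep the example short.

```python
import random
import statistics


def compute_metrics(values):
    """Steps 1 and 5: summarize a data set with a few simple metrics."""
    return {"mean": statistics.fmean(values), "stdev": statistics.stdev(values)}


class GaussianGenerator:
    """Step 2: a deliberately simple statistical generator (an assumption for this sketch)."""

    def __init__(self, mean=0.0, stdev=1.0, seed=0):
        self.mean = mean
        self.stdev = stdev
        self.rng = random.Random(seed)

    def calibrate(self, real_metrics):
        """Step 3: set the generator's parameters from the real data's metrics."""
        self.mean = real_metrics["mean"]
        self.stdev = real_metrics["stdev"]

    def generate(self, n):
        """Step 4: draw n synthetic values."""
        return [self.rng.gauss(self.mean, self.stdev) for _ in range(n)]


def compare(real_metrics, synthetic_metrics):
    """Step 6: relative gap per metric (0.0 means identical)."""
    return {
        name: abs(synthetic_metrics[name] - real_metrics[name]) / abs(real_metrics[name])
        for name in real_metrics
    }


def refine(generator, real_metrics, synthetic_metrics):
    """Step 7: nudge the generator's parameters to shrink the observed gaps."""
    generator.mean += real_metrics["mean"] - synthetic_metrics["mean"]
    generator.stdev *= real_metrics["stdev"] / synthetic_metrics["stdev"]


# Hypothetical "real" data: 5,000 transaction amounts, made up for illustration.
real_rng = random.Random(42)
real_data = [real_rng.gauss(120.0, 35.0) for _ in range(5000)]

real_metrics = compute_metrics(real_data)                # Step 1
generator = GaussianGenerator(seed=7)                    # Step 2
generator.calibrate(real_metrics)                        # Step 3

for round_number in range(5):
    synthetic_data = generator.generate(5000)            # Step 4
    synthetic_metrics = compute_metrics(synthetic_data)  # Step 5
    gaps = compare(real_metrics, synthetic_metrics)      # Step 6
    if max(gaps.values()) < 0.05:                        # assumed tolerance
        break
    refine(generator, real_metrics, synthetic_metrics)   # Step 7

print(real_metrics)
print(synthetic_metrics)
print(gaps)
```

A real generator would typically be far richer (for example, an agent-based simulation), but the compute, generate, compare, and refine structure stays the same.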
The preferred method for developing synthetic data varies, mostly depending on the business. Common choices include decision trees, deep learning techniques, and iterative proportional fitting.
The chosen method depends on the requirements of the synthetic data and the level of data utility.
After the synthetic data is produced, it needs to be compared with the real data sets. This assessment is broken down into two phases:
General-purpose comparisons – Comparing parameters such as distributions and correlation coefficients.
Workload-aware utility assessment – Comparing the accuracy of the outputs produced from the synthetic data with the outputs produced from the real data.
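Both phases can be illustrated with a toy data set. The sketch below is only one way to run such an assessment and rests on several assumptions: the data has a single numeric feature and a binary label, the "generator" simply resamples from per-class Gaussians fitted to the real data, and the "workload" is a one-threshold classifier. All function names are hypothetical.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def make_rows(rng, n, class_params):
    """Draw n (feature, label) rows; class_params maps label -> (mean, stdev)."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        mean, stdev = class_params[label]
        rows.append((rng.gauss(mean, stdev), label))
    return rows


def fit_class_params(rows):
    """Estimate the per-class mean and stdev of the feature."""
    params = {}
    for label in (0, 1):
        values = [x for x, y in rows if y == label]
        params[label] = (statistics.fmean(values), statistics.stdev(values))
    return params


def summarize(rows):
    """Phase 1 metrics: feature distribution plus feature/label correlation."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    return {"mean": statistics.fmean(xs), "stdev": statistics.stdev(xs), "corr": pearson(xs, ys)}


def train_threshold(rows):
    """Toy model: predict 1 when the feature exceeds the midpoint of the class means."""
    params = fit_class_params(rows)
    return (params[0][0] + params[1][0]) / 2


def accuracy(threshold, rows):
    """Share of rows where the threshold rule matches the label."""
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)


rng = random.Random(0)
true_params = {0: (0.0, 1.0), 1: (2.0, 1.0)}  # hypothetical ground truth
real_train = make_rows(rng, 2000, true_params)
real_test = make_rows(rng, 1000, true_params)
synthetic = make_rows(rng, 2000, fit_class_params(real_train))

# Phase 1: general-purpose comparison of distributions and correlations.
real_summary = summarize(real_train)
synthetic_summary = summarize(synthetic)

# Phase 2: workload-aware utility. Train on each data set, test on held-out real data.
acc_real = accuracy(train_threshold(real_train), real_test)
acc_synthetic = accuracy(train_threshold(synthetic), real_test)

print(real_summary)
print(synthetic_summary)
print(acc_real, acc_synthetic)
```

If the model trained on synthetic data scores close to the model trained on real data when both are evaluated on the same held-out real rows, the synthetic data is preserving the signal that this particular workload depends on.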
Tools for Generating Synthetic Data:
GPT-J – Open-source alternative to OpenAI’s GPT-3 text generation tool.
Synthea – Open-source tool popular in the medical field.
Scikit-learn – A Python machine learning library whose data set generators produce synthetic data for regression, clustering, and classification tasks.
SymPy – Used by data scientists who need more custom synthetic data sets for more specific needs, as it enables the creation and development of custom symbolic expressions.
pydbgen – Used to generate standard data sets, such as phone numbers or email addresses.
synthpop – an R package used to generate synthetic demographic data.
faker – A Python package that can generate synthetic data such as names, addresses, emails, Social Security numbers, etc.
SDV – A Python library for generating synthetic tabular, relational, and time-series data.
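As a small illustration of one tool from the list, the sketch below uses scikit-learn's make_classification to produce a labeled data set with a rare positive class, loosely echoing the bank fraud example from earlier. The framing as "transactions", the 2% fraud rate, and the feature counts are assumptions for the example. The features are abstract numeric columns, not realistic transaction fields.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Hypothetical setup: 1,000 "transactions", of which roughly 2% are labeled as fraud.
X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.98, 0.02],  # class balance: about 98% legitimate, 2% fraud
    flip_y=0.0,            # do not randomly flip any labels
    random_state=0,        # fixed seed so the data set is reproducible
)

label_counts = Counter(y.tolist())
print(X.shape)
print(label_counts)
```

Changing the weights argument is a quick way to control how rare the minority class is, which is useful for testing how a detection model behaves as the event becomes scarcer.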
Why Synthetic Data?
Synthetic data is a crucial component of developing more robust A.I. and ML models for various reasons. An essential quality of synthetic data is that it can be generated to meet specific demands or environments not available in existing data from actual events. This is advantageous in use cases such as:
When privacy requirements limit data availability or how it can be used.
When non-existing data is needed for testing.
When training data is needed for ML algorithms, such as the data required for self-driving cars, which is both expensive and complicated to collect (you can’t just throw a self-driving car on the 405 in Los Angeles on a weekday morning to collect data on the performance).
Popular Use Cases of Synthetic Data:
Data Sharing with 3rd Parties:
Companies in various sectors rely on partnering with 3rd parties for data sharing reasons. This is especially the case with Fintech and Healthtech. Synthetic data enables these companies and 3rd parties to share data that is just as valuable as the original data set while avoiding security or compliance risk.
Internal Data Sharing:
There are also restrictions on internal data sharing between different teams inside an organization, which slows innovation inside a company. Synthetic data can help speed innovation back up.
Cloud Migrations:
Moving private data to different cloud infrastructures involves security and compliance risks. Synthetic data can stand in for the data that is riskiest to move.
Lack of Historical Data:
In some cases, there is a limited amount of historical data to study or feed an ML model. Some examples of this could be a flash crash in the market, recessions, and new regimes of behavior. Without significant data on these types of anomalies, it can be challenging to study what underlying mechanisms could cause them in the future and how to avoid them. Using synthetic data can help provide stronger data sets with adequate sample sizes.
Eliminating Biases:
Real-world data often reflects the racism, sexism, and other biases present in society. ML and AI models trained on this data can learn and amplify these biases, in some cases on an extensive scale. Synthetic data can help conduct more thorough audits of the models, and it can even be weighted to produce a more neutral data set with fewer biases.
“One of the biggest challenges that fairness practitioners with big organizations face nowadays is that they actually don’t even know whether the systems are exhibiting bias or not. Reason being is that to know whether you’re discriminating against a certain user group, you need to know in which group a given user falls, and in many regulatory environments, it’s prohibited to use data about ethnicity, gender, and so on and so forth. So they’re kind of operating in the blind.” – Alexandra Ebert, Chief Trust Officer, Mostly AI
Industry Use Cases:
Financial Sector
Fraud identification – Using synthetic fraud data can help financial services create and test new detection methods.
User analytics – Synthetic transaction data can be used to analyze consumer behavior (again, synthetic data helps organizations stay within the regulatory rules and privacy mandates).
Manufacturing
Quality assurance – Synthetic data enables more effective testing of quality control systems. Anomalies are especially hard to identify in manufacturing because the range of possible disruptions (such as pandemics and wars) is effectively unlimited. Using synthetic data can help create models to forecast and strategize.
Healthcare
Healthcare analytics – It is widely known how important it is to maintain patient confidentiality. Synthetic data enables healthcare data professionals to use medical data (internally and externally) while still maintaining patient confidentiality.
Clinical trials – Synthetic data can be used as a baseline for future testing when there is little or no data available for clinical trials yet.
Robotics
Autonomous Things (AuT) – AuT includes self-driving cars and autonomous robots. Synthetic data helps companies test these solutions in thousands upon thousands of simulations, improving their autonomous projects while avoiding the expensive and time-consuming process of real-life testing.
Security (Online and Offline Environments)
Video surveillance training data – To take advantage of image recognition, companies need to create and train neural network models. However, this has limitations, such as acquiring the volume of data required and manually tagging the objects. Synthetic data can help train models at a lower cost than acquiring and annotating real training data.
Deep fakes – Deep fakes (a form of synthetic media data) can be used to test facial recognition systems.
Social Media
Content filtering – There is a (bloody) battle going on right now, internally and externally, about handling the fight against fake news, online harassment, and political propaganda by foreign actors. Using synthetic data for testing can help ensure the platform has flexible content filters and can identify and navigate social attacks, such as the use of troll farms.
Synthetic Data Startups:
Below is a list of startups that operate in the synthetic data industry. The list is broken up into two parts:
Structured synthetic data (tabular data)
Unstructured synthetic data (image & video)
Structured Synthetic Data Startups
betterdata
Founded: 2021
Latest Funding: Accelerator (Plug & Play Tech Center) – Jan 2021
About: vendor of privacy-preserving synthetic data solutions for AI, data sharing, or product development.
Datomize
Founded: 2020
Latest Funding: Seed - $6M (F2 Capital & TPY Capital) – Jan 2021
About: vendor of synthetic data solutions for the development, training, and testing of AI/ML models and applications.
Diveplane
Founded: 2017
Latest Funding: Angel Venture Round - $3M (Megan Rapinoe, Mia Hamm, & Sue Bird) – Sep 2021
About: Vendor of Geminai, a solution to generate synthetic ‘twin’ datasets with the same statistical properties as the original data.
Facteus
Founded: 2010
Latest Funding: Venture Series Unknown - $10M (Curql) – May 2022
About: Vendor of Mimic, a synthetic data engine to synthesize data assets that protect consumer privacy.
Gretel
Founded: 2019
Latest Funding: Series B - $52.2M (Greylock, Anthos Capital, Moonshots Capital) – Oct. 2021
About: vendor of a synthetic data generation library and APIs for developers and practitioners.
Hazy
Founded: 2017
Latest Funding: Seed - $3.5M (M12, Notion Capital, AlbionVC, Pentland Ventures) – Jan. 2020
About: vendor of a synthetic data platform for financial institutions that want to conduct data analysis.
Mostly AI
Founded: 2017
Latest Funding: Series B - $25M (Molten Ventures, Earlybird VC, Citi Ventures, 42CAP) – Jan. 2022
About: Vendor of Mostly Generate, a synthetic data generator that provides “as-good-as-real” yet fully anonymous data.
Unstructured Synthetic Data Startups:
DataGen
Founded: 2018
Latest Funding: Series B - $50M (Scale Venture Partners, Spider Capital, TLV Partners, Viola Ventures) – Mar. 2022
About: 3D Simulated training data provider for Visual AI learning and development.
Cognata
Founded: 2016
Latest Funding: Series B - $18M (Airbus Ventures, Emerge, Global Tech Ventures, Maniv Mobility, Scale Venture Partners) – Oct. 2018
About: provider of simulations for ADAS and Autonomous Vehicle developers.
AI Reverie
Founded: 2017
Latest Funding: Acquired by Meta (Facebook) for an undisclosed amount – Oct. 2021
About: provider of synthetic, simulated 3D environments.
Below is a full stack of startups that provide synthetic data. Like most of the startups mentioned above, they were all started in the last couple of years.