
Data managers beware - synthetic data still has limitations

By Neil Raden

February 16, 2022




(Dominos fall, via Shutterstock.com)

I've posted two articles about synthetic data, both highly skeptical about the hype surrounding it. In Can synthetic data bridge the AI training data gap? (July 2021), I wrote:

The problem of ineffective or biased AI training data persists. Can synthetically-generated data alleviate this or complement actual data? It's not a simple question, but it's one we need to assess.

I revisited the topic in Synthetic data for AI modeling? I'm still not convinced (October 2021). Other articles that delve into synthetic data raise their own questions. One example is The Advantages and Limitations of Synthetic Data, by Marcello Benedetti, which opens with a claim I don't understand:

Synthetic data is system-generated data that mimics real data, in terms of essential parameters set by the user. Synthetic data is any production data not obtained by direct measurement and is considered anonymized.

What are "essential parameters set by the user"? That sounds like a priori judgment (bias) about what is essential. 

As I continue to research and monitor synthetic data, some of those concerns are partially allayed. There are lots of issues to discuss, but the two that stand out for me are:

  1. There seems to be some confusion in the messaging about use cases. On the one hand, every document I read mentions using synthetic data to protect against the unlawful identification of individuals in sensitive data.
  2. The other prominent use case is to create data, either from what an expert already knows or from partial real data, and supplement it with AI-generated data for purposes that are still a little murky. The repeated claim in the industry is that "the generated data has the identical mathematical and statistical properties as the real-world dataset it is replacing."

Anonymization of sensitive data

In correspondence, Alexandra Ebert, Chief Trust Officer of the Austrian company MOSTLY AI, told me:

One of the main reasons for the hype around synthetic data is that it resolves the privacy-utility trade-off of legacy anonymization techniques.

She is referring to masking, encrypting or deleting PII (Personally Identifiable Information):

"…and thus doesn't come with the well-known drawbacks."

Which are, chiefly, that they don't work. It is too easy for a bad actor to defeat the scheme, or for a well-meaning actor to expose protected information inadvertently. When solving human problems, it helps to have human data, not anonymized clumps of it. Consider medical research: masking or deleting Personally Identifiable Information weakens the dataset by removing features relevant to the investigation.

However, changing a dataset to immunize it against unlawful disclosure of personal information is not a new idea. In fact, in some cases, the lack of PII can render a dataset useless, because a researcher may need those very fields for their experiments. Another technique, "differential privacy," which I wrote about here, has nothing to do with generating synthetic data. It protects individuals by adding calibrated statistical noise to query results or model outputs, so that no single record can be isolated. I've seen this technique used effectively in applications of Federated Learning. I can see in principle how synthetic data can solve some disclosure problems, but I don't yet understand the technique - or the limitations.
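To make that concrete, here is a minimal sketch of the Laplace mechanism that underlies most differential-privacy deployments: calibrated noise added to an aggregate answer, not to the raw records. It is a generic illustration (the count and epsilon values are made up), not any vendor's implementation.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng=None) -> float:
    """Answer a counting query under epsilon-differential privacy.

    A count has sensitivity 1 (one person joining or leaving the data
    changes it by at most 1), so Laplace noise with scale 1/epsilon
    is enough to mask any individual's presence.
    """
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical example: publish how many patients in a cohort have a
# condition without revealing whether any particular patient is included.
print(laplace_count(true_count=1284, epsilon=0.5))
```

Notice that nothing synthetic is generated here; the records stay put and only the released statistic is perturbed, which is why I treat differential privacy and synthetic data as separate ideas.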

Generating synthetic data - the claim that the generated data has identical mathematical and statistical properties to the real-world dataset it is replacing

Vendors claim that their solutions perform better than real-world data by correcting the bias typically found in datasets, particularly those containing historical records. That seems like an impossible problem unless it is limited to instances of particularly offensive data. Bias in data often arises mysteriously, as the model draws unintended relationships.

Synthetic data seems to suggest skipping the data wrangling and simply creating your own data, which raises the following questions:

  • The premise of ML is that the answer is in the data.
  • If you create your data, aren't you already telling the model the answer?
  • If you create your data, by whatever means, aren't you just as likely to insert your biases?
  • What do you mean, it's "computer-generated"?
  • What exactly is meant by "identical mathematical and statistical properties"? (A concrete check of what that could mean is sketched just below.)
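On that last question, the only operational meaning I can find for "identical" is "close under a battery of statistical tests." Here is a hedged sketch of such a check - per-column two-sample Kolmogorov-Smirnov tests plus the largest gap between the two correlation matrices. It is my own illustration, not any vendor's validation suite, and the tables it expects are hypothetical.

```python
import pandas as pd
from scipy.stats import ks_2samp

def compare_tables(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Per-column KS tests plus the largest pairwise-correlation gap."""
    rows = []
    for col in real.columns:
        stat, p_value = ks_2samp(real[col], synthetic[col])
        rows.append({"column": col, "ks_statistic": stat, "p_value": p_value})
    corr_gap = (real.corr() - synthetic.corr()).abs().to_numpy().max()
    print(f"largest absolute correlation difference: {corr_gap:.3f}")
    return pd.DataFrame(rows)
```

Passing a check like this is evidence of resemblance, not identity, and it says nothing about the higher-order structure a model might actually exploit.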

I'm open to the idea, because AI applications are still not generating the value we'd hoped for, and data is a big part of that. We've spent years and billions of dollars trying to "clean" our data. Using generative models makes some logical sense, depending on which type is used - Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), or autoregressive models - but naming the families doesn't explain how they operate. And the term "closely resembles" still troubles me. Where does "closely resembles" fit in?
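To make the families a little less abstract, here is a toy autoregressive generator over two hypothetical columns: fit the factorization p(x, y) = p(x) p(y | x), then sample in that order. GANs and VAEs replace these hand-fitted pieces with neural networks, but the logic - learn a distribution from real rows, then draw new rows from it - is the same.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "real" data: two correlated columns.
age = rng.normal(50, 10, size=1_000)
spend = 2.0 * age + rng.normal(0, 8, size=1_000)

# Fit p(age) as a Gaussian and p(spend | age) as a linear model plus noise.
mu, sd = age.mean(), age.std()
slope, intercept = np.polyfit(age, spend, deg=1)
resid_sd = (spend - (slope * age + intercept)).std()

# Sample synthetic rows column by column, in the same order.
age_new = rng.normal(mu, sd, size=1_000)
spend_new = slope * age_new + intercept + rng.normal(0, resid_sd, size=1_000)
synthetic = np.column_stack([age_new, spend_new])
```

The synthetic rows reproduce the fitted mean, spread and linear relationship, and nothing else; that is roughly what "closely resembles" has to mean.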

Why train a new system with data that is already consistent with the statistical profile of the actual data? Would it not be sufficient to merely run your model on what you have, since it doesn't appear likely the synthetic data will add any new insight? This is still an open question for me, though I suspect there is a good argument for it.

Plausible synthetic data use cases?

Synthetic data has been proposed for computer vision applications and for learning to navigate environments from visual information - in particular, for object detection, where the synthetic environment is built from a 3D model of the object.

A Fortune article, Why synthetic data is such a hot topic in the artificial intelligence world, describes how John Deere used synthetic images of plants in different weather conditions to improve the computer-vision systems that will eventually let autonomous tractors spot weeds, such as witchweed, and spray weed killer on them. (I'm not going to go into how much I hate the idea of autonomous tractors spraying weed killer.) Their object-identification software recognized a particular weed accurately only in bright sunshine, so they created millions of copies of the weed (digitally, of course) in different weather conditions and retrained their model, with excellent results.
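I don't have Deere's pipeline, but the flavor of the trick is easy to sketch: take one photo and mechanically vary its lighting. The Pillow snippet below does crude brightness and contrast jitter on a hypothetical image file; the real work reportedly involved fully synthetic imagery, so treat this only as an illustration of the idea.

```python
from PIL import Image, ImageEnhance

def lighting_variants(path: str, factors=(0.4, 0.7, 1.0, 1.3)):
    """Return copies of one image at several brightness/contrast levels."""
    base = Image.open(path).convert("RGB")
    variants = []
    for f in factors:
        img = ImageEnhance.Brightness(base).enhance(f)  # overcast through bright sun
        img = ImageEnhance.Contrast(img).enhance(f)
        variants.append(img)
    return variants

# e.g. lighting_variants("witchweed.jpg") yields four training images from one photo.
```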

But let's not forget that a human's "bias" about what the weed should look like in those conditions is being introduced into the training. Is that legitimate?

Is there any danger in this? Could it send the tractor on a rampage through Des Moines, Iowa, spraying every person who looks like witchweed? I suppose there is the possibility of false positives resulting in over-spraying, or false negatives leading to poor corn crop yields; I don't know.

Data Generation

In The Executive's Guide to Accelerating Artificial Intelligence and Data Innovation with Synthetic Data in Harvard Business Review, Tobias Hann, CEO of MOSTLY AI, writes:

Synthetic data is a tool that addresses many data challenges, particularly AI and analytics issues such as privacy protection, regulatory compliance, accessibility, data scarcity, and bias as well as data sharing and time to data (and therefore time to market).

Its use is most advanced in industries such as banking, insurance, health care, and telecommunications as well as parts of the public sector, where customer data is sensitive and subject to strict regulation.

Erste Group Bank AG, a MOSTLY AI customer, is considering this scenario:

"We could take a fraud case using synthetic data to exaggerate the cluster, exaggerate the number of people, and so on, so the model can be trained with much more accuracy," he says. "The more cases you have, the more detailed the model can be."

Notice he said "considering."

Slawek Kierner, senior vice president for enterprise data and analytics at Humana, says:

Humana has roughly 17 million members, and there's a treasure trove of information about your health care. … Using synthetic data to train AI is transforming the way Humana provides care. "With AI, we can predict what will happen with our members' health," Kierner says. "We created accurate models that predict the progression of a disease to the degree that we can predict when you will need an ER, so that we can be two steps ahead and help our members prevent emergencies."

Go back and read EPIC's disastrous attempt to do the same with their Deterioration Index. I do not believe the above paragraph.

Svetlana Sicular, a research vice president at Gartner, says:

Synthetic data, on a philosophical level, relieves AI from the limitations of looking only at the past and learning from the past data. With synthetic data, you can dream up the future, create the data that you think might come in the future, and create the models to deal with the future.

Show me the equations!

The closest I've come to either a mathematical or an algorithmic description of synthetic data is on GitHub:

The open-source Synthetic Data Vault (SDV) is a synthetic data generation ecosystem of libraries that promises some very useful features (if they perform): it models single-table, multi-table and time-series datasets "to later on generate new Synthetic Data that has the same format and statistical properties as the original dataset."

That seems like a contradiction. Does it or will it?

From the SDV paper:

The goal of this paper is to build a system that automatically creates synthetic data to enable data science endeavors. To achieve this, we present the Synthetic Data Vault (SDV), a system that builds generative models of relational databases. We are able to sample from the model and create synthetic data, hence the name SDV.

It's partially helpful in understanding the process, but at crucial points it defers to the actual driver, the SDV itself, and I haven't been able to locate a description of that, though it does name the models it uses.

That's all the detail I could find at the nuts-and-bolts level, though the model names hint at the use of GANs (Modeling Tabular Data using Conditional GAN).
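For what it's worth, the single-table workflow the SDV project documents is only a few lines. This is a sketch against the pre-1.0 sdv "tabular" API as I understand it; class and method names have shifted between releases, and the CSV file here is hypothetical.

```python
import pandas as pd
from sdv.tabular import GaussianCopula   # CTGAN is a drop-in alternative

real_df = pd.read_csv("customers.csv")    # hypothetical real table

model = GaussianCopula()
model.fit(real_df)                        # learn marginals and their dependence

synthetic_df = model.sample(num_rows=len(real_df))   # new rows, same "shape"
```

The brevity is the point, and also the problem: everything I'm asking about happens inside fit() and sample().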

On implementing the SDV, the paper says:

We also developed an algorithm that computes statistics at the intersection of related database tables. We then used a state-of-the-art multivariate modeling approach to model this data. The SDV iterates through all possible relations, ultimately creating a model for the entire database. Once this model is computed, the same relational information allows the SDV to synthesize data by sampling from any part of the database.

After building the SDV, we used it to generate synthetic data for five different publicly available datasets. We then published these datasets, and asked data scientists to develop predictive models for them as part of a crowdsourced experiment. By analyzing the outcomes, we show that synthetic data can successfully replace original data for data science. Our analysis indicates that there is no significant difference in the work produced by data scientists who used synthetic data as opposed to real data. We conclude that the SDV is a viable solution for synthetic data generation.

About the last two sentences - how does one follow from the other?
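Since that paper is the closest thing to equations I've found, here is my own reconstruction of the single-table multivariate-modeling step it describes: map each column to normal scores through its empirical distribution, estimate a correlation matrix over those scores, sample from the resulting Gaussian copula, and map back to the original scales. It is a sketch of the general technique, not SDV's code, and the example columns are invented.

```python
import numpy as np
from scipy import stats

def copula_sample(real: np.ndarray, n_samples: int, rng=None) -> np.ndarray:
    """Fit per-column marginals plus a Gaussian copula, then sample new rows."""
    rng = rng or np.random.default_rng()
    n, d = real.shape

    # 1. Map each column to normal scores via its empirical CDF.
    ranks = np.argsort(np.argsort(real, axis=0), axis=0) + 1
    z = stats.norm.ppf(ranks / (n + 1))

    # 2. The dependence structure: a correlation matrix of those scores.
    corr = np.corrcoef(z, rowvar=False)

    # 3. Draw from the fitted multivariate normal ...
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u_new = stats.norm.cdf(z_new)

    # 4. ... and map back through each column's empirical quantiles.
    return np.column_stack(
        [np.quantile(real[:, j], u_new[:, j]) for j in range(d)]
    )

# Invented example: income and age with a mild dependence.
rng = np.random.default_rng(0)
income = rng.lognormal(10, 0.5, size=500)
age = 30 + income / 2_000 + rng.normal(0, 5, size=500)
synthetic = copula_sample(np.column_stack([income, age]), n_samples=500)
```

Every sampled row is plausible under the fitted distributions, but none corresponds to a real record - which is simultaneously the privacy argument and my worry about where new insight is supposed to come from.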

Perhaps the authors on Wikipedia are more certain about synthetic data:

Synthetic data is increasingly being used for machine learning applications: a model is trained on a synthetically generated dataset with the intention of transfer learning to real data. Efforts have been made to construct general-purpose synthetic data generators to enable data science experiments. In general, synthetic data has several natural advantages:

  • once the synthetic environment is ready, it is fast and cheap to produce as much data as needed;
  • synthetic data can have perfectly accurate labels, including labeling that may be very expensive or impossible to obtain by hand;
  • the synthetic environment can be modified to improve the model and training;
  • synthetic data can be used as a substitute for certain real data segments that contain, e.g., sensitive information.

My take

After Mighty AI was acquired by Uber, their post on synthetic data and generative models was taken down, but you can find it in the Internet Archive:

Many in-the-know are hoping that, in the future, we'll get to a place where a whole bunch of synthetic examples from generative models plus a small number of real examples can train a system to the same level of performance as a large number of real examples. A lot of the literature on this type of synthetic data suggests we could eventually be generating artificially large datasets for "pre-training" - one could train a machine learning system up on reasonable-looking synthetic data to get it into a "reasonable" starting point to begin "fine-tuning" it on a much smaller amount of real data. The idea being that it'll have learned enough from the synthetic data to use the real data more efficiently.

Important note: synthetic data can augment real datasets, but cannot replace them. Even outside of use cases where the training data must be all-authentic, generative models themselves will always need plenty of real data to learn how to produce synthetic examples in the first place. No model will ever be able to generate examples of things it's never seen real examples of before.

The days of widely implemented "synthetic data for pre-training" approaches are still far off, and the jury is obviously still out on which applications the technique could be useful for. Interesting stuff, nonetheless, and we'll, of course, report back as the field makes advances.
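That pre-train-then-fine-tune recipe is easy to caricature in a few lines. The sketch below uses scikit-learn's warm_start option to continue training a classifier, first on plentiful made-up data and then on a small "real" sample. All of the data here is invented, and real pipelines would use far larger models; it only illustrates the sequencing the quote describes.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Plentiful synthetic examples (generated) and scarce "real" ones (measured).
X_syn = rng.normal(size=(10_000, 5))
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype(int)
X_real = rng.normal(size=(200, 5))
y_real = (X_real[:, 0] + 1.1 * X_real[:, 1] > 0).astype(int)   # slightly different rule

clf = SGDClassifier(random_state=0, warm_start=True)
clf.fit(X_syn, y_syn)    # "pre-train" on the synthetic set
clf.fit(X_real, y_real)  # "fine-tune": continue from the learned weights
```

Even in this toy, the fine-tuning step only helps because real examples exist, which is exactly the quote's caveat.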

Since my first article, I've begun to see a plausible use case for dealing with PII, and the John Deere application was very interesting. There is a ton of material online about synthetic data, most of it from vendors and journalists, and very little of it raises the questions I've raised. However, the bulk of machine learning development is around more mundane enterprise models. I'm still not convinced that drawing inferences from data you invented is a good idea, but watch this space.

