What is synthetic data — and how can it help you competitively?

Why it Matters: Synthetic data — which resembles real data sets but doesn’t compromise privacy — allows companies to share data and create algorithms more easily.

Companies committed to data-based decision-making share common concerns about privacy, data integrity, and a lack of sufficient data.

Synthetic data aims to solve those problems by giving software developers and researchers something that resembles real data but isn’t. It can be used to test machine learning models or build and test software applications without compromising real, personal data.

A synthetic data set has the same mathematical properties as the real-world data set it’s standing in for, but it doesn’t contain any of the same information. It’s generated by taking a relational database, creating a generative machine learning model for it, and generating a second set of data.

The result is a data set that contains the general patterns and properties of the original — which can number in the billions — along with enough “noise” to mask the data itself, said Kalyan Veeramachaneni, principal research scientist with MIT’s Schwarzman College of Computing.
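To make the fit-then-generate idea concrete, here is a minimal sketch in Python. It is not the method used by Veeramachaneni's team: it swaps in a plain two-column Gaussian model where a real generative machine learning model would go, and the toy table (hours studied, test score) and the function names are invented for illustration. It also makes no formal privacy guarantee. It only shows that freshly sampled rows can preserve the averages and correlation of the original without copying any original row.

```python
import math
import random
import statistics


def fit_gaussian(rows):
    """Estimate the means, standard deviations, and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample(model, n, rng):
    """Draw n brand-new rows from the fitted model instead of copying real ones."""
    # Guard against tiny negative values caused by floating-point rounding.
    resid = math.sqrt(max(0.0, 1.0 - model["rho"] ** 2))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mix z1 into y so the two columns keep their original correlation.
        y = model["my"] + model["sy"] * (model["rho"] * z1 + resid * z2)
        out.append((x, y))
    return out


# Hypothetical "real" table: (hours studied, test score).
real_rows = [(2.0, 55.0), (3.5, 61.0), (5.0, 70.0),
             (6.5, 74.0), (8.0, 88.0), (9.0, 85.0)]

model = fit_gaussian(real_rows)
synthetic_rows = sample(model, 5000, random.Random(0))
synthetic_model = fit_gaussian(synthetic_rows)
```

Comparing `model` with `synthetic_model` shows how closely the generated table tracks the statistics of the original. Production tools model many columns, mixed data types, and links between tables, which this sketch does not attempt.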

Gartner has estimated that 60% of the data used in artificial intelligence and analytics projects will be synthetically generated by 2024. Synthetic data offers numerous value propositions for enterprises, including its ability to fill gaps in real-world data sets and replace historical data that’s obsolete or otherwise no longer useful.

“You can take a phone number and break it down. When you resynthesize it, you’re generating a completely random number that doesn’t exist,” Veeramachaneni said. “But you can make sure it still has the properties you need, such as exactly 10 digits or even a specific area code.”
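A rough sketch of the phone number example, in Python, might look like the following. The function name and the sample area code are assumptions made for illustration, not part of any tool described in the article. The sketch keeps the two properties mentioned in the quote (exactly 10 digits and a chosen area code) and randomizes the rest. Unlike the process Veeramachaneni describes, it does not check whether the generated number happens to belong to a real subscriber.

```python
import random


def synthesize_phone(area_code, rng):
    """Return a random 10-digit string that keeps the given 3-digit area code."""
    if len(area_code) != 3 or not (area_code.isascii() and area_code.isdigit()):
        raise ValueError("area_code must be exactly 3 digits")
    # The remaining 7 digits are drawn at random, so nothing from a real number is reused.
    rest = "".join(rng.choice("0123456789") for _ in range(7))
    return area_code + rest


rng = random.Random(42)  # fixed seed so the example is repeatable
fake_number = synthesize_phone("617", rng)
```

The same pattern extends to other structured fields: decide which properties downstream software depends on, hold those fixed, and regenerate everything else.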

Synthetic data: “no significant difference” from the real thing

A decade ago, Veeramachaneni and his research team were working with large amounts of student data from an online educational platform. The data was stored on a single machine and had to be encrypted. This was important for security and regulatory reasons, but it slowed things down.
