
Synthetic Data Outliers: Navigating Identity Disclosure

by Mahmoud Ghorbel


Synthetic data generation uses algorithms such as GANs, VAEs, or diffusion models to produce artificial datasets that mimic the statistical characteristics of real-world data. Where traditional anonymization methods often fail to prevent breaches or re-identification, synthetic data is widely promoted as a stronger privacy solution: it lets organizations preserve data utility for tasks such as machine learning, even under strict data-sharing rules, while still enabling innovation and collaboration.
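As a concrete illustration, a tabular dataset can be synthesized with a GAN-based model such as CTGAN. The following is a minimal sketch using the open-source SDV library (the article does not name specific tooling); the CSV path is a placeholder:

```python
# pip install sdv
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Load the real tabular data (placeholder path).
real_data = pd.read_csv("credit_risk.csv")

# Describe the table so the synthesizer knows each column's type.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a GAN-based synthesizer and sample an imitation dataset
# of the same size as the original.
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=len(real_data))
```

TVAE and CopulaGAN, the other deep learning models studied, have analogous synthesizer classes (TVAESynthesizer, CopulaGANSynthesizer) in the same library.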

Recent research on synthetic data privacy has explored GAN-based, marginal-based, and workload-based approaches, with marginal-based methods performing best at preserving statistical properties. Differential privacy techniques are often applied to reduce re-identification risks, trading privacy against utility. Toolkits such as TAPAS and frameworks such as Anonymeter assess vulnerability to privacy attacks, yet maintaining data accuracy while guaranteeing privacy remains an open concern. Moreover, the very resemblance between synthetic and original data can itself create privacy risks, particularly re-identification: unique or rare data points (outliers) are especially vulnerable, as they may carry identifiable traits from the original dataset into the synthetic one. Although synthetic data is often treated as a robust privacy solution, much of the literature overlooks these risks, highlighting the need for additional safeguards.

To address this gap, an American-Portuguese research team recently published a paper analyzing the privacy risks that synthetic data poses for outliers. Their findings show that re-identifying outliers through linkage attacks is feasible and easily achieved. They further demonstrate that additional safeguards such as differential privacy can mitigate re-identification risks, but often at the cost of reduced data utility.

The research team followed a comprehensive methodology to evaluate the privacy and utility of synthetic data, focusing on outlier re-identification risk. Starting from the Credit Risk dataset, they generated 102 synthetic variants using deep learning models (TVAE, CTGAN, CopulaGAN) and differential privacy-based models (Independent, PrivBayes, DPsynthpop). Utility was assessed with SDMetrics, measuring boundary adherence, category coverage, range coverage, and statistical similarity between the original and synthetic datasets. To evaluate privacy, the team mounted a linkage attack: they identified outliers with the z-score method, then attempted to link synthetic data points back to original records via quasi-identifiers. Record linkage techniques, the Gauss method for numerical attributes and the Levenshtein method for categorical ones, scored potential matches; the results were then filtered and aggregated to determine how easily synthetic records, and outliers in particular, could be re-identified, as sketched below.
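A minimal sketch of such an attack follows, reusing real_data and synthetic_data from the earlier sketch. It uses the Python recordlinkage package, one common implementation of Gauss and Levenshtein comparisons (the paper's exact tooling is not named in the article), and the column names are illustrative, not taken from the Credit Risk dataset:

```python
# pip install recordlinkage scipy pandas
import numpy as np
import recordlinkage
from scipy import stats

quasi_numeric = ["age", "income"]        # hypothetical quasi-identifiers
quasi_categorical = ["job", "marital"]   # hypothetical quasi-identifiers

# 1) Flag outliers in the original data: any record with |z| > 3
#    on at least one numeric attribute.
z = np.abs(stats.zscore(real_data[quasi_numeric]))
outliers = real_data[(z > 3).any(axis=1)]

# 2) Pair every original outlier with every synthetic record.
indexer = recordlinkage.Index()
indexer.full()
pairs = indexer.index(outliers, synthetic_data)

# 3) Compare quasi-identifiers: a Gauss kernel for numeric fields,
#    Levenshtein similarity for categorical fields.
compare = recordlinkage.Compare()
for col in quasi_numeric:
    compare.numeric(col, col, method="gauss", offset=0, scale=2, label=col)
for col in quasi_categorical:
    compare.string(col, col, method="levenshtein", threshold=0.85, label=col)
scores = compare.compute(pairs, outliers, synthetic_data)

# 4) A synthetic record agreeing on (nearly) all quasi-identifiers is a
#    candidate re-identification of an original outlier.
matches = scores[scores.sum(axis=1) >= len(scores.columns) - 0.5]
print(f"{matches.index.get_level_values(0).nunique()} outliers re-identified")
```

The intuition: the closer a synthetic record sits to a rare original record on the quasi-identifiers, the more confidently an attacker holding auxiliary data can link the two.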

The study found that differential privacy-based models like DPsynthpop had lower data utility, especially regarding attribute coverage and statistical similarity, and tended to introduce more outliers. In contrast, the deep learning models produced higher-quality data concentrated on frequent values.
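Utility scores of this kind can be computed with SDMetrics' single-column metrics. A minimal sketch, again reusing real_data and synthetic_data from above; the column names are placeholders, and KSComplement stands in here for "statistical similarity" (it is one of several SDMetrics options):

```python
# pip install sdmetrics
from sdmetrics.single_column import (
    BoundaryAdherence,
    CategoryCoverage,
    KSComplement,
    RangeCoverage,
)

# Numeric column: do synthetic values stay within the real min/max,
# cover the real range, and match the real distribution?
col = "income"  # placeholder numeric column
print("boundary adherence:",
      BoundaryAdherence.compute(real_data[col], synthetic_data[col]))
print("range coverage:",
      RangeCoverage.compute(real_data[col], synthetic_data[col]))
print("statistical similarity (KS complement):",
      KSComplement.compute(real_data[col], synthetic_data[col]))

# Categorical column: are all real categories represented?
cat_col = "job"  # placeholder categorical column
print("category coverage:",
      CategoryCoverage.compute(real_data[cat_col], synthetic_data[cat_col]))
```

Each metric returns a score in [0, 1], so scores can be averaged across columns to compare the 102 synthetic variants on a common scale.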

Linkage attacks confirmed that re-identification is possible, with deep learning models posing the higher privacy risk: they yielded more potential re-identifications than the differential privacy-based models. The study thus exposed a trade-off between privacy and data quality. Differential privacy protection degraded data quality, while deep learning models improved data quality but increased the risk of re-identification, especially as the number of training epochs grew.

In summary, the study analyzed the re-identification risks of synthetic data generation models, with an emphasis on protecting extreme data points, or outliers. The results showed that outlier protection depends on the model: the differential privacy-based models produced more outliers at the cost of data quality, while the deep learning-based models concentrated on frequent values. To demonstrate the weaknesses of synthetic data, the research team also carried out a linkage attack, showing how outliers can be exploited to re-identify personal information.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter. Don’t forget to join our 55k+ ML SubReddit.


The post Synthetic Data Outliers: Navigating Identity Disclosure appeared first on MarkTechPost.

