Membership Inference and Re-identification Attacks: Stress-Testing Your Synthetic Data Pipeline
Published by Entrobit · April 2026
"It's Synthetic, So It's Private." Wrong.
There's a dangerous assumption floating around the synthetic data community: because records are generated from a model rather than copied from a database, they must be private. This assumption is wrong, and acting on it has real consequences.
Stadler, Oprisanu, and Troncoso put this to rest at USENIX Security 2022. Their experiments showed that naive synthetic data generators (those without formal privacy guarantees) can be nearly as vulnerable to privacy attacks as the original data. Membership inference attacks against standard non-DP generators achieved success rates far above random chance, sometimes approaching what's achievable against the original dataset.
That finding should change how you think about synthetic data privacy. Generation is not anonymization by default. It becomes anonymization only when the generation process includes mechanisms that provably limit information leakage, and when those limits are verified through adversarial testing.
Attack 1: Membership Inference
The Question
Was a specific individual in the training data used to generate this synthetic dataset?
This matters because membership itself can be sensitive. If the training data is a cancer registry, knowing someone was in it reveals their diagnosis. If it's a dataset of loan defaults, membership reveals a financial event.
How the Attack Works
The classic approach from Shokri et al. (IEEE S&P 2017) uses shadow models.
Build shadow datasets that approximate the target training data's distribution, using public data, auxiliary sources, or distributional assumptions.
Train shadow generators (ideally the same architecture as the target) on each shadow dataset and generate shadow synthetic data. You know ground truth here: which records were in each shadow training set, which weren't.
Train an attack classifier that takes a record and a synthetic dataset as input and predicts membership. Features typically include the distance from the target record to its nearest synthetic neighbor, local synthetic density around the target, and the record's likelihood under a model fit to the synthetic data.
Execute. Apply the trained attack model to the target synthetic dataset with query records.
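The four steps above can be sketched end to end. This is a minimal toy, not a production attack: the "generator" is a Gaussian sampler standing in for a real synthetic data model, and the attack features are the nearest-synthetic-neighbor distances described in step 3.

```python
# Toy shadow-model membership inference attack. The Gaussian "generator"
# is a stand-in assumption, not a real synthetic data model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

def toy_generator(train, n_synth):
    """Stand-in generator: sample from a Gaussian fit to the training data."""
    mu, cov = train.mean(axis=0), np.cov(train, rowvar=False)
    return rng.multivariate_normal(mu, cov, size=n_synth)

def attack_features(records, synth):
    """Per-record features: distance to the nearest synthetic neighbor and
    mean distance to the 5 nearest synthetic neighbors (local density)."""
    nn = NearestNeighbors(n_neighbors=5).fit(synth)
    dists, _ = nn.kneighbors(records)
    return np.column_stack([dists[:, 0], dists.mean(axis=1)])

# Step 1: shadow datasets drawn from an auxiliary population the attacker holds.
population = rng.normal(size=(4000, 4))
X_attack, y_attack = [], []
for _ in range(10):  # Step 2: ten shadow generators with known membership
    idx = rng.permutation(len(population))
    members, non_members = population[idx[:200]], population[idx[200:400]]
    shadow_synth = toy_generator(members, n_synth=500)
    X_attack.append(attack_features(members, shadow_synth))
    X_attack.append(attack_features(non_members, shadow_synth))
    y_attack += [1] * len(members) + [0] * len(non_members)

# Step 3: attack classifier trained on shadow ground truth.
clf = LogisticRegression().fit(np.vstack(X_attack), y_attack)

# Step 4: apply to the target synthetic dataset with query records.
target_train = rng.normal(size=(200, 4))
target_synth = toy_generator(target_train, n_synth=500)
scores = clf.predict_proba(attack_features(target_train, target_synth))[:, 1]
print(f"mean membership score for true members: {scores.mean():.2f}")
```

In a real audit, the shadow generators would share the target's architecture and the query set would mix known members with known non-members so attack accuracy can be measured.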
Reading the Results
Attack accuracy of 55% means the synthetic data is already leaking information. At 70-80%, it's a serious privacy failure. At 90% or above, the synthetic data provides almost no protection.
The metric that matters most: True Positive Rate at a fixed 1% False Positive Rate. This focuses on the attacker's ability to confidently identify members, which is the most dangerous scenario.
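Computing this metric takes a few lines. A minimal sketch, assuming the attack emits a membership score per query record and we know the true labels, as we do when auditing our own pipeline:

```python
# TPR at a fixed false positive rate, from per-record attack scores.
import numpy as np

def tpr_at_fpr(labels, scores, target_fpr=0.01):
    """Pick the score threshold whose FPR on non-members is <= target_fpr,
    then report the fraction of true members scoring above it."""
    non_member_scores = scores[labels == 0]
    # Threshold at the (1 - target_fpr) quantile of non-member scores.
    threshold = np.quantile(non_member_scores, 1 - target_fpr)
    return (scores[labels == 1] > threshold).mean()

# Illustrative scores: members score higher than non-members on average.
rng = np.random.default_rng(1)
labels = np.array([1] * 100 + [0] * 100)
scores = np.concatenate([rng.normal(0.7, 0.2, 100), rng.normal(0.3, 0.2, 100)])
print(f"TPR @ 1% FPR: {tpr_at_fpr(labels, scores):.2f}")
```

A near-zero TPR at 1% FPR means the attacker cannot confidently name members even when allowed a few false alarms; a high value means confident identification, the dangerous case.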
Attack 2: Attribute Inference
The Question
Given knowledge of someone's non-sensitive attributes (quasi-identifiers), can you infer their sensitive attribute from the synthetic data?
Example: an attacker knows a person's age, zip code, and occupation and wants to infer their HIV status. If the synthetic data shows that people matching those quasi-identifiers overwhelmingly have a positive status, the attacker can infer the target's status with high confidence.
Why It Works
The risk scales with quasi-identifier uniqueness. Sweeney's 1997 finding that 87% of the U.S. population can be uniquely identified by zip code, birth date, and sex illustrates how few quasi-identifiers are needed. If synthetic data faithfully preserves fine-grained conditional distributions and the quasi-identifier combination is nearly unique, attribute inference becomes straightforward.
Attack 3: Linkage
The Question
Can records in the synthetic dataset be linked to records in an external database (voter registration, public records, social media profiles)?
How It Works
Identify variables present in both the synthetic and auxiliary datasets. Search for synthetic records that closely match auxiliary records on these linking variables. A confident match lets you attribute the synthetic record's non-linking variables (potentially sensitive) to the individual identified through the external source.
Linkage attacks are particularly dangerous because the attacker doesn't need to know the training data. They just need an overlapping external dataset. Faithful reproduction of the joint distribution also faithfully reproduces the patterns that make records unique and linkable.
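The mechanics are a database join. A minimal sketch of exact-match linkage with pandas, using hypothetical column names (age, zip, occupation) as the linking variables:

```python
# Toy exact-match linkage: join synthetic records to a named auxiliary
# dataset on shared quasi-identifiers. Column names are illustrative.
import pandas as pd

synth = pd.DataFrame({
    "age": [34, 52, 41], "zip": ["02139", "94103", "10001"],
    "occupation": ["nurse", "teacher", "lawyer"],
    "diagnosis": ["positive", "negative", "positive"],  # sensitive attribute
})
aux = pd.DataFrame({  # e.g. voter registration, with names attached
    "name": ["A. Smith", "B. Jones"],
    "age": [34, 52], "zip": ["02139", "94103"],
    "occupation": ["nurse", "teacher"],
})

link_cols = ["age", "zip", "occupation"]
# A confident match attributes the synthetic record's sensitive values
# to the individual identified through the external source.
linked = aux.merge(synth, on=link_cols, how="inner")
print(linked[["name", "diagnosis"]])
```

Real attacks relax the exact match to fuzzy matching on noisy linking variables, which makes them harder to defend against, not easier.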
The Red Team Checklist
Every synthetic dataset should face adversarial evaluation before deployment. Here's the practical checklist.
1. Distance to Closest Record (DCR)
For each synthetic record, compute the distance (Euclidean on normalized features, typically) to the nearest real record. Plot the distribution.
Red flag: Very small DCR values (effectively duplicating real records) or average DCR smaller than the average inter-record distance in the real data. The generator is memorizing.
Healthy: DCR distribution similar to or larger than inter-record distances in the real data.
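A sketch of the DCR computation with scikit-learn, assuming numeric arrays `real` and `synth` and min-max normalization fit on the real data:

```python
# Distance to closest record, compared against the real data's own
# nearest-neighbor distances (the healthy baseline).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))    # stand-ins for real/synthetic tables
synth = rng.normal(size=(500, 4))

# Normalize both sets with the real data's min/max so distances compare.
lo, hi = real.min(axis=0), real.max(axis=0)
real_n, synth_n = (real - lo) / (hi - lo), (synth - lo) / (hi - lo)

nn = NearestNeighbors(n_neighbors=2).fit(real_n)
dcr = nn.kneighbors(synth_n)[0][:, 0]       # synth -> nearest real record
real_nnd = nn.kneighbors(real_n)[0][:, 1]   # real -> nearest other real record

print(f"mean DCR: {dcr.mean():.3f}  mean real NN distance: {real_nnd.mean():.3f}")
# Red flag if mean DCR is much smaller than the real data's NN distances.
```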
2. Nearest Neighbor Distance Ratio (NNDR)
Each synthetic record's distance to its nearest real record, divided by that real record's distance to its own nearest real neighbor.
Red flag: NNDR values near zero. Synthetic records are much closer to real records than real records are to each other. Memorization signal.
Healthy: NNDR centered around 1.0.
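A sketch of NNDR under the same assumptions (numeric arrays, scikit-learn nearest neighbors):

```python
# Nearest neighbor distance ratio: synth-to-real distance divided by that
# real record's distance to its own nearest real neighbor.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))    # stand-ins for real/synthetic tables
synth = rng.normal(size=(500, 4))

nn = NearestNeighbors(n_neighbors=2).fit(real)
d_synth, idx = nn.kneighbors(synth)   # each synth record's nearest real records
d_real = nn.kneighbors(real)[0]       # column 0 is self (distance 0), column 1
                                      # is the nearest *other* real record

nearest_real = idx[:, 0]
nndr = d_synth[:, 0] / d_real[nearest_real, 1]
print(f"median NNDR: {np.median(nndr):.2f}")  # near 0 signals memorization
```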
3. Membership Inference Attack
Use the shadow model approach described above, or Synthcity's built-in membership inference evaluation. Measure attack success.
Red flag: Accuracy significantly above 50%, especially with high TPR at low FPR.
Healthy: Accuracy near 50%.
4. Attribute Inference Test
Select quasi-identifier columns and a sensitive attribute. For each unique quasi-identifier combination in the training data, query the synthetic data for matches and predict the sensitive attribute. Compare against a baseline using only public information.
Red flag: Prediction accuracy significantly above the public-information baseline.
Healthy: No significant improvement from the synthetic data beyond what public information provides.
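A toy sketch of this test, with hypothetical column names and a mode-guess baseline standing in for the public-information predictor:

```python
# Attribute inference test: predict the sensitive attribute from synthetic
# records matching each training record's quasi-identifiers, vs. a baseline
# that always guesses the overall mode. Column names are illustrative.
import pandas as pd

rows = [("34", "02139", 1), ("34", "02139", 1), ("52", "94103", 0),
        ("52", "94103", 0), ("41", "10001", 1), ("41", "10001", 1)]
synth = pd.DataFrame(rows, columns=["age", "zip", "hiv_status"])
train = pd.DataFrame([("34", "02139", 1), ("52", "94103", 0), ("41", "10001", 1)],
                     columns=["age", "zip", "hiv_status"])

qi = ["age", "zip"]
baseline = synth["hiv_status"].mode()[0]  # stand-in for public information

hits, baseline_hits = 0, 0
for _, row in train.iterrows():
    match = synth[(synth[qi] == row[qi]).all(axis=1)]  # QI-matching records
    guess = match["hiv_status"].mode()[0] if len(match) else baseline
    hits += int(guess == row["hiv_status"])
    baseline_hits += int(baseline == row["hiv_status"])

print(f"attack accuracy: {hits/len(train):.2f}  baseline: {baseline_hits/len(train):.2f}")
```

In this toy, the synthetic data lets the attacker beat the baseline on every record, exactly the red flag the checklist describes.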
5. Outlier Memorization Check
Do training data outliers appear as outliers in the synthetic data? Extreme values (very high income, rare diagnoses, unusual demographics) are the most re-identifiable records. If synthetic data reproduces them faithfully, it may be disclosing the individuals behind them.
Red flag: Training data outliers replicated in synthetic data with matching values.
Healthy: Synthetic data smooths outliers or generates them in different locations.
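A sketch of the check, flagging training records beyond the per-feature 99th percentile (one extreme record is planted for illustration) and measuring how closely the synthetic data reproduces them:

```python
# Outlier memorization check: do training outliers have near-identical
# counterparts in the synthetic data?
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 3))
train[0] = [8.0, 8.0, 8.0]                  # plant one extreme record
synth = rng.normal(size=(1000, 3))          # stand-in for generator output

hi = np.quantile(train, 0.99, axis=0)
outliers = train[(train > hi).any(axis=1)]  # training data outliers

nn = NearestNeighbors(n_neighbors=1).fit(synth)
dists = nn.kneighbors(outliers)[0][:, 0]
# Red flag: near-zero distances mean the generator replicated the outliers.
print(f"{len(outliers)} outliers, min distance to synthetic: {dists.min():.3f}")
```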
6. Holdout Comparison
Run the same distance and membership inference analyses on a holdout set (same distribution, not in training). Attack metrics for training members vs. holdout records should be similar.
Red flag: Training members significantly more identifiable than holdout records.
Healthy: No significant difference.
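A sketch of the comparison, using distance to the nearest synthetic record as the statistic:

```python
# Holdout comparison: training members should be no closer to the synthetic
# data than held-out records from the same distribution.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))
holdout = rng.normal(size=(500, 4))   # same distribution, not in training
synth = rng.normal(size=(500, 4))     # stand-in for generator output

nn = NearestNeighbors(n_neighbors=1).fit(synth)
d_train = nn.kneighbors(train)[0][:, 0]
d_holdout = nn.kneighbors(holdout)[0][:, 0]

# Red flag: training members systematically closer to synthetic records.
gap = d_holdout.mean() - d_train.mean()
print(f"mean distance gap (holdout - train): {gap:.3f}")
```

The same comparison applies to membership inference scores: run the attack on both groups and check that members score no higher than holdout records.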
Why Differential Privacy Is the Only Provable Defense
Every attack above exploits the same underlying problem: a generator that learns the training data too well encodes information about specific individuals in its outputs.
Differential privacy with parameter ε guarantees that the probability of any synthetic output changes by at most a multiplicative factor of e^ε when any single record is added or removed. Membership inference can't achieve much above 50%, because the released output looks nearly identical either way.
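The guarantee translates directly into a ceiling on attack accuracy. For a balanced prior over membership of a single record, pure ε-DP bounds accuracy by e^ε / (1 + e^ε), a standard consequence of the hypothesis-testing view of DP. A few values:

```python
# Balanced membership inference accuracy ceiling implied by pure epsilon-DP:
# accuracy <= e^eps / (1 + e^eps).
import math

for eps in (0.1, 0.5, 1.0, 3.0):
    ceiling = math.exp(eps) / (1 + math.exp(eps))
    print(f"eps={eps:>4}: accuracy ceiling ~ {ceiling:.3f}")
```

At ε = 0.1 an attacker can barely beat a coin flip; at ε = 3 the formal ceiling is already above 95%, which is why small budgets matter when membership itself is sensitive.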
The formal guarantee protects against all attacks simultaneously: not just the ones listed above, but any attack anyone could ever devise, including techniques not yet invented. That universality is what separates DP from empirical testing. Empirical tests check for specific known attacks; DP bounds the leakage available to any of them.
The two approaches complement each other. DP provides the provable foundation. Empirical testing validates that the implementation is correct and the protection works in practice.
What If You Don't Need DP?
Not every deployment requires formal guarantees. For internal use cases where synthetic data stays in a controlled environment, or for data with low sensitivity, the overhead may not be justified.
In those cases, the red team checklist provides empirical privacy assessment. If membership inference hovers near 50%, DCR values look healthy, and attribute inference tests pass, the synthetic data provides practical privacy, just without a formal guarantee.
But understand the limitation: empirical assessment protects against the attacks you tested. It says nothing about novel attacks. If the data is genuinely sensitive or leaves the building, DP is the defensible choice.
Practical mitigations for non-DP generators: limit training epochs (reduces memorization), add regularization (dropout, weight decay), post-process to remove synthetic records with very small DCR, cap extreme values to reduce outlier memorization. These help. They don't substitute for formal guarantees.
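The DCR-based post-processing step can be sketched as a filter. The threshold choice here, the 1st percentile of the real data's own nearest-neighbor distances, is an illustrative assumption, not a standard:

```python
# Post-processing mitigation: drop synthetic records whose distance to the
# nearest real record falls below a threshold.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
synth = np.vstack([rng.normal(size=(499, 4)), real[0:1]])  # one exact copy

nn = NearestNeighbors(n_neighbors=1).fit(real)
dcr = nn.kneighbors(synth)[0][:, 0]

# Illustrative threshold: 1st percentile of the real data's NN distances.
real_nnd = NearestNeighbors(n_neighbors=2).fit(real).kneighbors(real)[0][:, 1]
threshold = np.quantile(real_nnd, 0.01)
filtered = synth[dcr >= threshold]       # the planted copy is dropped here
print(f"kept {len(filtered)} of {len(synth)} synthetic records")
```

Note the caveat this filtering shares with the other mitigations: it removes known memorization symptoms, but offers no guarantee about what the remaining records still reveal.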
Defense in Depth
The strongest approach layers defenses.
Layer 1: DP during generation. The mathematical foundation.
Layer 2: Empirical evaluation. The red team checklist after every generation run. Validates the implementation.
Layer 3: Access controls. Even with DP, limit access on a need-to-know basis.
Layer 4: Monitoring. Track downstream usage. If sharing scope changes, reassess.
Platforms with built-in privacy evaluation and configurable DP mechanisms provide this as integrated capability: generation, privacy mechanisms, evaluation, and documentation in a single governed workflow. Nothing gets skipped. Every dataset's privacy posture is documented and auditable.
Synthetic data privacy isn't automatic. It's earned through mechanism design, rigorous evaluation, and provable guarantees. The attacks are real and well-documented. The defenses are real and available. The synthetic data that's safe is the synthetic data that's been tested.
References: Stadler, Oprisanu & Troncoso (2022), Synthetic Data – Anonymisation Groundhog Day, USENIX Security; Shokri et al. (2017), Membership Inference Attacks Against Machine Learning Models, IEEE S&P; Sweeney (1997); SmartNoise attack notebooks; Synthcity privacy evaluation metrics.