Measuring What Matters: A Framework for Evaluating Synthetic Data Quality
Published by Entrobit · April 2026
Most Teams Skip This Step. Don't.
Generating synthetic data has never been easier. A few lines of code, a few clicks in a platform, and you've got a dataset. But generating synthetic data and generating good synthetic data are separated by evaluation, and too many teams never cross that gap.
I've seen data teams eyeball a couple of histograms, declare the synthetic output "looks fine," and push it to downstream consumers. This is the equivalent of deploying an ML model without ever testing it on held-out data. The consequences are predictable. Models trained on bad synthetic data underperform. Privacy guarantees get assumed rather than verified. Stakeholders lose trust in the entire program.
Here's the evaluation checklist that should run before synthetic data gets used for anything. Four dimensions: distributional fidelity, downstream utility, privacy, and fairness.
Distributional Fidelity
Does the synthetic data preserve the statistical properties of the original? This question has layers.
Marginals
The simplest check. Does each column have the same distribution in synthetic and real data? Compare histograms or kernel density estimates for continuous columns. Compare frequency tables for categorical ones. Kolmogorov-Smirnov, chi-squared, total variation distance: take your pick.
But passing marginal checks means almost nothing by itself. A synthetic dataset can match every single marginal perfectly while having completely destroyed the correlations between columns. Tall people weighing the same as short people. Patients with severe diagnoses showing identical lab values to healthy ones. Marginals are necessary. They are nowhere near sufficient.
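The marginal checks above are easy to script. A minimal sketch with NumPy and SciPy, assuming data arrives as dicts mapping column names to 1-D arrays (that layout, and the `categorical` flag, are illustrative assumptions, not a specific library's API):

```python
import numpy as np
from scipy.stats import ks_2samp

def tv_distance(a, b):
    """Total variation distance between two empirical categorical distributions."""
    cats = np.union1d(np.unique(a), np.unique(b))
    p = np.array([(a == c).mean() for c in cats])
    q = np.array([(b == c).mean() for c in cats])
    return 0.5 * np.abs(p - q).sum()

def marginal_report(real, synth, categorical=()):
    """Per-column marginal comparison: KS statistic for numeric columns,
    total variation distance for categorical ones. Both are in [0, 1];
    values near 0 mean the marginals match."""
    out = {}
    for col in real:
        if col in categorical:
            out[col] = ("tv", tv_distance(real[col], synth[col]))
        else:
            out[col] = ("ks", ks_2samp(real[col], synth[col]).statistic)
    return out
```

Thresholds are a judgment call; a KS statistic or TV distance above roughly 0.1 on a decently sized sample usually warrants a look at the histograms.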
Pairwise Correlations
Does the synthetic data preserve relationships between column pairs? For continuous-continuous pairs, compare Pearson and Spearman coefficients. For categorical pairs, compare contingency tables and Cramér's V. For mixed pairs, correlation ratios or ANOVA F-statistics.
A side-by-side correlation matrix heatmap (real vs. synthetic) is probably the single most informative visualization you can produce for synthetic data quality assessment. It immediately reveals whether the generator captured the dependency structure or collapsed to independent marginals.
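Behind that heatmap sits a simple computation. A sketch for all-numeric data (real tooling such as SDV's quality report also handles mixed types; this only covers the Pearson case):

```python
import numpy as np

def correlation_gap(real, synth):
    """Largest absolute difference between the real and synthetic Pearson
    correlation matrices. Rows are records, columns are numeric features.
    A gap near 0 means the pairwise dependency structure was preserved;
    a large gap means the generator collapsed toward independent marginals."""
    c_real = np.corrcoef(real, rowvar=False)
    c_syn = np.corrcoef(synth, rowvar=False)
    return np.abs(c_real - c_syn).max()
```

Plotting `c_real` and `c_syn` side by side (e.g. with `matplotlib.pyplot.imshow`) gives the heatmap described above; the scalar gap is handy for automated thresholds.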
Higher-Order Dependencies
Pairwise correlations miss multi-way interactions. A three-way dependency (the effect of age on blood pressure depends on medication status) can be destroyed while all pairwise correlations remain intact.
Evaluating this is harder because the number of possible interactions grows combinatorially, but there are practical approaches. Conditional distribution tests check whether the distribution of column A, conditioned on specific values of B and C, matches between real and synthetic data. Both Synthcity and SDV provide tools for this. Alternatively, train a classifier to distinguish real from synthetic records. If the model can tell them apart with accuracy significantly above 50%, something is different. This "discriminative score" is a holistic check that catches any systematic statistical difference a neural network can detect.
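The discriminative score is a few lines with scikit-learn. A sketch using a random forest rather than a neural network (the classifier choice is an assumption; any reasonably flexible model works):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def discriminative_score(real, synth, seed=0):
    """Train a classifier to tell real rows from synthetic ones and return
    mean cross-validated accuracy. ~0.5 means the classifier cannot
    distinguish them; well above 0.5 means something is systematically off."""
    X = np.vstack([real, synth])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
```

Feature importances from the fitted classifier are a useful bonus: they point at which columns give the synthetic data away.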
Full Joint Distribution
For a comprehensive assessment, compute distance metrics on the full joint distribution: Wasserstein distance (via random projections for high-dimensional data), maximum mean discrepancy (MMD), or Jensen-Shannon divergence on discretized joints. These produce single numbers useful for comparing synthesizers or tracking quality over time, but they sacrifice interpretability. A bad score tells you something's wrong without telling you what.
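Of these, MMD is the easiest to sketch by hand. A minimal version using scikit-learn's RBF kernel; note this is the biased (V-statistic) estimate, and the `gamma` bandwidth is an assumption you should tune, e.g. with the median heuristic:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd_rbf(X, Y, gamma=1.0):
    """Biased estimate of squared maximum mean discrepancy between samples
    X and Y under an RBF kernel. Near 0 when the two samples come from the
    same distribution; grows as the joints diverge."""
    kxx = rbf_kernel(X, X, gamma=gamma)
    kyy = rbf_kernel(Y, Y, gamma=gamma)
    kxy = rbf_kernel(X, Y, gamma=gamma)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()
```

As the text notes, treat the number as a ranking tool between synthesizers, not as a diagnosis.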
Downstream Utility
Distributional fidelity is a proxy. What actually matters: does the synthetic data work for its intended purpose?
Train on Synthetic, Test on Real (TSTR)
TSTR is the gold standard for ML use cases. The procedure:
1. Train a model on real data, evaluate on held-out real data. Record performance (accuracy, F1, AUC-ROC, RMSE). This is your ceiling.
2. Train the same model, same hyperparameters, on synthetic data. Evaluate on the same real test set. This measures the utility cost of using synthetic data.
3. Compute the gap. Small gap = high utility. Large gap = the synthetic data is missing something the model needs.
Important nuances: the choice of model matters. A logistic regression and a gradient-boosted ensemble may show very different TSTR gaps because they lean on different statistical properties. The metric matters too. A dataset that preserves majority-class behavior but trashes minority-class structure will look fine on accuracy and terrible on F1.
Run TSTR across multiple models and multiple metrics. If the synthetic data holds up across the board, you can trust it.
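The three-step procedure above fits in one small harness. A sketch with scikit-learn, written so you can swap in different model factories and metrics per the advice above (the logistic-regression usage below is just one instance):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def tstr_scores(real_train, synth_train, real_test, model_factory,
                metric=accuracy_score):
    """Fit one model on real training data and one on synthetic training
    data, then score both on the same held-out real test set. Returns
    (real_score, synth_score); the difference is the utility cost of
    training on synthetic data."""
    (Xr, yr), (Xs, ys), (Xt, yt) = real_train, synth_train, real_test
    real_score = metric(yt, model_factory().fit(Xr, yr).predict(Xt))
    synth_score = metric(yt, model_factory().fit(Xs, ys).predict(Xt))
    return real_score, synth_score
```

Loop this over several factories (e.g. `LogisticRegression`, a gradient-boosted ensemble) and several metrics to build the full picture.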
Analytical Query Accuracy
For dashboarding and reporting use cases, TSTR is less relevant. Instead, define representative queries (group-by aggregations, percentile calculations, conditional counts, cross-tabs) and compute relative error between results on real and synthetic data. The NIST DP Synthetic Data Challenge took exactly this approach, scoring datasets against hundreds of analytical queries.
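A query-accuracy harness is mostly bookkeeping. A sketch where each query is just a function from dataset to scalar; the example queries and column names below are illustrative assumptions:

```python
import numpy as np

def relative_query_errors(real, synth, queries):
    """Relative error of each scalar analytical query when run on synthetic
    instead of real data. `queries` maps a name to a callable taking a
    dataset (here, a dict of name -> array) and returning a number."""
    errs = {}
    for name, q in queries.items():
        truth = q(real)
        errs[name] = abs(q(synth) - truth) / (abs(truth) + 1e-12)
    return errs
```

Group-bys and cross-tabs fit the same shape: each cell of the result becomes one scalar query, so a single dashboard can expand into hundreds of checks, as in the NIST challenge.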
Domain-Specific Validation
Some failures are invisible to generic statistical metrics. Synthetic clinical data should have systolic blood pressure higher than diastolic. Synthetic financial data should satisfy accounting identities. Synthetic sensor data shouldn't show 100-degree temperature jumps between consecutive readings.
Build these domain constraints into your evaluation pipeline. They catch things that no amount of distributional testing will find.
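Encoding such rules is straightforward: each constraint is a vectorized predicate, and the pipeline reports the violation rate. A sketch with hypothetical clinical column names:

```python
import numpy as np

def constraint_violation_rates(data, constraints):
    """Evaluate row-level domain constraints and return the fraction of
    rows violating each named rule. `data` is a dict of name -> array;
    each rule returns a boolean array, True where the row is valid."""
    return {name: float(1.0 - rule(data).mean())
            for name, rule in constraints.items()}

# Hypothetical clinical rules; replace with your domain's invariants
CLINICAL_RULES = {
    "systolic_gt_diastolic": lambda d: d["systolic"] > d["diastolic"],
    "age_in_range": lambda d: (d["age"] >= 0) & (d["age"] <= 120),
}
```

A sensible acceptance threshold is often zero: a single impossible record is already a red flag about the generator.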
Privacy
If you're generating synthetic data for privacy-preserving sharing, evaluating privacy isn't optional. It's a core quality dimension.
Distance Metrics
How close are synthetic records to real ones? If every synthetic record hugs a real record, your generator is memorizing, and privacy protection is minimal.
Distance to Closest Record (DCR) measures the minimum distance from each synthetic record to any real record. Nearest Neighbor Distance Ratio (NNDR) divides that distance by the distance to the second-closest real record, which flags synthetic points that latch onto one specific real individual. A healthy synthetic dataset shows DCR values comparable to or larger than the nearest-neighbor distances within the real data itself. Synthetic records shouldn't cluster around real individuals.
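The DCR check and its natural baseline can be sketched with scikit-learn's nearest-neighbor index. Euclidean distance on standardized numeric features is an assumption here; mixed-type data needs an appropriate metric:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr_with_baseline(real, synth):
    """Returns (median DCR, median real-to-real nearest-neighbor distance).
    DCR: distance from each synthetic record to its closest real record.
    Baseline: each real record's distance to its closest *other* real
    record. A healthy generator keeps the first comparable to the second."""
    index = NearestNeighbors(n_neighbors=2).fit(real)
    d_syn, _ = index.kneighbors(synth, n_neighbors=1)
    d_real, _ = index.kneighbors(real, n_neighbors=2)  # [:, 0] is the self-match
    return float(np.median(d_syn[:, 0])), float(np.median(d_real[:, 1]))
```

If the median DCR is far below the baseline, the generator is likely copying training records rather than sampling from a learned distribution.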
Membership Inference Attacks
Can an attacker determine whether a specific person was in the training set? The standard approach uses shadow models: train multiple generators on known data subsets, build an attack classifier from the results, then apply it to the target. Synthcity provides built-in membership inference evaluation; SmartNoise includes attack notebooks.
A well-privatized dataset should yield attack accuracy near 50% (random chance). Significantly higher means information is leaking.
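To build intuition for what attack accuracy measures, here is a deliberately naive distance-based attack: guess "member" whenever a candidate record sits unusually close to some synthetic record. This is a toy sketch of the idea only; real evaluations use shadow-model attacks like those in Synthcity:

```python
import numpy as np

def distance_attack_accuracy(synth, members, non_members):
    """Toy membership inference: threshold each candidate's distance to its
    nearest synthetic record at the median over all candidates, guessing
    'member' below the threshold. Returns attack accuracy; ~0.5 means the
    synthetic data leaks nothing this crude attack can exploit."""
    def min_dist(points):
        diffs = points[:, None, :] - synth[None, :, :]
        return np.sqrt((diffs ** 2).sum(-1)).min(axis=1)
    d_m, d_n = min_dist(members), min_dist(non_members)
    thresh = np.median(np.concatenate([d_m, d_n]))
    correct = (d_m <= thresh).sum() + (d_n > thresh).sum()
    return correct / (len(members) + len(non_members))
```

Even this crude attack reliably catches a memorizing generator, which is why distance metrics and membership inference tend to fail together.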
Attribute Inference
Even without determining membership, an attacker might infer sensitive attributes. If the synthetic data reveals that people with a specific age-zipcode-occupation combination overwhelmingly have a particular diagnosis, the attacker can infer a real person's diagnosis from their quasi-identifiers.
Test whether the synthetic data makes sensitive attribute prediction easier than publicly available information would allow.
Formal Guarantees
If the data was generated with differential privacy, the (ε, δ) guarantee provides a provable upper bound on leakage, regardless of the attack strategy. But don't rely on formal guarantees alone. Supplement with empirical testing. The ε tells you the theoretical worst case; the empirical metrics show what's actually happening in practice.
Fairness
Synthetic data can preserve, amplify, or smooth out biases from the original data. When synthetic data trains models that affect people (credit scoring, hiring, clinical decisions), fairness evaluation is mandatory.
Demographic parity. Compare outcome distributions across protected groups in real vs. synthetic data. If the real data shows 30% approval for Group A and 25% for Group B, the synthetic data should preserve that ratio. Not equalize it (unless you specifically asked for fairness-aware generation). Not widen the gap.
Conditional fairness. Check whether conditional relationships involving protected attributes are preserved. Does the relationship between income and loan approval stay the same within each demographic group?
Bias amplification. Some generators, especially those that struggle with underrepresented groups, can make existing biases worse. Compare downstream model performance across demographic groups using real-trained vs. synthetic-trained models. If the synthetic-trained model performs worse for minority groups, the generator is amplifying bias.
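The demographic parity check above reduces to comparing per-group outcome rates between the two datasets. A sketch with hypothetical `group` and `approved` column names:

```python
import numpy as np

def max_parity_shift(real, synth, group_col="group", outcome_col="approved"):
    """Largest absolute shift in any protected group's positive-outcome
    rate between real and synthetic data. Near 0 means group-level outcome
    distributions were preserved; a large value means the generator
    narrowed or widened the gap. Datasets are dicts of name -> array."""
    shifts = []
    for g in np.unique(real[group_col]):
        r = real[outcome_col][real[group_col] == g].mean()
        s = synth[outcome_col][synth[group_col] == g].mean()
        shifts.append(abs(r - s))
    return max(shifts)
```

The bias-amplification check follows the same pattern one level up: compute per-group TSTR scores and compare the group gaps of the real-trained and synthetic-trained models.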
Interestingly, Pereira et al. (2024) found in PLOS ONE that DP synthetic data can sometimes improve fairness by smoothing out discriminatory patterns. A finding worth investigating rather than assuming.
Which Metrics for Which Use Case?
ML model training: TSTR across multiple models and metrics. Pairwise correlations. Discriminative score. Add fairness if the model affects individuals.
Analytical reporting: Query accuracy on representative queries. Marginal and conditional distributions.
External data sharing: Privacy evaluation is non-negotiable. Membership inference, attribute inference, and (if DP was used) the formal guarantee.
Protected populations: Fairness evaluation is non-negotiable. Demographic parity, bias amplification, conditional fairness for high-stakes applications.
Always, as a minimum: Marginal distributions and pairwise correlations. Fast, interpretable, catches gross failures.
Embed Evaluation in the Pipeline
None of this helps if it only happens once. Evaluation needs to run automatically after every generation run, with results stored alongside the synthetic data, compared against acceptance thresholds, and flagged when something falls below the bar.
Best-in-class platforms embed this natively: distributional fidelity, utility, privacy, and fairness in a single integrated assessment after every run. Threshold-based alerts prevent bad synthetic data from reaching consumers. Quality records accumulate for auditors and regulators.
That automation is what separates production systems from research tools. Skip evaluation and you're generating data on faith. Run it rigorously and you're generating data you can defend.
References: Synthcity evaluation metrics and Benchmarks API; SDV Quality Report and Diagnostic modules; Pereira et al. (2024), Assessment of DP Synthetic Data, PLOS ONE; Stadler, Oprisanu & Troncoso (2022), Synthetic Data — Anonymisation Groundhog Day, USENIX Security.