Fairness-Aware Synthetic Data: When Generating More Data Means Generating Fairer Data
Your training dataset is biased. This isn't an accusation — it's a near-certainty. Historical data reflects historical decisions, and historical decisions were often discriminatory. A lending dataset from 2010-2020 encodes a decade of biased approval patterns. A hiring dataset reflects years of résumé screening that penalized non-Western names. A clinical dataset underrepresents populations that had less access to healthcare.
Train a model on biased data and the model learns the bias. This isn't news. What's less obvious is that synthetic data generation can either amplify that bias or deliberately correct it — depending on how you generate.
How Standard Generators Handle Bias
A standard synthetic data generator (CTGAN, TVAE, Gaussian Copula) learns the joint distribution of the training data and reproduces it faithfully. That faithfulness is the problem. If the training data has a 40% approval rate for Group A and a 25% rate for Group B — even after controlling for creditworthiness — the synthetic data will faithfully reproduce that disparity.
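Quantifying that disparity before generating anything is a one-liner in pandas. A minimal sketch on a toy lending table (column names and counts invented for illustration):

```python
import pandas as pd

# Toy lending data with a baked-in approval gap (hypothetical columns/values)
df = pd.DataFrame({
    'group':    ['A'] * 100 + ['B'] * 100,
    'approved': [1] * 40 + [0] * 60 + [1] * 25 + [0] * 75,
})

# Approval rate per group: the disparity a faithful generator will reproduce
rates = df.groupby('group')['approved'].mean()
gap = rates['A'] - rates['B']
print(rates.to_dict())                    # {'A': 0.4, 'B': 0.25}
print(f"approval gap: {gap:.2f}")         # approval gap: 0.15
```

Run the same check on the synthetic output: a faithful generator will show roughly the same 15-point gap.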
Worse, some generators amplify bias. Stadler et al. (2022) showed that generators struggling with underrepresented groups tend to over-smooth their distributions, effectively erasing the statistical signal that distinguishes minority subpopulations. A generator that represents Group B as a blurred version of the overall population is more harmful than one that reproduces the original bias, because it destroys the information needed to detect and measure the disparity.
Three Approaches to Fairness-Aware Generation
1. Rebalancing During Generation
The simplest intervention: generate synthetic data with equal representation across protected groups. If the training data has 70% Group A and 30% Group B, generate synthetic data at 50/50 (or at the population rate, if known).
# Conditional generation with SDV (v1.x API)
from sdv.single_table import CTGANSynthesizer
from sdv.sampling import Condition

# metadata: a SingleTableMetadata object describing training_data's columns
synth = CTGANSynthesizer(metadata)
synth.fit(training_data)

# Generate balanced representation across the protected attribute
conditions = [
    Condition(num_rows=5000, column_values={'group': 'A'}),
    Condition(num_rows=5000, column_values={'group': 'B'}),
]
balanced_data = synth.sample_from_conditions(conditions=conditions)
This doesn't change the conditional distributions (the approval rate within each group stays the same), but it ensures the model sees enough examples from both groups to learn group-specific patterns rather than treating the minority group as noise.
2. Conditional Fairness Constraints
Go further: generate synthetic data where the outcome distribution is independent of the protected attribute, conditional on legitimate features. If creditworthiness score X should produce the same approval probability regardless of group membership, enforce that in the generation process.
Synthcity's fairness-aware plugins support this through constrained generation. The generator is trained with an additional penalty term that pushes conditional outcome distributions toward parity across groups.
This is more aggressive than rebalancing — it changes the data's conditional relationships, not just the marginals. Use it when the goal is to train models that satisfy demographic parity or equalized odds constraints.
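Setting Synthcity's plugin internals aside, the penalty idea can be sketched generically. The toy PyTorch term below is an illustration, not Synthcity's actual loss: it measures the gap in generated outcome rates between two groups, and a generator's training loss would add it scaled by a fairness weight (`lambda_fair` is a hypothetical name):

```python
import torch

def parity_penalty(outcome_probs: torch.Tensor, group: torch.Tensor) -> torch.Tensor:
    """Penalty pushing P(outcome | group=0) toward P(outcome | group=1).

    outcome_probs: generated outcome probabilities, shape (n,)
    group: 0/1 protected-attribute labels for the generated rows, shape (n,)
    """
    rate_0 = outcome_probs[group == 0].mean()
    rate_1 = outcome_probs[group == 1].mean()
    return (rate_0 - rate_1).abs()

# Inside a training loop (sketch):
#   loss = generator_loss + lambda_fair * parity_penalty(probs, group)
```

As written this targets demographic parity; conditioning the rates on legitimate features (e.g., computing the penalty within credit-score bands) would target conditional parity instead.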
3. Causal Debiasing
The most sophisticated approach: model the causal structure of the data, identify the causal pathways through which the protected attribute influences the outcome, and generate synthetic data that blocks discriminatory pathways while preserving legitimate ones.
If Group B has lower approval rates partly because of lower income (legitimate predictor) and partly because of direct discrimination (illegitimate), the causal approach preserves the income effect while removing the direct discrimination effect.
This requires a causal DAG — a directed acyclic graph specifying which variables cause which. Building the DAG requires domain expertise. But the result is synthetic data that's fair in a more principled sense than simple statistical parity.
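A toy structural-equation sketch shows the idea. All coefficients below are invented: group affects income (the legitimate pathway) and, separately, approval directly (the discriminatory pathway). Debiasing means regenerating outcomes with the direct edge zeroed out while keeping the income effect intact:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical structural equations (all coefficients invented):
# group -> income (legitimate) and group -> approval (direct discrimination)
group = rng.integers(0, 2, n)                    # 0 = Group A, 1 = Group B
income = 50 - 8 * group + rng.normal(0, 10, n)   # B earns less on average

def approval_prob(income, group, direct_effect):
    # Income pathway is always preserved; the direct group effect is a knob
    return np.clip(0.3 + 0.02 * (income - 40) + direct_effect * group, 0, 1)

def group_gap(p):
    return p[group == 0].mean() - p[group == 1].mean()

gap_biased = group_gap(approval_prob(income, group, -0.15))   # both pathways
gap_debiased = group_gap(approval_prob(income, group, 0.0))   # direct edge blocked
# gap_debiased reflects only the income pathway, so it is smaller but nonzero
```

Note that the debiased gap is not zero: the income pathway remains, which is exactly the distinction between causal debiasing and naive statistical parity.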
Measuring Whether It Worked
Fairness-aware generation is only useful if you can verify the output.
Demographic parity. Compare outcome rates across groups. If the real data had a 15-point approval gap and the synthetic data has a 2-point gap, the intervention worked (or at least shifted the numbers — whether the new gap is "fair" is a normative question).
Equalized odds. Compare true positive rates and false positive rates across groups. A model trained on the synthetic data should have similar error rates for both groups, conditional on the actual outcome.
Conditional fairness. Within each stratum of legitimate predictors (e.g., credit score bands), outcome distributions should be similar across groups.
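The first two checks reduce to a few array operations. A minimal sketch, assuming 0/1 predictions and a binary protected attribute (function names are illustrative):

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Difference in positive-prediction rates across two groups (0/1)."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equalized_odds_gaps(y_true, y_pred, group):
    """TPR and FPR differences across groups, conditional on the true label."""
    def rate(g, label):
        mask = (group == g) & (y_true == label)
        return y_pred[mask].mean()
    tpr_gap = abs(rate(0, 1) - rate(1, 1))
    fpr_gap = abs(rate(0, 0) - rate(1, 0))
    return tpr_gap, fpr_gap
```

Conditional fairness is the same computation applied within each stratum of the legitimate predictors (e.g., one demographic-parity gap per credit-score band).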
Utility preservation. The fairness intervention shouldn't destroy predictive power. Run TSTR (Train on Synthetic, Test on Real) across both groups. If the model trained on fair synthetic data performs significantly worse overall, the intervention was too aggressive.
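A per-group TSTR check can be sketched with scikit-learn (the function name and model choice are illustrative, not prescriptive):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def tstr_by_group(synth_X, synth_y, real_X, real_y, real_group):
    """Train on Synthetic, Test on Real, reporting AUC per protected group."""
    model = LogisticRegression(max_iter=1000).fit(synth_X, synth_y)
    scores = model.predict_proba(real_X)[:, 1]
    return {
        g: roc_auc_score(real_y[real_group == g], scores[real_group == g])
        for g in np.unique(real_group)
    }
```

A large per-group spread in the returned AUCs, or a sharp drop relative to a model trained on the original data, is the signal that the fairness intervention cost too much utility.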
Pereira et al. (2024) found something encouraging in PLOS ONE: differentially private synthetic data sometimes improves fairness metrics without any explicit fairness intervention, because the DP noise smooths out discriminatory patterns. This is a side effect, not a guarantee — but it means DP and fairness can be complementary rather than competing goals.
The Ethical Nuance
Fairness-aware generation changes the data. That's the point, but it carries responsibilities.
Document what you changed. If synthetic data is used for model training, downstream users need to know that the data was fairness-adjusted and how. A model trained on rebalanced data behaves differently from one trained on the original distribution. Both may be appropriate for different purposes.
Don't conflate statistical parity with fairness. Equal approval rates across groups isn't necessarily fair if the groups have genuinely different risk profiles. Conversely, unequal rates aren't necessarily unfair. The choice of fairness criterion is a policy decision, not a technical one. Synthetic data generation operationalizes that decision — it doesn't make it for you.
Validate on real data. Always test fair-synthetic-trained models against real held-out data, stratified by protected group. The synthetic data should produce a fairer model that still performs well. If it produces a fairer model that performs terribly, something went wrong in the generation.
When to Use Fairness-Aware Generation
Regulatory mandates. The EU AI Act (Articles 10 and 15) requires bias assessment and mitigation for high-risk AI systems. Fair synthetic training data is one mitigation pathway.
Litigation risk. Models used in lending, hiring, insurance, and criminal justice face disparate impact scrutiny. Training on demonstrably fair data strengthens your defense.
Ethical commitment. Some organizations generate fair synthetic data because it's the right thing to do, independent of legal requirements. The technical capability exists. The question is whether to use it.
References: Pereira et al. (2024), "Assessment of DP Synthetic Data for Utility and Fairness," PLOS ONE; Synthcity fairness-aware plugins documentation; EU AI Act Articles 10, 15; Barocas, Hardt & Narayanan (2019), "Fairness and Machine Learning" (fairmlbook.org); Xu et al. (2019), CTGAN conditional generation, NeurIPS.