Bayesian Networks and Copula Models: Classical Approaches to Synthetic Data That Still Deliver
Published by Entrobit · April 2026
Deep Learning Isn't Always the Answer
The synthetic data conversation in 2026 has a deep learning fixation. GANs, VAEs, diffusion models. Conference papers with bigger architectures. Industry buzz around the most computationally expensive approaches.
This creates a blind spot that costs organizations real time and money. For a large class of practical problems (datasets under 10,000 rows, low-to-moderate dimensionality, need for interpretability, constrained compute budgets) classical statistical methods consistently outperform deep learning. They train in seconds, need no GPU, produce models you can inspect, and handle differential privacy with less utility loss.
This post makes the case for reaching for classical methods first in more situations than most practitioners realize. Not as deep learning replacements. As the right default for a surprisingly large fraction of real-world synthetic data problems.
Bayesian Network Synthesis: PrivBayes
How It Works
A Bayesian Network is a directed acyclic graph (DAG) encoding the conditional independence structure of a joint distribution. Each node is a variable, each edge a direct dependency. Missing edges mean conditional independence given parent variables.
The joint distribution factorizes along the DAG:
P(X₁, ..., Xₙ) = ∏ᵢ₌₁ⁿ P(Xᵢ | Pa(Xᵢ))
Instead of estimating the full n-dimensional joint (exponentially expensive), you estimate a collection of lower-dimensional conditional distributions. The graph tells you which ones matter.
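As a toy illustration of that factorization (all probability tables below are hypothetical), a three-variable binary network with edges A → B and A, B → C replaces one 8-entry joint table with three small conditional tables:

```python
# Toy BN factorization: P(A, B, C) = P(A) * P(B | A) * P(C | A, B).
# All tables are hypothetical, purely for illustration.
p_a = {0: 0.7, 1: 0.3}                        # P(A)
p_b_given_a = {0: {0: 0.9, 1: 0.1},           # P(B | A)
               1: {0: 0.4, 1: 0.6}}
p_c_given_ab = {(0, 0): {0: 0.8, 1: 0.2},     # P(C | A, B)
                (0, 1): {0: 0.5, 1: 0.5},
                (1, 0): {0: 0.3, 1: 0.7},
                (1, 1): {0: 0.1, 1: 0.9}}

def joint(a, b, c):
    """Joint probability via the DAG factorization: three small lookups
    instead of one exponential-size joint table."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_ab[(a, b)][c]

print(joint(0, 0, 0))  # 0.7 * 0.9 * 0.8 = 0.504
```

With three binary variables the saving is trivial, but the table sizes grow linearly in the number of nodes rather than exponentially in the number of variables.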
PrivBayes Under the Hood
PrivBayes (Zhang et al., ACM TODS 2017) adapts this for differential privacy in three phases.
Structure learning. Greedily select parent sets maximizing mutual information, with Laplace noise on the mutual information scores and the Exponential Mechanism for selection. This makes the choice of which statistics to measure private.
Parameter learning. Estimate conditional probability tables from noisy marginal counts: perturb frequency counts with Laplace noise to achieve DP.
Sampling. Traverse variables in topological order, sampling each from its conditional distribution given already-assigned parent values.
The total privacy budget ε is split between structure learning and parameter learning. The right split is a tuning parameter, but the algorithm is forgiving of imperfect choices.
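The parameter-learning phase can be sketched in a few lines. This is an illustrative simplification, not the paper's exact algorithm: Laplace-perturb the contingency counts for each (parent configuration, child value) pair, then normalize rows into conditional distributions. Adding or removing one record changes exactly one count by 1, so noise with scale 1/ε gives ε-DP for this phase.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_conditional_table(counts, epsilon):
    """PrivBayes-style parameter learning (simplified sketch):
    counts[i, j] = number of records with parent config i and child value j.
    One record changes one cell by 1, so Laplace(1/epsilon) noise suffices."""
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0, None)             # negative counts are impossible
    row_sums = noisy.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0               # guard against empty rows
    return noisy / row_sums                     # each row is P(child | parent config)

counts = np.array([[80., 20.], [10., 90.]])     # hypothetical 2x2 contingency counts
cpt = dp_conditional_table(counts, epsilon=1.0)
print(cpt)  # rows sum to 1; near [[0.8, 0.2], [0.1, 0.9]] when counts dominate noise
```

Notice how the damage from noise shrinks as counts grow: with 100 records per parent configuration, Laplace noise of scale 1 barely moves the estimates, which is exactly why low-dimensional tables privatize so cheaply.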
Why It Wins on Small Data
A Bayesian Network with k-parent limit learns O(n · |domain|^k) parameters. A CTGAN or TVAE learns millions. With fewer parameters, you need less data for reliable estimates, and less noise to privatize them.
On 1,000-5,000 row datasets, PrivBayes routinely produces higher-fidelity synthetic data than CTGAN at ε values 2-5× smaller. Training time: seconds.
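To make the parameter gap concrete, here's the count for a hypothetical 20-variable categorical dataset (numbers are illustrative; the formula counts free parameters per conditional table):

```python
def bn_param_count(n_vars, domain_size, k):
    """Upper bound on free parameters in a BN where every node has at most
    k parents: each CPT has domain_size**k parent configurations, each
    needing (domain_size - 1) free probabilities."""
    return n_vars * domain_size**k * (domain_size - 1)

# Hypothetical example: 20 variables, 4 categories each, 2-parent limit.
print(bn_param_count(20, 4, 2))  # 960 parameters
# Versus the full joint table over the same variables:
print(4**20 - 1)                 # ~1.1e12 free parameters
```

A thousand rows can estimate 960 parameters reliably; no amount of noise calibration rescues a trillion-cell table at that sample size.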
The limitation is representational capacity. Limited parent sets can't capture complex, high-order interactions. On large, high-dimensional datasets with intricate dependencies, deep learning will eventually win because the data provides enough signal to estimate the larger parameter space.
Gaussian Copula Models
The Idea
A copula separates marginal distributions from dependency structure. Sklar's theorem (1959): any multivariate distribution decomposes into marginals plus a copula capturing their dependencies.
The Gaussian Copula approach in five steps:
- Estimate each variable's marginal distribution independently.
- Apply the probability integral transform: push each variable through its CDF to get uniform values on [0, 1].
- Apply the inverse standard normal CDF: get approximately normal values.
- Estimate the correlation matrix of these transformed values. This matrix parameterizes the copula.
- Generate: sample from a multivariate normal with this correlation matrix, reverse the transforms back to the original scale.
SDV's GaussianCopula synthesizer implements exactly this, with additional handling for categorical variables via reversible data transforms.
Strengths and Limits
Fast, stable, deterministic. Pairwise rank correlations are preserved well, since the correlation matrix in the transformed space is matched by construction. Mixed data types are handled through the marginal-copula decomposition.
The fundamental limitation: only linear correlation is captured in the transformed space. Non-linear dependencies, tail dependencies (variables more correlated during extreme events than normal ones), and multi-modal joints are poorly represented. For financial data with volatility clustering or clinical data with threshold effects, Gaussian Copula will miss important structure.
Despite this, it's remarkably competitive in practice. In SDV benchmarks, Gaussian Copula frequently scores within 5-10% of CTGAN quality while training 100× faster. For exploratory analysis, dashboards, and software testing, that level of fidelity is usually enough.
CART-Based Synthesis: synthpop
The synthpop package (Nowok et al., 2016) generates data using Classification and Regression Trees. Variables are synthesized sequentially, each one conditioned on previously generated variables. For each variable, CART partitions the feature space into leaves with approximately homogeneous values. Synthetic values are sampled from the empirical distribution within the selected leaf.
CART handles mixed data types natively (it splits on both continuous and categorical features). It captures non-linear relationships and interactions, to the extent a piecewise-constant function can approximate them. It's fast, interpretable, and needs minimal tuning.
The catch: variable ordering matters. Synthesizing height before weight captures the height-weight dependency well; the reverse might not. CART also doesn't naturally support DP, though extensions exist.
On datasets with strong conditional relationships, especially step functions, threshold effects, and decision boundaries, synthpop can outperform both Gaussian Copula (which can't capture non-linearity) and CTGAN (which may not converge on small samples).
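The leaf-sampling mechanism is simple enough to sketch. This is a simplified illustration in the spirit of synthpop (which is an R package; sklearn stands in here), synthesizing one variable conditioned on an already-synthesized one:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def cart_synthesize_column(X_real, y_real, X_synth, min_leaf=20):
    """One step of sequential CART synthesis: fit a tree of y on the
    already-synthesized columns X, then for each synthetic row draw a value
    from the real y's in its leaf (empirical conditional distribution,
    no functional-form assumption)."""
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf).fit(X_real, y_real)
    real_leaves = tree.apply(X_real)
    synth_leaves = tree.apply(X_synth)
    out = np.empty(len(X_synth))
    for leaf in np.unique(synth_leaves):
        donors = y_real[real_leaves == leaf]           # real values in this leaf
        idx = synth_leaves == leaf
        out[idx] = rng.choice(donors, size=idx.sum())  # sample with replacement
    return out

# Hypothetical height -> weight example (the ordering point from above).
height = rng.normal(170, 10, size=1000)
weight = 0.9 * height - 90 + rng.normal(0, 5, size=1000)

synth_height = rng.choice(height, size=1000)          # root variable: bootstrap
synth_weight = cart_synthesize_column(height.reshape(-1, 1), weight,
                                      synth_height.reshape(-1, 1))
print(np.corrcoef(synth_height, synth_weight)[0, 1])  # close to the real correlation
```

Because each synthetic weight is a real weight drawn from donors with similar heights, the height-weight dependency survives; run the same code with the variables swapped and the fidelity of the reverse conditional is what you get instead.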
When Classical Methods Win: The Empirical Picture
Drawing from SDV evaluations, Synthcity benchmarks, and Del Gobbo (2025):
Under 1,000 rows. Classical methods dominate. CTGAN and TVAE overfit or mode-collapse. PrivBayes and Gaussian Copula produce more stable, higher-fidelity output. Training: seconds vs. minutes.
1,000-10,000 rows. Classical methods remain competitive. Gaussian Copula matches or beats CTGAN on many benchmarks. PrivBayes is the clear winner for DP. TVAE starts becoming viable. Training: seconds vs. minutes.
10,000-50,000 rows. Mixed. Deep learning starts leveraging its capacity. CTGAN and TVAE pull ahead on high-dimensional data with complex dependencies. Classical methods hold on lower-dimensional data. Training: seconds vs. tens of minutes.
Over 50,000 rows. Deep learning generally outperforms, especially on high-dimensional, complex data. TabDDPM hits highest fidelity scores. Classical methods remain useful for fast iteration and baselines. Training: seconds vs. hours.
By data type: Low-dimensional categorical? MST and PrivBayes win everywhere, including under DP. Moderate-dimensional mixed? Gaussian Copula and synthpop are competitive with CTGAN. High-dimensional continuous? Deep learning pulls ahead.
By compute budget: If GPU time is expensive or unavailable, classical methods are the only option at scale. Gaussian Copula generates a million rows from a 50-column dataset in under a minute on a laptop. CTGAN needs 30-60 minutes of GPU time.
Seeing the Bayesian Network
Consider a medical dataset with five variables: Age, Gender, BMI, Blood Pressure, Diabetes Status.
A learned Bayesian Network might encode:
- Age → Blood Pressure, Age → BMI, Age → Diabetes Status
- Gender → BMI
- BMI → Blood Pressure, BMI → Diabetes Status
This DAG says: given BMI and Age, Diabetes Status is conditionally independent of Gender. That assumption reduces parameters and focuses the model on the most informative relationships.
The DAG is also inspectable. A clinician can look at it and say whether the encoded dependencies make medical sense. Try getting that from a GAN's latent space.
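A sampling pass over this DAG can be sketched in a few lines (all distributions below are made-up placeholders, not clinical estimates). Note that Diabetes is drawn from Age and BMI only — exactly the conditional independence from Gender that the graph encodes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_patient():
    """Ancestral sampling: each variable is drawn after its parents,
    so every conditional needs only its parents' values.
    All parameters are hypothetical placeholders."""
    age = rng.integers(20, 80)
    gender = rng.choice(["F", "M"])
    # Age, Gender -> BMI
    bmi = rng.normal(22 + 0.08 * age + (1.5 if gender == "M" else 0.0), 3)
    # Age, BMI -> Blood Pressure
    bp = rng.normal(100 + 0.4 * age + 0.8 * bmi, 8)
    # Age, BMI -> Diabetes Status (Gender does not appear: the CI assumption)
    p_diabetes = 1 / (1 + np.exp(-(-10 + 0.05 * age + 0.25 * bmi)))
    diabetes = rng.random() < p_diabetes
    return {"Age": age, "Gender": gender, "BMI": bmi,
            "BP": bp, "Diabetes": diabetes}

rows = [sample_patient() for _ in range(5)]
```

Every conditional here is a one-line model a domain expert can read and dispute, which is the inspectability argument in executable form.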
Match the Method to the Data
The lesson isn't that classical is better or deep learning is better. It's that the optimal choice depends on dataset size, dimensionality, data types, compute constraints, and privacy requirements.
Mature platforms include both classical and deep methods. Profile the input, recommend a synthesizer, enable systematic comparison through unified evaluation. An organization limited to CTGAN produces bad synthetic data on small datasets. One limited to Gaussian Copula produces bad synthetic data on complex, high-dimensional datasets. Consistently good results come from matching method to data, and that requires having both available.
When the data is small and the requirements are clear, classical methods don't just compete with deep learning. They win.
References: Zhang et al. (2017), PrivBayes: Private Data Release via Bayesian Networks, ACM TODS; Patki, Wedge & Veeramachaneni (2016), The Synthetic Data Vault, IEEE DSAA; Nowok, Raab & Dibben (2016), synthpop; SDV GaussianCopula documentation; Miletic et al. (2025), Synthetic data for pharmacogenetics, JAMIA Open.