HeadlinesBriefing.com

Synthetic data passes tests but hides fatal model drift

Towards Data Science

A team celebrated a synthetic data pipeline that passed every internal check. KL divergence stayed within limits, and the Train‑on‑Synthetic, Test‑on‑Real (TSTR) run returned 91% accuracy, only a couple points shy of the 93% achieved with real data. Membership‑inference risk appeared low, and the dataset received a safety certification before the model went live. The team also verified encrypted storage, bolstering confidence.
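The TSTR check described above can be sketched as follows. This is a minimal, illustrative version, not the team's actual pipeline: the toy data generator, variable names, and the use of logistic regression are all assumptions. The idea is simply to fit one model on synthetic rows, fit a baseline on real rows, and score both on the same held-out real test set.

```python
# Hypothetical Train-on-Synthetic, Test-on-Real (TSTR) comparison.
# All data here is a toy stand-in; only the evaluation pattern matters.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_table(n, noise):
    # Toy tabular data: label depends on the first two features.
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + X[:, 1] + noise * rng.normal(size=n) > 0).astype(int)
    return X, y

X_real, y_real = make_table(4000, noise=0.1)
X_synth, y_synth = make_table(4000, noise=0.4)   # degraded synthetic copy

# Hold out half of the real data as the common test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_real, y_real, test_size=0.5, random_state=0
)

def fit_score(X_train, y_train):
    model = LogisticRegression().fit(X_train, y_train)
    return accuracy_score(y_te, model.predict(X_te))

acc_tstr = fit_score(X_synth, y_synth)   # train on synthetic, test on real
acc_trtr = fit_score(X_tr, y_tr)         # train on real, test on real
print(f"TSTR={acc_tstr:.3f}  TRTR={acc_trtr:.3f}")
```

A small TSTR-versus-TRTR gap, like the 91% versus 93% in the story, is exactly the kind of aggregate result that can look reassuring while hiding segment-level failures.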

Later analysis revealed that these metrics missed a deeper flaw. Fidelity tests such as KL divergence and the Kolmogorov–Smirnov statistic compare marginal distributions but ignore feature interactions, so correlation drift went unchecked. A correlation drift score, defined as the Frobenius norm of the difference between the real and synthetic correlation matrices, signals problematic structure loss once it exceeds 0.5. Computing the score at every generation iteration lets engineers catch drift early and avoid costly retraining.
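The correlation drift score is straightforward to compute. Below is a minimal sketch: the function and variable names are illustrative, and the toy "synthetic" data deliberately matches the marginals while discarding the feature coupling, which is exactly the failure mode marginal tests miss.

```python
# Correlation drift score: Frobenius norm of the difference between the
# real and synthetic feature-correlation matrices. The 0.5 alert
# threshold follows the article; everything else here is illustrative.
import numpy as np

rng = np.random.default_rng(1)

def correlation_drift(real: np.ndarray, synth: np.ndarray) -> float:
    # Rows are samples, columns are features.
    diff = np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
    return float(np.linalg.norm(diff, ord="fro"))

# "Real" data: four features all driven by one latent factor.
base = rng.normal(size=(5000, 1))
real = np.hstack([base + 0.1 * rng.normal(size=(5000, 1)) for _ in range(4)])

# "Synthetic" data: same per-feature marginals, but the coupling is gone.
synth = rng.normal(size=(5000, 4))

score = correlation_drift(real, synth)
print(f"drift score = {score:.2f}")
assert score > 0.5  # structure loss would trip the alarm here
```

Because the score only needs two correlation matrices, it is cheap enough to run on every generator iteration rather than once at sign-off.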

The TSTR average also masked a tail-loss problem: while overall AUC stayed high, performance on the decile of highest-value transactions plunged to 67%, exposing the model's inability to handle rare scenarios. Splitting TSTR results by target deciles and enforcing a correlation-drift threshold give practitioners a clearer safety net that prevents silent failures once models reach production.
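Decile-sliced scoring can be sketched as follows. This is an assumed setup, not the article's implementation: the value column, data generator, and model are placeholders, and the point is only that computing accuracy within each decile of a business-value column surfaces weak tail segments that an aggregate score averages away.

```python
# Per-decile TSTR scoring over a value column (e.g. transaction value),
# so a weak tail segment shows up directly. Names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)

n = 10_000
value = rng.lognormal(mean=3.0, sigma=1.0, size=n)   # hypothetical value column
X = np.column_stack([rng.normal(size=n), np.log(value)])
y = (X[:, 0] + 0.3 * X[:, 1] > 1.0).astype(int)

# Stand-in for the TSTR-trained model: fit on the first half of the rows.
model = LogisticRegression().fit(X[: n // 2], y[: n // 2])
y_te, val_te = y[n // 2 :], value[n // 2 :]
pred = model.predict(X[n // 2 :])

# Assign each test row to a decile of the value column (0 = lowest value).
edges = np.quantile(val_te, np.linspace(0.1, 0.9, 9))
decile = np.digitize(val_te, edges)

per_decile = []
for d in range(10):
    mask = decile == d
    per_decile.append(accuracy_score(y_te[mask], pred[mask]))
    print(f"decile {d}: acc={per_decile[-1]:.3f}  n={int(mask.sum())}")
```

Reporting the ten numbers instead of their mean is the whole safeguard: a decile sitting far below the rest is the signal that the aggregate TSTR score was hiding a tail failure.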