Using Synthetic Data to estimate the True Error is theoretically and practically doable

Hai Hoang Thanh; Duy-Tung Nguyen; Hung The Tran; Khoat Than

arXiv:2511.00964·cs.LG·November 4, 2025

Open Access

TL;DR

This paper explores the use of high-quality synthetic data to accurately estimate a machine learning model's true error when limited labeled data is available, supported by new theoretical bounds and practical methods.

Contribution

It introduces novel generalization bounds incorporating synthetic data and proposes a method to generate optimized synthetic samples for reliable model evaluation.

Findings

01 Synthetic data can effectively estimate true model error.

02 The proposed method outperforms existing baselines in accuracy.

03 Theoretical bounds highlight the importance of generator quality.

Abstract

Accurately evaluating model performance is crucial for deploying machine learning systems in real-world applications. Traditional methods often require a sufficiently large labeled test set to ensure a reliable evaluation. However, in many contexts, collecting a large labeled dataset is costly and labor-intensive. We therefore sometimes have to evaluate a model using only a few labeled samples, which is theoretically challenging. Recent advances in generative models offer a promising alternative by enabling the synthesis of high-quality data. In this work, we systematically investigate the use of synthetic data to estimate the test error of a trained model when labeled data is limited. To this end, we develop novel generalization bounds that take synthetic data into account. Those bounds suggest novel ways to optimize synthetic samples for evaluation and theoretically reveal the…
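The abstract's remark that evaluating with only a few labeled samples is theoretically challenging can be illustrated with the standard Hoeffding inequality. The sketch below is not the paper's method and does not reproduce its bounds; it assumes i.i.d. labeled test samples and 0-1 loss, and the function names `empirical_error` and `hoeffding_half_width` are hypothetical names chosen for illustration.

```python
import math
from typing import Sequence


def empirical_error(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of misclassified samples (0-1 loss) on a labeled test set."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


def hoeffding_half_width(n: int, delta: float = 0.05) -> float:
    """Half-width of the two-sided Hoeffding interval for a [0, 1]-bounded loss.

    With probability at least 1 - delta over n i.i.d. samples,
    |empirical error - true error| <= sqrt(ln(2 / delta) / (2 * n)).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


if __name__ == "__main__":
    # The guarantee tightens only at rate 1 / sqrt(n) in the number of labels.
    for n in (20, 200, 20000):
        print(f"n={n:>6}  half-width={hoeffding_half_width(n):.3f}")
```

At 95% confidence the half-width is about 0.30 with 20 labeled samples and about 0.096 with 200, so a small labeled test set alone gives a very loose guarantee on the true error. This is the gap that additional synthetic samples are meant to help close.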

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning and Data Classification · Adversarial Robustness in Machine Learning