A Real-Calibrated Synthetic-First Data Engine

Yukang Shen

arXiv:2605.09699·eess.IV·May 12, 2026

TL;DR

This paper introduces a modular data engineering framework that systematically constructs high-quality synthetic datasets with controllable diffusion models, aiming to improve computer vision performance in data-scarce domains.

Contribution

It presents a flexible, reproducible dataset-construction pipeline that combines diffusion generation, filtering, and validation, making synthetic augmentation more reliable without introducing new generative algorithms.

Findings

01. Synthetic data improves pose estimation when used with real data as augmentation.

02. Synthetic-only training performs below real-only training.

03. The framework emphasizes reproducibility and practical deployment in real-world workflows.

Abstract

Modern computer vision systems increasingly encounter performance limitations in data-scarce domains, where collecting large-scale, high-quality labeled data is costly or impractical. While controllable diffusion models enable scalable synthetic image generation, directly applying synthetic augmentation often leads to unstable performance gains due to dataset-level quality issues and insufficient feedback mechanisms. In this work, we present a Real-Calibrated Synthetic-First Data Engine, a modular data engineering framework that combines controllable diffusion generation and multi-stage curation/filtering within a unified pipeline, with optional support for uncertainty-driven selection and human verification. Instead of introducing new generative algorithms, our approach focuses on systematic dataset construction for improving the practical reliability of synthetic augmentation in…
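
The stages named in the abstract (multi-stage curation/filtering, optional uncertainty-driven selection, optional human verification) can be illustrated with a minimal sketch. This is not the paper's implementation: the `Candidate` fields, the score ranges, the 0.5 threshold, and the function names are all hypothetical, and the diffusion generation step is assumed to have already produced the candidate pool.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Candidate:
    """A generated sample with hypothetical scores (not from the paper)."""
    sample_id: str
    quality: float      # assumed curation score in [0, 1]
    uncertainty: float  # assumed model uncertainty in [0, 1]


Filter = Callable[[Candidate], bool]


def curate(candidates: Iterable[Candidate], filters: List[Filter]) -> List[Candidate]:
    """Apply each filter stage in order, keeping only samples that pass all of them."""
    kept = list(candidates)
    for stage in filters:
        kept = [c for c in kept if stage(c)]
    return kept


def select_uncertain(candidates: List[Candidate], budget: int) -> List[Candidate]:
    """Keep the `budget` samples with the highest uncertainty."""
    ranked = sorted(candidates, key=lambda c: c.uncertainty, reverse=True)
    return ranked[:budget]


def run_engine(
    candidates: Iterable[Candidate],
    filters: List[Filter],
    budget: Optional[int] = None,
    verify: Optional[Callable[[Candidate], bool]] = None,
) -> List[Candidate]:
    """Curate, then optionally select by uncertainty and apply a human check."""
    kept = curate(candidates, filters)
    if budget is not None:
        kept = select_uncertain(kept, budget)
    if verify is not None:
        kept = [c for c in kept if verify(c)]
    return kept


# Toy pool standing in for diffusion-generated samples.
pool = [
    Candidate("a", quality=0.9, uncertainty=0.2),
    Candidate("b", quality=0.4, uncertainty=0.9),
    Candidate("c", quality=0.8, uncertainty=0.7),
    Candidate("d", quality=0.95, uncertainty=0.5),
]

# One quality filter, then keep the two most uncertain survivors.
selected = run_engine(pool, [lambda c: c.quality >= 0.5], budget=2)
print([c.sample_id for c in selected])
```

Keeping the two optional stages as arguments that default to `None` mirrors the abstract's description of them as optional; a real pipeline would replace the toy scores with outputs of actual quality and uncertainty estimators.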
