Team, Then Trim: An Assembly-Line LLM Framework for High-Quality Tabular Data Generation
Congjing Zhang, Ryan Feng Lin, Ruoxuan Bao, Shuai Huang

TL;DR
This paper introduces T$^2$, a novel framework using collaborative LLMs and a rigorous quality control pipeline to generate high-quality synthetic tabular data, addressing data scarcity and quality issues in ML applications.
Contribution
The paper presents a new assembly-line LLM framework with a three-stage QC pipeline for synthesizing superior tabular data, advancing data generation techniques.
Findings
T$^2$ outperforms existing methods in data quality metrics.
Empirical results show improved downstream model performance.
The framework effectively addresses class imbalance and bias issues.
Abstract
While tabular data is fundamental to many real-world machine learning (ML) applications, acquiring high-quality tabular data is usually labor-intensive and expensive. Limited by the scarcity of observations, tabular datasets often exhibit critical deficiencies, such as class imbalance, selection bias, and low fidelity. To address these challenges, building on recent advances in Large Language Models (LLMs), this paper introduces Team-then-Trim (T$^2$), a framework that synthesizes high-quality tabular data through a collaborative team of LLMs, followed by a rigorous three-stage plug-in data quality control (QC) pipeline. In T$^2$, tabular data generation is conceptualized as a manufacturing process: specialized LLMs, guided by domain knowledge, are tasked with generating different data components sequentially, and the resulting products, i.e., the synthetic data, are systematically…
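The assembly-line idea described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the worker functions are random stubs standing in for LLM calls, the three QC stages (sanity, objective-driven filtering, diversity enforcement) use simple hand-written rules rather than the paper's actual scoring, and none of the function names come from the paper.

```python
import random

# Hypothetical stand-ins for worker LLMs: each fills in one group of columns,
# conditioning on what earlier workers on the line already produced.
def age_worker(row, rng):
    row["age"] = rng.randint(-5, 90)  # deliberately allows invalid values
    return row

def income_worker(row, rng):
    row["income"] = 1000 * max(row["age"], 0) + rng.randint(0, 20000)
    return row

def label_worker(row, rng):
    row["label"] = int(row["income"] > 50000)
    return row

def team_generate(workers, n, rng):
    """Team: every row passes through the workers in a fixed order."""
    rows = []
    for _ in range(n):
        row = {}
        for worker in workers:
            row = worker(row, rng)
        rows.append(row)
    return rows

def sanity_check(rows):
    """QC stage 1 (assumed rule): drop rows with invalid entries."""
    return [r for r in rows if 0 <= r["age"] <= 120]

def objective_filter(rows, score_fn, threshold):
    """QC stage 2 (assumed rule): keep rows whose score reaches a threshold."""
    return [r for r in rows if score_fn(r) >= threshold]

def enforce_diversity(rows, key_fn):
    """QC stage 3 (assumed rule): keep one row per coarse bucket."""
    seen, kept = set(), []
    for r in rows:
        key = key_fn(r)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

def team_then_trim(n=200, seed=0):
    rng = random.Random(seed)
    rows = team_generate([age_worker, income_worker, label_worker], n, rng)
    rows = sanity_check(rows)
    # Placeholder score: a fixed rule where a trained model would normally sit.
    rows = objective_filter(
        rows, score_fn=lambda r: 1.0 if r["age"] >= 18 else 0.0, threshold=0.5
    )
    rows = enforce_diversity(
        rows, key_fn=lambda r: (r["age"] // 5, r["income"] // 10000, r["label"])
    )
    return rows

if __name__ == "__main__":
    kept = team_then_trim()
    print(f"kept {len(kept)} of 200 generated rows")
```

The point of the sketch is the separation of concerns: generation only produces candidates, and each trim stage can be swapped out independently, for example by replacing the placeholder `score_fn` with a classifier trained on real data.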
Peer Reviews
Decision·Submitted to ICLR 2026
- The team-then-trim structure separates generation from post-hoc quality control, providing robustness against LLM hallucination.
- The three-stage quality control pipeline (sanity, objective-driven filtering, diversity enforcement) is systematic and targets well-known challenges in synthetic data generation, including invalid entries, distributional bias, and limited incremental information.
- The use of model-based scoring and information-gain comparison to filter batches offers a principled
- The quality control pipeline assumes access to a reasonably performant base model and sufficient initial real data to bootstrap quality signals, which can limit applicability in low-data or scarce-label settings (including the simulated data-incompleteness setting in the paper).
- The method incurs non-trivial computational overhead due to repeated generation, batch scoring, and rejection loops. The generation resource trade-offs are not fully addressed.
- The reliance on a single trained classifi
- Leverages structural knowledge of the data during generation.
- Incorporates multi-level quality checks to ensure high-quality data from different points of view: sanity, utility, and diversity.
- Allows for the recovery of data subgroups missing in the original data.
- Evaluation against related work misses typical tabular generators, e.g., GReaT [1] and Tabula [2], and in particular also any other agentic LLM, e.g., [3] or diffusion-based ones, e.g., [4].
- All LLMs in the evaluation seem to be of the same type, i.e., Llama 3.3 70B Instruct, but the power of this method could also be to use more targeted LLMs for the different roles, coordinator vs worker, or for specific features. No evaluation in this direction has been done.
- Following that, the same LL
- Overall idea and analogy of assembly line workers is intuitive enough.
- The paper is clear to understand and well presented.
- [**Experiment on recent baselines**] Addition of more recent baselines, especially the ones that explored the usage of LLMs for tabular generation [1, 2, 3] will strengthen the paper. Moreover, ‘team-then-trim’ has some similarities with [1] in terms of using specialized model components per column/subset of columns (MoEs for [1], worker LLMs here), so it is also important to compare and contrast the pros and cons in related works.
- [**Experiment on model sizes**] Varying model sizes will be
Taxonomy
Topics: Data Quality and Management · Imbalanced Data Classification Techniques · Machine Learning and Data Classification
