Few-shot LLM Synthetic Data with Distribution Matching
Jiyuan Ren, Zhaocheng Du, Zhihao Wen, Qinglin Jia, Sunhao Dai, Chuhan Wu, Zhenhua Dong

TL;DR
This paper introduces SynAlign, a framework for generating and filtering synthetic data for LLMs that matches real data distributions using attribute matching and diversity exploration, improving downstream task performance.
Contribution
SynAlign combines uncertainty-based data selection and attribute reasoning with distribution matching to produce high-quality synthetic data that enhances model performance.
Findings
Significant performance improvements on multiple text prediction tasks.
Effective distribution matching with Maximum Mean Discrepancy.
Successful online A/B test on an online retriever.
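To make the distribution-matching finding concrete, the following is a minimal sketch of a Maximum Mean Discrepancy (MMD) estimate between two sample sets. It is not taken from the paper: the RBF kernel, the fixed `bandwidth`, the biased estimator, and the use of generic feature vectors as stand-ins for real and synthetic text representations are all assumptions made for illustration, and `rbf_kernel` and `mmd_squared` are hypothetical helper names.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Illustrative data only: random vectors standing in for text features.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
synthetic_close = rng.normal(size=(200, 4))          # same distribution as real
synthetic_far = rng.normal(loc=2.0, size=(200, 4))   # shifted distribution

print("MMD^2, matched samples:", mmd_squared(real, synthetic_close))
print("MMD^2, shifted samples:", mmd_squared(real, synthetic_far))
```

A smaller value means the two sample sets are harder to tell apart under the chosen kernel, which is why a filter could keep synthetic samples that lower this quantity. In practice the bandwidth strongly affects the estimate and is often set from the data, for example with a median-distance heuristic.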
Abstract
As large language models (LLMs) advance, their ability to perform in-context learning and few-shot language generation has improved significantly. This has spurred the use of LLMs to produce high-quality synthetic data that enhances the performance of smaller models such as online retrievers or weak LLMs. However, LLM-generated synthetic data often differs from real data in key language attributes (e.g., style, tone, and content proportions). As a result, mixing this synthetic data directly with real data may distort the original data distribution, potentially hindering performance improvements. To solve this, we introduce SynAlign: a synthetic data generation and filtering framework based on key attribute distribution matching. Before generation, SynAlign employs an uncertainty tracker, with a Gaussian Process model as its surrogate, to iteratively select data clusters distinct from selected…
Taxonomy
Topics: Statistical Methods and Inference · Image Processing and 3D Reconstruction · Hydrology and Watershed Management Studies
Methods: Gaussian Process
