BARE: Leveraging Base Language Models for Few-Shot Synthetic Data Generation

Alan Zhu; Parth Asawa; Jared Quincy Davis; Lingjiao Chen; Boris Hanin; Ion Stoica; Joseph E. Gonzalez; Matei Zaharia

arXiv:2502.01697 · cs.CL · May 22, 2025

TL;DR

This paper introduces BARE, a two-stage method that uses base language models for few-shot synthetic data generation. It improves the diversity and quality of datasets generated from only a few seed examples, which in turn improves downstream task performance.

Contribution

The paper proposes BARE, a novel approach combining base models and instruction-tuned models for high-quality, diverse synthetic data generation from minimal seed examples.
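The generate-then-refine idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Generator` callable type, the prompt wording, and the function names `build_few_shot_prompt`, `build_refine_prompt`, and `two_stage_generate` are all hypothetical, and any real model client would have to be wrapped to fit the `str -> str` interface.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a completion.
Generator = Callable[[str], str]


def build_few_shot_prompt(seeds: List[str]) -> str:
    """Lay out the seed examples as a plain continuation prompt for a base model."""
    blocks = [f"Example {i}:\n{seed}\n" for i, seed in enumerate(seeds, start=1)]
    # Leave the next example open so the base model continues the pattern.
    blocks.append(f"Example {len(seeds) + 1}:\n")
    return "\n".join(blocks)


def build_refine_prompt(draft: str) -> str:
    """Ask an instruction-tuned model to clean up a single draft (assumed wording)."""
    return (
        "Improve the quality of the following example while keeping its content "
        "and topic. Return only the improved example.\n\n" + draft
    )


def two_stage_generate(
    seeds: List[str],
    base_model: Generator,
    instruct_model: Generator,
    n: int,
) -> List[str]:
    """Stage 1: sample drafts from the base model. Stage 2: refine each draft."""
    prompt = build_few_shot_prompt(seeds)
    dataset: List[str] = []
    for _ in range(n):
        draft = base_model(prompt).strip()
        if not draft:
            continue  # skip empty generations rather than refining nothing
        dataset.append(instruct_model(build_refine_prompt(draft)).strip())
    return dataset
```

The two models are passed in as plain callables so the sketch stays independent of any particular inference library; in this layout the base model is the only source of variety across samples, and the instruction-tuned model only edits what it is given.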

Findings

1. BARE generates diverse, high-quality datasets from just 3 seed examples.
2. Fine-tuning Llama 3.1 8B with BARE data matches state-of-the-art performance.
3. BARE improves downstream task performance significantly over existing methods.

Abstract

As the demand for high-quality data in model training grows, researchers and developers are increasingly generating synthetic data to tune and train LLMs. However, current data generation methods rely on seed sets containing tens of thousands of examples to prompt instruction-tuned models. This reliance can be especially problematic when the curation of high-quality examples is expensive or difficult. In this paper we explore the novel few-shot synthetic data generation setting -- generating a high-quality dataset from a few examples. We show that when working with only a few seed examples, instruction-tuned models used in current synthetic data methods produce insufficient diversity for downstream tasks. In contrast, we show that base models without post-training, largely untapped for synthetic data generation, offer substantially greater output diversity, albeit with lower instruction…
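To make the notion of output diversity concrete, the sketch below computes a distinct-n ratio over a set of generated texts. This is an illustrative stand-in only: the excerpt above does not say which diversity measure the paper uses, and the function name `distinct_n` and the whitespace tokenization are assumptions made for this example.

```python
from typing import List


def distinct_n(texts: List[str], n: int = 2) -> float:
    """Ratio of unique word n-grams to total word n-grams across all texts.

    Values near 1.0 mean little repetition across the set; values near 0.0
    mean the generations reuse the same phrases.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()  # assumed tokenization: whitespace only
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

A set of near-identical generations scores low on this ratio, while a set with varied wording scores high, which is the kind of gap the abstract describes between instruction-tuned and base models.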

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper is well motivated and well written. The proposed approach is quite simple and easy to implement. The downstream evaluation with synthetic data shows positive results compared to the baselines. In particular, the results on the LCB leaderboard with BARE instructions are quite promising.

Weaknesses

**Missing key related work.** Many of the findings and techniques in this paper have been introduced in prior work [a, b]. However, the paper does not discuss any of these works, either in the related work or the experiments. URIAL [a] uses three prompts to enable instruction-following abilities in base models. BARE also uses a similar three-prompt strategy to generate synthetic data with the base model. The authors should highlight the differences between the two methods. Furthermore, ALMA [b

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- This paper explores an important and interesting research direction.
- The evaluation includes extensive in-depth study and discussion of the quality of synthetic data.

Weaknesses

- One major concern is the scalability of the method. While the evaluation covers different model series (i.e., Llama-3.1 and Qwen3) and benchmarks, the models used for training are limited to the 8B scale and the training set includes only 1000 samples. More experiments on larger models with much larger training sets are needed to study whether the performance gain from BARE scales consistently.
- The relationship between downstream accuracy and indistinguishability rate is unclear. S

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The observation that base models retain diversity while instruct models lose it due to post-training is well-motivated and empirically verified.
2. BARE is a simple yet effective way to combine strengths of base and instruct models, addressing the overlooked potential of base models in data generation.

Weaknesses

1. While combining base and instruct models is insightful, the method itself (generate + refine) is conceptually simple and resembles earlier “draft–refine” pipelines. The main novelty lies in the empirical insight rather than algorithmic design.
2. The success of BARE heavily depends on the choice and strength of the refiner model (e.g., GPT-4o vs. Llama-Instruct), which limits reproducibility for weaker setups.

Code & Models

Datasets

DataPilot/Zero_SFT_Ja_v3.5
dataset · 92 dl

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Softmax · Attention Dropout · WordPiece · Linear Layer · Residual Connection · Byte Pair Encoding · Weight Decay