EmbGen: Teaching with Reassembled Corpora
Arun K Lenin, Kai Rouse, Andrea Nicastro, Anna Leontjeva

TL;DR
EmbGen is a novel synthetic data generation pipeline that improves instruction-tuned models by capturing complex domain dependencies through entity-based reassembly and targeted question-answer generation.
Contribution
It introduces a new method for decomposing and reassembling corpora to generate more diverse and structured training data for domain adaptation.
Findings
EmbGen outperforms the baselines in Binary Accuracy on heterogeneous datasets.
EmbGen achieves up to an 88.9% improvement at 20M tokens.
EmbGen maintains competitiveness on less heterogeneous datasets.
Abstract
Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which is expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab, and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token…
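The three sampling modes named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the nearest-neighbour reading of "proximity" sampling, and the toy embeddings and cluster labels are all assumptions made for illustration. Cluster labels are taken as given here; in practice they would come from whatever clustering is run over the entity embeddings.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def proximity_pairs(embeddings, k=1):
    """Assumed reading of 'proximity' sampling: pair each entity with its
    k most similar neighbours, regardless of cluster membership."""
    pairs = set()
    for i, u in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: cosine(u, embeddings[j]), reverse=True)
        for j in others[:k]:
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


def intra_cluster_pairs(labels, n, rng):
    """Sample up to n entity pairs whose members share a cluster label."""
    same = [(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]]
    return rng.sample(same, min(n, len(same)))


def inter_cluster_pairs(labels, n, rng):
    """Sample up to n entity pairs whose members lie in different clusters."""
    diff = [(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] != labels[j]]
    return rng.sample(diff, min(n, len(diff)))


# Toy data (hypothetical): four entities in a 2-D embedding space,
# with cluster labels assumed to come from a prior clustering step.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
rng = random.Random(0)

near = proximity_pairs(embeddings, k=1)
intra = intra_cluster_pairs(labels, 2, rng)
inter = inter_cluster_pairs(labels, 2, rng)
```

Each sampled pair would then be handed, together with the corresponding entity descriptions, to a teacher LLM to produce a QA pair. Inter-cluster pairs are the ones most likely to surface cross-document dependencies, since their members come from semantically distant parts of the corpus.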
