SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection
Kexian Tang, Jiani Wang, Shaowen Wang, Kaifeng Lyu

TL;DR
SPA demonstrates that simple, carefully designed prompt-based synthetic data generation can effectively enhance knowledge in large language models, outperforming more complex methods in specialized domains.
Contribution
The paper introduces SPA, a straightforward prompt-based augmentation method that serves as a strong baseline for knowledge injection in LLMs, highlighting its effectiveness over complex approaches.
Findings
SPA outperforms several strong baselines in knowledge injection tasks.
RL-based methods may improve token efficiency at small scale but suffer from diversity collapse as data scales, leading to diminishing returns.
Prompt tuning can diminish the advantages of multi-stage prompting.
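To make "diversity collapse" concrete, here is a minimal sketch of one common diversity proxy, the distinct-n ratio (unique n-grams divided by total n-grams). This summary does not say how the paper measures diversity, so the metric, the function name `distinct_n`, and the toy sentences below are all illustrative assumptions rather than the authors' evaluation.

```python
def distinct_n(texts, n=2):
    """Fraction of n-grams across all texts that are unique (1.0 = no repetition).

    Illustrative proxy only; not necessarily the measure used in the paper.
    """
    ngrams = []
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# Toy data: a varied sample set versus one where the generator repeats itself.
varied = ["the cat sat down", "a dog ran away", "birds fly south early"]
collapsed = ["the cat sat down", "the cat sat down", "the cat sat down"]

print(distinct_n(varied))     # every bigram is unique
print(distinct_n(collapsed))  # only a third of the bigrams are unique
```

A ratio that keeps falling as more samples are generated would be one sign that additional synthetic data is adding little new content.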
Abstract
While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages…
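The core idea in the abstract, a small fixed set of prompts applied repeatedly to a source corpus, can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the three templates are hypothetical (the actual SPA prompts are not given here), and `fake_generate` is a stand-in for whatever LLM call would be used in practice.

```python
from itertools import product
from typing import Callable, Iterator, List

# Hypothetical prompt templates; the real SPA prompts are not listed in this summary.
PROMPT_TEMPLATES = [
    "Rewrite the following passage in your own words:\n\n{document}",
    "Write question-answer pairs covering the facts in this passage:\n\n{document}",
    "Explain the following passage to a newcomer to the field:\n\n{document}",
]


def build_requests(documents: List[str], templates: List[str],
                   samples_per_prompt: int) -> Iterator[str]:
    """Yield one filled-in prompt per (document, template, sample) combination."""
    for doc, template in product(documents, templates):
        for _ in range(samples_per_prompt):
            yield template.format(document=doc)


def augment(documents: List[str], generate: Callable[[str], str],
            templates: List[str] = PROMPT_TEMPLATES,
            samples_per_prompt: int = 2) -> List[str]:
    """Run the generator on every prompt and collect the synthetic texts."""
    return [generate(p) for p in build_requests(documents, templates, samples_per_prompt)]


def fake_generate(prompt: str) -> str:
    # Placeholder for an LLM call (assumption); returns a dummy string.
    return f"[synthetic text for a prompt of {len(prompt)} characters]"


corpus = ["Doc A text.", "Doc B text."]
synthetic = augment(corpus, fake_generate)
print(len(synthetic))  # 2 documents x 3 templates x 2 samples = 12
```

Scaling the amount of synthetic data in this sketch only requires raising `samples_per_prompt` or adding documents; the prompt set itself stays small and fixed.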
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Natural Language Processing Techniques
