Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration
Qintong Li, Jiahui Gao, Sheng Wang, Renjie Pi, Xueliang Zhao, Chuan Wu, Xin Jiang, Zhenguo Li, Lingpeng Kong

TL;DR
This paper introduces ReverseGen, a novel method that automatically generates failure-inducing queries to create diverse, effective training data, significantly improving LLM performance by exposing and addressing their weaknesses.
Contribution
ReverseGen is a new approach that trains a proposer to generate queries leading to model failures, enabling automatic, targeted data synthesis without manual templates.
Findings
ReverseGen-generated data outperforms human-annotated data in training effectiveness.
The method improves safety, honesty, and math capabilities of LLMs.
Applicable to models of various scales (3B, 7B, 8B).
Abstract
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data, leading to impressive performance across a range of downstream applications. Current methods often rely on human-annotated data or predefined task templates to direct powerful LLMs in synthesizing task-relevant data for effective model training. However, this dependence on manually designed components may constrain the scope of generated data, potentially overlooking critical edge cases or novel scenarios that could challenge the model. In this paper, we present a novel approach, ReverseGen, designed to automatically generate effective training samples that expose the weaknesses of LLMs. Specifically, we introduce a dedicated proposer trained to produce queries that lead target models to generate unsatisfactory responses. These failure-inducing queries are then used to…
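The core loop described in the abstract (propose queries, run them through the target model, keep the ones that yield unsatisfactory responses) can be illustrated with a minimal Python sketch. Everything here is hypothetical and not taken from the paper: the function name collect_failures, the toy target model, and the toy judge are stand-ins, and the sketch does not cover how the proposer itself is trained.

```python
from typing import Callable, List, Tuple


def collect_failures(
    queries: List[str],
    target: Callable[[str], str],
    is_satisfactory: Callable[[str, str], bool],
) -> Tuple[List[str], List[str]]:
    """Split candidate queries into failure-inducing and non-failing ones."""
    failing: List[str] = []
    passing: List[str] = []
    for query in queries:
        response = target(query)
        if is_satisfactory(query, response):
            passing.append(query)
        else:
            failing.append(query)
    return failing, passing


# Hypothetical stand-in for a target model: it always adds,
# so it answers multiplication queries incorrectly.
def toy_target(query: str) -> str:
    a, _op, b = query.split()
    return str(int(a) + int(b))


# Hypothetical stand-in for a judge: compares against the true result.
def toy_judge(query: str, response: str) -> bool:
    a, op, b = query.split()
    expected = int(a) + int(b) if op == "+" else int(a) * int(b)
    return response == str(expected)


candidates = ["2 + 3", "2 * 3", "4 * 5", "1 + 1"]
failing, passing = collect_failures(candidates, toy_target, toy_judge)
print(failing)  # ['2 * 3', '4 * 5']
print(passing)  # ['2 + 3', '1 + 1']
```

In a real setting the target and judge would be model calls, and the failing/passing split is the kind of signal that could be fed back to a proposer; that feedback step is an assumption here, not something the summary above specifies.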
Taxonomy
Topics: Scientific Computing and Data Management · Advanced Data Storage Technologies · Data Quality and Management
