ReGenesis: LLMs can Grow into Reasoning Generalists via Self-Improvement
Xiangyu Peng, Congying Xia, Xinyi Yang, Caiming Xiong, Chien-Sheng Wu, Chen Xing

TL;DR
ReGenesis enables LLMs to improve their reasoning abilities through self-synthesized reasoning paths, progressing from abstract to concrete, without requiring human supervision, thereby enhancing performance on diverse reasoning tasks.
Contribution
The paper introduces ReGenesis, a novel method for LLMs to self-improve their reasoning by generating task-agnostic reasoning paths without human-designed examples.
Findings
ReGenesis outperforms existing methods on all tested in-domain and OOD tasks.
ReGenesis achieves a 6.1% performance increase on six OOD tasks, where existing self-synthesizing methods show performance drops.
The framework is effective across various LLMs and design choices.
Abstract
Post-training Large Language Models (LLMs) with explicit reasoning trajectories can enhance their reasoning abilities. However, acquiring such high-quality trajectory data typically demands meticulous supervision from humans or superior models, which can be either expensive or license-constrained. In this paper, we explore how far an LLM can improve its reasoning by self-synthesizing reasoning paths as training data without any additional supervision. Existing self-synthesizing methods, such as STaR, suffer from poor generalization to out-of-domain (OOD) reasoning tasks. We hypothesize that this is because their self-synthesized reasoning paths are too task-specific and lack general task-agnostic reasoning guidance. To address this, we propose Reasoning Generalist via Self-Improvement (ReGenesis), a method to self-synthesize reasoning paths as post-training data by progressing from…
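The abstract-to-concrete idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the seed guidelines, the prompt wording, the function name `synthesize_paths`, and the `generate` and `extract_answer` callables are all hypothetical placeholders. The answer-matching filter is an assumption borrowed from STaR-style self-training, since this summary does not say how paths are selected.

```python
from typing import Callable, List

# Hypothetical task-agnostic seed guidelines; the real ones are not given here.
GENERAL_GUIDELINES = [
    "Break the problem into smaller sub-problems.",
    "Work backwards from the desired answer.",
]


def synthesize_paths(
    question: str,
    answer: str,
    generate: Callable[[str], str],        # assumed wrapper around the LLM itself
    extract_answer: Callable[[str], str],  # assumed final-answer parser
) -> List[str]:
    """Return self-synthesized reasoning paths that reach the known answer."""
    kept = []
    for guideline in GENERAL_GUIDELINES:
        # Abstract step: adapt a general guideline to this particular task.
        adapted = generate(
            "Adapt this general guideline to the task.\n"
            f"Guideline: {guideline}\nTask: {question}"
        )
        # Concrete step: expand the adapted guideline into a full reasoning path.
        path = generate(
            "Follow the guideline to solve the task step by step.\n"
            f"Guideline: {adapted}\nTask: {question}"
        )
        # Assumed filter: keep only paths whose final answer matches the label.
        if extract_answer(path) == answer:
            kept.append(path)
    return kept
```

Under these assumptions, the kept paths would be paired with their questions to form post-training data for the same model that generated them.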
Peer Reviews
Decision·ICLR 2025 Oral
- The problem of building a reasoning generalist is important, compared to other works focusing on dataset-specific improvements in LLMs.
- Comprehensive experiments highlight ReGenesis’s superior performance in both in-domain and out-of-domain tasks.
- The major concern is that there is a mismatch between the motivation and the proposed method. The paper pointed out that the bottleneck of self-synthesizing methods is that they suffer from poor generalization to out-of-domain (OOD), and hypothesized that it is because "their self-synthesized reasoning paths are too task-specific, lacking general task-agnostic reasoning guidance". However, even though they proposed a multi-level prompting method starting from general guidelines, in the end, th
1. The work provides a fresh insight into a key limitation in reasoning synthesis for LLMs, observing that existing self-synthesized reasoning methods often experience substantial performance drops in out-of-domain (OOD) settings.
2. The authors propose a novel abstract-to-concrete synthesis route, progressively transitioning from general reasoning guidelines to task-specific reasoning paths.
3. This paper presents comprehensive experiments, including extensive comparisons with competitive baseli
1. Although the paper includes a comprehensive ablation study (Tables 4 and 5) to examine the components of the ReGenesis framework, the fundamental reasons behind the performance improvements remain ambiguous. The framework is designed to construct reasoning solutions that incorporate *general task-agnostic reasoning guidance*. However, it is unclear how the ReGenesis-generated reasoning chains retain this generalizability across tasks. According to Table 25, the generated reasoning chains appe
ReGenesis introduces a structured approach for LLM self-improvement by creating task-agnostic, generalizable reasoning paths, which is a substantial departure from the task-specific paths in prior methods. ReGenesis shows superior performance in OOD tasks, addressing a major limitation of existing self-synthesizing methods. The model's flexibility across multiple reasoning domains suggests broader applicability. The authors conducted thorough evaluations across various datasets, including math
1. I believe the paper presents a valuable approach by establishing task-agnostic general reasoning guidelines. However, I am curious whether employing a more advanced model, such as GPT-4, to formulate these guidelines might yield better results than having them generated by the 7B model itself.
2. I think many benchmarks in Table 3 might not fully qualify as out-of-domain (OOD) tasks. Previous work has often used GSM8K’s training set as demonstrations for datasets like ASDIV and SVAMP, which
Taxonomy
Topics: Natural Language Processing Techniques
