Progress or Regress? Self-Improvement Reversal in Post-training
Ting Wu, Xuefeng Li, Pengfei Liu

TL;DR
This paper critically examines post-training self-improvement methods for LLMs. It shows that apparent performance gains can mask regressions in broader capabilities such as output diversity and out-of-distribution (OOD) generalization, which motivates evaluation metrics that go beyond pass@1.
Contribution
It introduces a comprehensive evaluative framework and uncovers the phenomenon of self-improvement reversal, challenging assumptions about progress in post-training methods.
Findings
Models show improved benchmark performance but regress in output diversity and out-of-distribution (OOD) generalization.
Current post-training practices may not enhance models' ability to handle complex, real-world problems.
Evaluation metrics beyond pass@1 are necessary to assess whether self-improvement reflects true progress.
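To make the diversity finding concrete, here is a minimal Python sketch of one simple lexical diversity measure, distinct-n (unique n-grams divided by total n-grams across a set of sampled solutions). This is an assumed stand-in chosen for illustration, not necessarily one of the diversity metrics the paper uses; the function name `distinct_n` and the whitespace tokenization are hypothetical choices.

```python
from typing import Iterable


def distinct_n(samples: Iterable[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams over all samples.

    Illustrative only: tokens are split on whitespace, which is an
    assumption and not the paper's tokenization.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    total = 0
    unique = set()
    for text in samples:
        tokens = text.split()
        # Collect every contiguous window of n tokens in this sample.
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    # An empty pool (or samples shorter than n) has no n-grams to score.
    return len(unique) / total if total else 0.0
```

A lower value for solutions sampled after more post-training iterations than for solutions sampled before would be one sign of the regression described above, though a lexical ratio like this says nothing about whether the solutions are correct.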
Abstract
Self-improvement through post-training methods such as iterative preference learning has been acclaimed for enhancing the problem-solving capabilities (e.g., mathematical reasoning) of Large Language Models (LLMs) without human intervention. However, as exploration deepens, it becomes crucial to assess whether these improvements genuinely signify progress in solving more challenging problems or if they could lead to unintended regressions. To address this, we propose a comprehensive evaluative framework that goes beyond the superficial pass@1 metric to scrutinize the underlying enhancements of post-training paradigms for self-improvement. Through rigorous experimentation and analysis across diverse problem-solving tasks, the empirical results point out the phenomenon of self-improvement reversal, where models showing improved performance across benchmarks will paradoxically…
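As an illustration of why pass@1 alone can mislead, the sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, and compares two hypothetical models. The per-problem counts are toy numbers invented for this example, not results from the paper, and the helper names `pass_at_k` and `mean_pass_at_k` are assumptions.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples with c correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over problems, one correct-count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Toy counts (hypothetical): correct samples out of 10 for four problems.
model_a = [10, 10, 0, 0]  # always right on two problems, never on the others
model_b = [4, 4, 4, 4]    # sometimes right on every problem

a_pass1 = mean_pass_at_k(model_a, n=10, k=1)    # 0.5
b_pass1 = mean_pass_at_k(model_b, n=10, k=1)    # about 0.4
a_pass10 = mean_pass_at_k(model_a, n=10, k=10)  # 0.5
b_pass10 = mean_pass_at_k(model_b, n=10, k=10)  # 1.0
```

In this toy setup, model A wins on pass@1 while model B solves every problem at least once within 10 samples, which is the kind of gap a pass@1-only comparison would hide.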
Peer Reviews
Decision: ICLR 2025 Poster
Good idea and experiments. It's an important and timely topic. The observation that some metrics like diversity are sometimes unmeasured with these methods is very important.
I think the writing can be improved somewhat, even if the core ideas are clear.
1. The experiments are well-designed and provide a comprehensive understanding of the internal mechanisms of self-improvement.
2. This paper finds that the pass@1 accuracy metric may lead to a wrong judgment about model performance.
3. The experimental results in Figure 2 are insightful. The relationship between correct answer coverage and iterative post-training methods has not been discussed before.
1. Some of the conclusions in this paper are as expected, e.g., the results in Table 3 are related to [1].
[1] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
- Well-motivated and timely. Review of related work is thorough.
- Iterative self-improvement post-training algorithm and most of the analyses seem general enough to be applied to arbitrary domains with minimal changes.
- Interesting reversal results with diversity and OOD generalization decreasing during post-training. Used a variety of diversity metrics.
- I'm not sure I understand Improvement Sets (Section 5.1). Perhaps I'm missing something here, but in my understanding of Figure 3, it seems somewhat trivial that accuracy@N increases with greater N.
- I'm also not sure I understand the point of Group Disparity (Section 5.3). Why not just compare level 1 accuracy vs. level 5 accuracy directly, if level 1 accuracy is increasing and level 5 is decreasing?
- Correct answer coverage seems potentially at odds with solution diversity, so I'm not exac
Taxonomy
Topics: Human Resource Development and Performance Evaluation
