Progress or Regress? Self-Improvement Reversal in Post-training

Ting Wu; Xuefeng Li; Pengfei Liu

arXiv:2407.05013·cs.CL·July 9, 2024

Open Access 3 Reviews

TL;DR

This paper critically examines post-training self-improvement methods for LLMs, revealing that apparent performance gains can mask regressions in broader capabilities like diversity and OOD generalization, highlighting the need for better evaluation metrics.

Contribution

The paper introduces a comprehensive evaluative framework and uncovers the phenomenon of self-improvement reversal, challenging assumptions about progress in post-training methods.

Findings

01

Models show improved benchmark performance but regress in diversity and OOD tasks.

02

Current post-training practices may not enhance models' ability to handle complex, real-world problems.

03

Critical evaluation metrics are necessary to accurately assess true progress in self-improvement.
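As an illustration of what a diversity measurement over sampled solutions can look like, here is a minimal sketch of distinct-n, a commonly used ratio of unique n-grams to total n-grams. This summary does not say which diversity metrics the paper uses, so the choice of distinct-n, the whitespace tokenization, and the function name `distinct_n` are all assumptions made for illustration.

```python
from typing import Iterable


def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across a set of texts.

    Hypothetical helper: whitespace tokenization is an assumption,
    not something taken from the paper.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        # Build n-grams by zipping n shifted views of the token list.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    # No n-grams at all (empty input or texts shorter than n tokens).
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    # Toy samples invented for illustration, not model outputs from the paper.
    varied = ["add the two sides", "subtract x from both"]
    repeated = ["add the two sides", "add the two sides"]
    print(distinct_n(varied), distinct_n(repeated))
```

A lower value for a model's sampled solutions after post-training than before would be one possible sign of the diversity regression described in the findings above.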

Abstract

Self-improvement through post-training methods such as iterative preference learning has been acclaimed for enhancing the problem-solving capabilities (e.g., mathematical reasoning) of Large Language Models (LLMs) without human intervention. However, as exploration deepens, it becomes crucial to assess whether these improvements genuinely signify progress in solving more challenging problems or if they could lead to unintended regressions. To address this, we propose a comprehensive evaluative framework that goes beyond the superficial pass@1 metric to scrutinize the underlying enhancements of post-training paradigms for self-improvement. Through rigorous experimentation and analysis across diverse problem-solving tasks, the empirical results point out the phenomenon of self-improvement reversal, where models showing improved performance across benchmarks will paradoxically…
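To make the contrast between pass@1 and broader coverage concrete, the sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. Whether the paper computes its metrics exactly this way is an assumption; the helper names and the toy counts are invented for illustration and are not results from the paper.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over problems (hypothetical aggregation helper)."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Toy counts of correct samples out of n=10 per problem (invented).
    before = [2, 2, 2, 2]    # occasionally right on every problem
    after = [10, 10, 0, 0]   # always right on two, never on the others
    for name, counts in (("before", before), ("after", after)):
        print(name,
              round(mean_pass_at_k(counts, 10, 1), 3),
              round(mean_pass_at_k(counts, 10, 10), 3))
```

In this toy setup the "after" model has the higher pass@1 but the lower pass@10, which is the kind of pattern a pass@1-only evaluation would not surface.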

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Good idea and experiments. It's an important and timely topic. The observation that some metrics like diversity are sometimes unmeasured with these methods is very important.

Weaknesses

I think the writing can be improved somewhat, even if the core ideas are clear.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. The experiments are well-designed and provide a comprehensive understanding of the internal mechanisms of self-improvement.
2. This paper finds that the pass@1 accuracy metric may lead to a wrong judgment about model performance.
3. The experimental results in Figure 2 are insightful. The relationship between correct answer coverage and iterative post-training methods has not been discussed before.

Weaknesses

1. Some of the conclusions in this paper are as expected, e.g., the results from Table 3 are related to [1].

[1] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Well-motivated and timely. Review of related work is thorough.
- Iterative self-improvement post-training algorithm and most of the analyses seem general enough to be applied to arbitrary domains with minimal changes.
- Interesting reversal results with diversity and OOD generalization decreasing during post-training. Used a variety of diversity metrics.

Weaknesses

- I'm not sure I understand Improvement Sets (Section 5.1). Perhaps I'm missing something here, but in my understanding of Figure 3, it seems somewhat trivial that accuracy@N increases with greater N.
- I'm also not sure I understand the point of Group Disparity (Section 5.3). Why not just compare level 1 accuracy vs. level 5 accuracy directly, if level 1 accuracy is increasing and level 5 is decreasing?
- Correct answer coverage seems potentially at odds with solution diversity, so I'm not exac

Taxonomy

Topics · Human Resource Development and Performance Evaluation