Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Zhiheng Xi; Dingwen Yang; Jixuan Huang; Jiafu Tang; Guanyu Li; Yiwen Ding; Wei He; Boyang Hong; Shihan Do; Wenyu Zhan; Xiao Wang; Rui Zheng; Tao Ji; Xiaowei Shi; Yitao Zhai; Rongxiang Weng; Jingang Wang; Xunliang Cai; Tao Gui; Zuxuan Wu; Qi Zhang; Xipeng Qiu; Xuanjing Huang; Yu-Gang Jiang

arXiv:2411.16579 · cs.CL · November 26, 2024

TL;DR

This paper introduces a critique-based framework for improving large language models' reasoning by using step-level feedback during training and testing, leading to enhanced performance on complex tasks.

Contribution

It presents AutoMathCritique, an automated and scalable framework for collecting step-level critique data, and proposes a critique-in-the-loop self-training method to boost the reasoning capabilities of LLMs.

Findings

01 Critique models improve reasoning on difficult queries.

02 Scaling inference-time computation enhances performance.

03 Critique supervision boosts exploration and solution diversity.

Abstract

Training large language models (LLMs) to spend more time thinking and reflecting before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mathematics. However, the effectiveness of mechanisms like self-reflection and self-correction depends on the model's capacity to accurately assess its own performance, which can be limited by factors such as initial accuracy, question difficulty, and the lack of external feedback. In this paper, we delve into a two-player paradigm that separates the roles of reasoning and critique models, where the critique model provides step-level feedback to supervise the reasoning (actor) model at both test time and training time. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data, resulting in a dataset of $76,321$ responses paired with step-level…
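The two-player paradigm described in the abstract can be illustrated with a minimal sketch of a test-time refinement loop. This is not the authors' implementation: the names `StepCritique`, `refine_with_critique`, `toy_actor`, and `toy_critic` are hypothetical, and both "models" are hard-coded stand-ins for real LLM calls. The sketch only shows the control flow in which a critic labels each reasoning step and the actor revises its answer using that feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepCritique:
    """Hypothetical container for step-level feedback from the critic."""
    step_index: int
    is_correct: bool
    comment: str


# Assumed interfaces: the actor maps (question, optional feedback) to a list
# of reasoning steps; the critic maps (question, steps) to one critique per step.
Actor = Callable[[str, Optional[List[StepCritique]]], List[str]]
Critic = Callable[[str, List[str]], List[StepCritique]]


def refine_with_critique(question: str, actor: Actor, critic: Critic,
                         max_rounds: int = 3) -> List[str]:
    """Let the actor revise its steps until the critic accepts all of them."""
    steps = actor(question, None)
    for _ in range(max_rounds):
        critiques = critic(question, steps)
        if all(c.is_correct for c in critiques):
            break  # every step was judged correct; stop refining
        steps = actor(question, critiques)
    return steps


def toy_actor(question: str,
              feedback: Optional[List[StepCritique]]) -> List[str]:
    """Stand-in actor: wrong on the first try, corrected once it gets feedback."""
    if feedback is None:
        return ["2 + 3 = 5", "5 * 4 = 20"]
    return ["3 * 4 = 12", "2 + 12 = 14"]


def toy_critic(question: str, steps: List[str]) -> List[StepCritique]:
    """Stand-in critic: checks each step against a fixed set of valid steps."""
    valid = {"3 * 4 = 12", "2 + 12 = 14"}
    return [
        StepCritique(i, s in valid,
                     "ok" if s in valid else "check operator precedence")
        for i, s in enumerate(steps)
    ]


final_steps = refine_with_critique("2 + 3 * 4", toy_actor, toy_critic)
print(final_steps)
```

In a real setting, the two stand-in functions would wrap calls to separate actor and critique models, and `max_rounds` would bound the extra inference-time computation spent on refinement.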

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Artificial Intelligence in Law