TL;DR
ReCrit introduces a transition-aware reinforcement learning framework to improve scientific critic reasoning in large language models by focusing on correctness transitions during interactions.
Contribution
It proposes an RL approach that decomposes Initial-to-Critic behavior into four quadrants (Correction, Sycophancy, Robustness, and Boundary), rewarding correction and robustness while penalizing harmful sycophancy in scientific reasoning tasks.
Findings
ReCrit improves Critic accuracy significantly on three benchmarks.
Transition-aware rewards outperform final-answer rewards in training.
Dynamic asynchronous rollout reduces interaction training overhead.
Abstract
Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit…
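The four-quadrant decomposition described in the abstract can be illustrated with a minimal Python sketch. It maps the correctness of the initial answer and of the post-criticism answer to one of the four quadrants, then looks up a reward. The names `Quadrant`, `classify_transition`, `transition_reward`, and `ASSUMED_REWARDS` are hypothetical, and the numeric values are placeholders chosen only to respect the ordering the abstract states (reward correction and robustness, penalize sycophancy, treat persistent errors as a weak signal). The sign and size of the boundary value in particular are assumptions, not the paper's actual reward.

```python
from enum import Enum


class Quadrant(Enum):
    CORRECTION = "correction"   # initial wrong   -> critic-turn correct
    SYCOPHANCY = "sycophancy"   # initial correct -> critic-turn wrong
    ROBUSTNESS = "robustness"   # initial correct -> critic-turn correct
    BOUNDARY = "boundary"       # initial wrong   -> critic-turn wrong


def classify_transition(initial_correct: bool, critic_correct: bool) -> Quadrant:
    """Map the pair of per-turn correctness flags to a quadrant."""
    if initial_correct:
        return Quadrant.ROBUSTNESS if critic_correct else Quadrant.SYCOPHANCY
    return Quadrant.CORRECTION if critic_correct else Quadrant.BOUNDARY


# Placeholder magnitudes (assumption): only the relative ordering follows
# the abstract; the paper's real values are not given in this summary.
ASSUMED_REWARDS = {
    Quadrant.CORRECTION: 1.0,
    Quadrant.ROBUSTNESS: 1.0,
    Quadrant.SYCOPHANCY: -1.0,
    Quadrant.BOUNDARY: -0.1,  # weak signal for persistent errors
}


def transition_reward(initial_correct: bool, critic_correct: bool) -> float:
    """Reward that depends on the transition, not just the final answer."""
    return ASSUMED_REWARDS[classify_transition(initial_correct, critic_correct)]


if __name__ == "__main__":
    for initial in (True, False):
        for critic in (True, False):
            quadrant = classify_transition(initial, critic)
            print(initial, critic, quadrant.value, transition_reward(initial, critic))
```

A final-answer reward would score Robustness and Correction identically and likewise Sycophancy and Boundary, since it only looks at the second flag. Keying the reward on the pair of flags is what lets the two failure modes be weighted differently.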
