TL;DR
This paper introduces CSRS, a method for improving the stability and reasoning accuracy of multimodal large language models during unsupervised self-evolution, using retracing re-inference and continuously calibrated reward signals.
Contribution
The paper proposes CSRS, combining retracing re-inference, continuous reward signals, and visual perturbation to enhance reasoning in MLLMs during self-evolution.
Findings
CSRS significantly improves reasoning performance on benchmarks like MathVision.
It achieves state-of-the-art results for unsupervised self-evolution on geometric tasks.
Code is publicly available at the provided GitHub URL.
Abstract
In the unsupervised self-evolution of Multimodal Large Language Models (MLLMs), the quality of feedback signals during post-training is pivotal for stable and effective learning. However, existing self-evolution methods predominantly rely on majority voting to select the most frequent output as the pseudo-golden answer, a choice that may stem from the model's intrinsic biases rather than reflect the objective correctness of the reasoning paths. To counteract this degradation, we propose Continuous Softened Retracing reSampling (CSRS) for MLLM self-evolution. Specifically, we introduce a Retracing Re-inference Mechanism (RRM), in which the model re-infers from anchor points to expand the exploration of long-tail reasoning paths. Simultaneously, we propose Softened Frequency Reward (SFR), which replaces binary rewards with continuous signals, calibrating reward based on the answers' frequency across…
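The contrast between binary majority-vote rewards and frequency-based continuous rewards can be illustrated with a minimal Python sketch. The abstract is truncated before it defines SFR precisely, so the formula below (reward equal to an answer's relative frequency among the sampled outputs) is only an assumption chosen for illustration, and the function names `majority_vote_rewards` and `frequency_rewards` are hypothetical, not taken from the paper or its code.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Binary baseline: 1.0 if an answer equals the most frequent one, else 0.0."""
    counts = Counter(answers)
    pseudo_label, _ = counts.most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]

def frequency_rewards(answers):
    """Assumed continuous variant: reward is the answer's relative frequency.

    This is an illustrative stand-in, not the paper's SFR definition.
    """
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

# Six sampled final answers for one question (made-up values).
samples = ["12", "12", "15", "12", "9", "15"]
binary = majority_vote_rewards(samples)
soft = frequency_rewards(samples)
```

Under this assumption, the minority answers "15" and "9" receive small nonzero rewards instead of zero, so a correct but infrequent answer is not penalized as heavily as it would be under majority voting.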
