GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, Xihui Liu

TL;DR
This paper introduces GRPO-CARE, a reinforcement learning framework that enhances multimodal language models by improving reasoning consistency and accuracy, validated on a new complex video reasoning benchmark.
Contribution
It proposes a novel consistency-aware RL method that balances answer correctness and reasoning coherence without explicit supervision.
Findings
GRPO-CARE outperforms standard GRPO in accuracy and consistency.
It achieves a 6.7% performance boost on the hardest benchmark level.
The framework improves transferability across diverse video understanding tasks.
Abstract
Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To…
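The abstract's point about reward signals that look only at final answers can be illustrated with a minimal sketch of group-relative advantage normalization, the core of outcome-supervised GRPO. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration; implementations differ on these details, and this is not the authors' code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts sampled for the same prompt.

    Each rollout's advantage is its reward minus the group mean, divided by
    the group standard deviation (population std here, an assumption).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Outcome-only rewards: 1.0 if the final answer is correct, else 0.0.
# Rollouts 0 and 3 are both correct, so they receive the same advantage
# even if only one of them reached the answer through coherent reasoning.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because the reward carries no information about the reasoning trace, nothing in this signal distinguishes a coherent rollout from one that shortcuts to the right answer, which is the failure mode the abstract attributes to standard GRPO.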
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
The paper identifies an important direction: improving multimodal reasoning capabilities through a consistency reward. However, the paper could be strengthened by more experiments and analysis, as noted below.
1. The paper could be strengthened by an analysis of how the weighting between the correctness reward and the consistency reward affects model performance.
2. The vast majority of the paper's experiments, analyses, and conclusions are based on a single model architecture: Qwen2.5-VL. The effectiveness of the proposed method should be further tested on other model architectures and scales.
**Originality**:
- The hierarchical design of SEED-Bench-R1 addresses a critical gap in evaluating MLLM generalization across controlled OOD scenarios.
- GRPO-CARE’s use of an EMA reference model for likelihood calibration and group-relative sparse rewards is a creative adaptation of existing RL principles to enforce logical consistency.

**Quality**:
- The ablation studies (Tables 3–5) rigorously validate GRPO-CARE’s components (e.g., EMA reference, two-stage filtering).
- Transfer
**1. Limited Benchmark Scope**:
- SEED-Bench-R1 focuses on video understanding but does not cover other critical multimodal domains (e.g., audio-visual tasks, interactive environments). While the authors mention future expansion, the current narrow scope limits the benchmark’s utility for broader MLLM research.

**2. Reliance on GPT-4 for Consistency Evaluation**:
- The consistency metric (Figure 5) depends on GPT-4.1 judgments, which may inherit biases or errors from the LLM. The paper
- The paper approaches an important problem: how to improve a model's multimodal reasoning capability by designing better RL algorithms.
- The paper proposes a novel algorithm, GRPO-CARE, and a new validation benchmark, SEED-Bench-R1, both of which benefit the general multimodal learning community.
- The motivation is clear, and the approach is reasonable, with decent performance improvements over the baseline methods on the proposed SEED-Bench-R1 benchmark.
- The listed and c
- The performance improvement of GRPO-CARE on the general video understanding benchmarks is not significant. Also, other baselines, such as Video-R1-7B, are not evaluated on SEED-Bench-R1, which makes the reported results slightly less convincing.
- The idea of using a slowly updated EMA reference model to calibrate the log-probabilities of rollout reasoning traces is interesting, but the paper lacks an in-depth investigation of why this approach works better than other options. A more systematic st
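The two mechanisms the reviewers discuss, a slowly updated EMA reference model and a group-relative consistency reward added on top of correctness, can be sketched as follows. This is a hypothetical illustration, not the paper's formulation: the function names, the `decay` and `bonus` values, the use of the mean as the group threshold, and the plain-dict parameter representation are all assumptions. The `bonus` argument stands in for the correctness-versus-consistency weighting that the first review asks the authors to analyze.

```python
def ema_update(ref_params, online_params, decay=0.99):
    """Move each reference parameter a small step toward the online model."""
    return {
        name: decay * ref_params[name] + (1.0 - decay) * online_params[name]
        for name in ref_params
    }


def total_rewards(correct, ref_answer_logprob, bonus=0.5):
    """Hypothetical two-part reward: correctness plus a sparse consistency bonus.

    correct[i]: 1.0 if rollout i answered correctly, else 0.0.
    ref_answer_logprob[i]: log-likelihood, under the reference model, of
        rollout i's answer given its own reasoning trace (assumed precomputed).
    Only correct rollouts whose likelihood is at least the mean over the
    correct rollouts in the same group receive the bonus.
    """
    scores = [lp for c, lp in zip(correct, ref_answer_logprob) if c > 0]
    if not scores:
        return list(correct)  # no correct rollout, so no bonus to assign
    threshold = sum(scores) / len(scores)
    return [
        c + (bonus if c > 0 and lp >= threshold else 0.0)
        for c, lp in zip(correct, ref_answer_logprob)
    ]


# Three correct rollouts and one incorrect one; the second correct rollout
# has an answer that follows poorly from its reasoning, so it gets no bonus.
rewards = total_rewards([1.0, 1.0, 0.0, 1.0], [-0.2, -2.0, -0.1, -0.5])
```

In this sketch the incorrect rollout never earns the bonus, however likely its answer looks under the reference model, so the consistency term cannot outweigh correctness.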
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
