VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking
Desen Meng, Rui Huang, Zhilin Dai, Xinhao Li, Yifan Xu, Jun Zhang, Zhenpeng Huang, Meng Zhang, Lingshu Zhang, Yi Liu, Limin Wang

TL;DR
This paper introduces VideoCap-R1, a structured thinking approach with reinforcement learning to improve video captioning in multi-modal large language models, leading to more accurate action descriptions.
Contribution
It pioneers the use of GRPO-based RL post-training with structured reasoning for video captioning in MLLMs, demonstrating significant performance improvements.
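As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several captions sampled for the same video against the group's mean and standard deviation. The function name, the `eps` term, and the use of the population standard deviation are assumptions made for illustration; this page does not give the paper's exact formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled captions.

    Hypothetical helper: each reward is centered on the group mean and
    scaled by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four captions sampled from the same video.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)
```

Captions scoring above the group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to obtain a baseline.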
Findings
Achieved +4.4 event F1 on DREAM1K
Improved +4.2 accuracy on VDC
Enhanced +3.1 action F1 and +6.9 object F1 on CAREBENCH
Abstract
While recent advances in reinforcement learning have significantly enhanced reasoning capabilities in large language models (LLMs), these techniques remain underexplored in multi-modal LLMs for video captioning. This paper presents the first systematic investigation of GRPO-based RL post-training for video MLLMs, with the goal of enhancing video MLLMs' capability of describing actions in videos. Specifically, we develop VideoCap-R1, which is prompted to first perform structured thinking that analyzes video subjects with their attributes and actions before generating complete captions, supported by two specialized reward mechanisms: an LLM-free think scorer evaluating the structured thinking quality and an LLM-assisted caption scorer assessing the output quality. The RL training framework effectively establishes the connection between structured reasoning and comprehensive description…
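One way an LLM-free think scorer could work is a set-based F1 between the attributes and actions listed in the structured thinking and a reference annotation; a reviewer below describes the scorer as based on attribute/action F1. The sketch assumes exact string matching after lowercasing and an unweighted average of the two F1 values. The names `set_f1` and `think_score`, the dictionary layout, and the matching rule are hypothetical and are not taken from the paper.

```python
def set_f1(predicted, reference):
    """F1 between two collections of strings, using exact match (an assumption)."""
    pred = {p.strip().lower() for p in predicted}
    ref = {r.strip().lower() for r in reference}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def think_score(pred_think, ref_think):
    """Average attribute F1 and action F1 (equal weighting is an assumption)."""
    attribute_f1 = set_f1(pred_think["attributes"], ref_think["attributes"])
    action_f1 = set_f1(pred_think["actions"], ref_think["actions"])
    return (attribute_f1 + action_f1) / 2


# Made-up example: one spurious attribute, both actions correct.
predicted = {"attributes": ["red shirt", "tall"], "actions": ["runs", "jumps"]}
reference = {"attributes": ["Red shirt"], "actions": ["jumps", "runs"]}
score = think_score(predicted, reference)
```

Because the score comes from set overlap alone, it can be computed without calling a language model, which is the property that "LLM-free" refers to.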
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The proposed structured reasoning framework provides a practical RL-based paradigm for video captioning, enabling the model to focus on key subjects, attributes, and actions in videos and thereby generate more comprehensive and accurate captions.
1. The performance improvement is relatively limited — VideoCap-R1 underperforms Tarsier on the DREAM-1K benchmark and remains notably behind proprietary models such as Gemini-1.5-Pro. 2. The comparison set lacks stronger and more recent baselines (e.g., Gemini-2.5-Pro/Flash, Qwen2.5-VL, InternVL3/3.5), making it difficult to assess competitiveness against state-of-the-art models. 3. The evaluation is confined to only three benchmarks; given that VideoCap-R1 emphasizes structured reasoning for c
1. **Method design:** The two-stage framework combining structured reasoning and caption generation is well-motivated, and the use of GRPO-based training with carefully designed reward functions is conceptually sound and aligns naturally with the task. 2. **Strong results:** The proposed model demonstrates consistent and substantial improvements across multiple video captioning benchmarks, validating both the effectiveness and data efficiency of the approach.
1. **Effectiveness of CNscore:** The ablation results indicate that Escore is generally better than CNscore, making the contribution of introducing CNscore less clear. It remains uncertain what unique benefit CNscore brings given that a more effective alternative already exists. 2. **Overfitting to captioning tasks:** While the method achieves strong performance in video captioning, the design choices, such as structured reasoning tailored for description generation, appear highly specific to th
- S1. [Idea] The basic idea of VideoCap-R1 is two-stage caption generation that first performs structured reasoning and then synthesizes output captions. This seems intuitive and effective. - S2. [Solution] The authors successfully apply GRPO to MLLM-based video captioning.
- W1. [Technical soundness] One of the main contributions of this paper is the caption reward design. It consists of an LLM-free think scorer (Tscore) and an LLM-assisted caption scorer (CNscore and Escore). However, they seem rather heuristic and do not perform consistently. According to the ablation study in Table 3, Tscore increases the average score by 1.6 compared to the baseline. Also, even though Tscore + Escore provides the highest score, Tscore + CNscore provides a better score on specific ben
1. Innovative Structured Reasoning Framework: The two-stage generation process (structured thinking → caption) effectively bridges fine-grained visual perception and fluent description, leading to more accurate and detailed captions—especially for dynamic actions. 2. Effective Reward Design for Open-Ended Tasks: The combination of an LLM-free think scorer (based on attribute/action F1) and an event-based LLM-assisted caption scorer provides robust, objective signals for RL training, successfull
1. The absolute performance is not particularly strong: DREAM-1K achieves only an F1 score of 34.2, and VDC attains an accuracy of merely 43.8—metrics that many similarly sized general-purpose models can also reach. In this context, the claimed effectiveness of VideoCap-R1’s caption-specific optimizations is not convincingly demonstrated. 2. The hyperparameters “δ₁ = 0.28, δ₂ = 0.35” appear highly fine-tuned; the paper does not clarify how they were determined. Such precise tuning raises concer
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Subtitles and Audiovisual Media
