VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks
Xinlong Chen, Yuanxing Zhang, Yushuo Guan, Weihong Lin, Zekun Wang, Bohan Zeng, Yang Shi, Sihan Yang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan

TL;DR
This paper introduces VidBridge-R1, a novel training framework with intermediate proxy tasks that enables a single video understanding model to excel at both question answering and captioning, overcoming previous performance conflicts.
Contribution
It proposes a new training paradigm with proxy tasks to unify QA and captioning in video models, leading to a versatile and improved video reasoning model.
Findings
VidBridge-R1 outperforms previous models on QA and captioning tasks.
Proxy tasks improve the model's ability to understand and reason about videos.
The framework enhances generalization across multiple video understanding tasks.
Abstract
The "Reason-Then-Respond" paradigm, enhanced by Reinforcement Learning, has shown great promise in advancing Multimodal Large Language Models. However, its application to the video domain has led to specialized models that excel at either question answering (QA) or captioning tasks, but struggle to master both. Naively combining reward signals from these tasks results in mutual performance degradation, which we attribute to a conflict between their opposing task natures. To address this challenge, we propose a novel training framework built upon two intermediate proxy tasks: DarkEventInfer, which presents videos with masked event segments, requiring models to infer the obscured content based on contextual video cues; and MixVidQA, which presents interleaved video sequences composed of two distinct clips, challenging models to isolate and reason about one while disregarding the other.…
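To make the two proxy-task inputs concrete, here is a minimal sketch of how they might be constructed. It is not the authors' pipeline. Frames are stand-in list items, and the names `mask_event`, `interleave_clips`, `MASK`, and the fixed `chunk` size are hypothetical choices for illustration. The abstract does not specify how segments are masked or how the two clips are interleaved.

```python
from typing import List, Optional, Sequence, Tuple

MASK = None  # hypothetical placeholder standing in for a blacked-out frame


def mask_event(frames: Sequence, start: int, end: int) -> Tuple[List[Optional[object]], List]:
    """DarkEventInfer-style input: hide frames[start:end].

    Returns the masked sequence and the hidden frames (the content to infer).
    """
    if not 0 <= start < end <= len(frames):
        raise ValueError("need 0 <= start < end <= len(frames)")
    masked = list(frames)
    hidden = masked[start:end]
    masked[start:end] = [MASK] * (end - start)
    return masked, hidden


def interleave_clips(clip_a: Sequence, clip_b: Sequence, chunk: int = 2) -> Tuple[List, List[str]]:
    """MixVidQA-style input: alternate fixed-size chunks from two clips.

    Returns the mixed sequence and a per-frame source label ("A" or "B").
    The alternating fixed-size chunking is an assumption for illustration.
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    mixed: List = []
    labels: List[str] = []
    for k in range(0, max(len(clip_a), len(clip_b)), chunk):
        for name, clip in (("A", clip_a), ("B", clip_b)):
            part = list(clip[k:k + chunk])
            mixed.extend(part)
            labels.extend([name] * len(part))
    return mixed, labels


masked, hidden = mask_event(list(range(6)), 2, 4)
mixed, labels = interleave_clips(["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"])
```

In this toy setup, a model given `masked` would be asked to describe the `hidden` segment, while a model given `mixed` would be asked a question about only one of the two source clips.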
Peer Reviews
Decision·ICLR 2026 Poster
1. Overall, the paper clearly identifies a meaningful optimization conflict between QA and captioning under RL training and motivates why naive multi-task RL leads to mutual degradation. 2. The two proxy tasks (DarkEventInfer and MixVidQA) are intuitively aligned with promoting both holistic contextual reasoning and selective information grounding, and their construction process is described with adequate clarity and filtering steps. 3. The experimental evaluation is comprehensive, covering gene
1. While the proxy tasks appear to be effective, it is still not fully demonstrated "why" these particular tasks are optimal among possible bridging tasks. A brief analysis or comparison with alternative proxy formulations (e.g., temporal reordering tasks, masked key-frame inference) would provide more insights to the readers. 2. The experiments are mainly conducted on Qwen2.5-VL-7B-Instruct. It is encouraged to discuss how generalizable the method is to stronger or smaller video-language backbo
- Well-motivated problem statement and intuitive proposals - SOTA performance on both captioning and QA tasks
- The ablation in Table 3 omits important rows showing the benefit of the proposed tasks together with the caption task, as well as the row with MixVidQA but not DarkEventInfer - Each task being based on different data makes it difficult to disentangle the benefits of the task vs. the data
1. The idea sounds reasonable. Effective data augmentation methods are always needed. 2. The writing is clear. 3. The presented results are promising.
1. As a data augmentation method, its scalability is questionable. 2. The paper lacks a detailed analysis of why the reasoning models perform poorly on general video understanding and captioning tasks, while the model trained with the augmented reasoning data brings clear improvement. Is this a specific kind of overfitting or data leakage?
