VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks

Xinlong Chen; Yuanxing Zhang; Yushuo Guan; Weihong Lin; Zekun Wang; Bohan Zeng; Yang Shi; Sihan Yang; Qiang Liu; Pengfei Wan; Liang Wang; Tieniu Tan

arXiv:2506.09079·cs.CV·September 29, 2025

3 Reviews

TL;DR

This paper introduces VidBridge-R1, a novel training framework with intermediate proxy tasks that enables a single video understanding model to excel at both question answering and captioning, overcoming previous performance conflicts.

Contribution

It proposes a new training paradigm with proxy tasks to unify QA and captioning in video models, leading to a versatile and improved video reasoning model.

Findings

01. VidBridge-R1 outperforms previous models on QA and captioning tasks.

02. Proxy tasks improve the model's ability to understand and reason about videos.

03. The framework enhances generalization across multiple video understanding tasks.

Abstract

The "Reason-Then-Respond" paradigm, enhanced by Reinforcement Learning, has shown great promise in advancing Multimodal Large Language Models. However, its application to the video domain has led to specialized models that excel at either question answering (QA) or captioning tasks, but struggle to master both. Naively combining reward signals from these tasks results in mutual performance degradation, which we attribute to a conflict between their opposing task natures. To address this challenge, we propose a novel training framework built upon two intermediate proxy tasks: DarkEventInfer, which presents videos with masked event segments, requiring models to infer the obscured content based on contextual video cues; and MixVidQA, which presents interleaved video sequences composed of two distinct clips, challenging models to isolate and reason about one while disregarding the other.…
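To make the two proxy tasks concrete, here is a minimal sketch of how their inputs might be built from frame sequences. It is an illustration based only on the abstract's description, not the authors' data pipeline: the function names `mask_event` and `interleave_clips`, the `BLACK` placeholder, and the fixed-size chunking are all assumptions, and frames are represented as strings purely for readability.

```python
from typing import List, Sequence, Tuple

Frame = str  # stand-in for a decoded frame; a real pipeline would use image arrays

BLACK = "<black>"  # hypothetical placeholder for a masked (obscured) frame


def mask_event(
    frames: Sequence[Frame], start: int, end: int
) -> Tuple[List[Frame], List[Frame]]:
    """DarkEventInfer-style input (assumed form): hide frames[start:end].

    Returns the masked video and the hidden segment, which a model would
    have to infer from the surrounding context.
    """
    if not 0 <= start < end <= len(frames):
        raise ValueError("segment must lie inside the video")
    masked = list(frames[:start]) + [BLACK] * (end - start) + list(frames[end:])
    return masked, list(frames[start:end])


def interleave_clips(
    clip_a: Sequence[Frame], clip_b: Sequence[Frame], chunk: int
) -> List[Frame]:
    """MixVidQA-style input (assumed form): alternate chunks of two clips.

    Each clip keeps its internal frame order; a question would then target
    only one of the two clips.
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    mixed: List[Frame] = []
    for i in range(0, max(len(clip_a), len(clip_b)), chunk):
        mixed.extend(clip_a[i:i + chunk])
        mixed.extend(clip_b[i:i + chunk])
    return mixed


if __name__ == "__main__":
    video = ["f0", "f1", "f2", "f3", "f4"]
    masked, hidden = mask_event(video, 1, 3)
    print(masked, hidden)
    print(interleave_clips(["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"], 2))
```

In this sketch the masked video keeps its original length, so the position and duration of the hidden event remain visible to the model; whether the actual tasks preserve this is not stated in the abstract.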

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 5

Strengths

1. Overall, the paper clearly identifies a meaningful optimization conflict between QA and captioning under RL training and motivates why naive multi-task RL leads to mutual degradation.
2. The two proxy tasks (DarkEventInfer and MixVidQA) are intuitively aligned with promoting both holistic contextual reasoning and selective information grounding, and their construction process is described with adequate clarity and filtering steps.
3. The experimental evaluation is comprehensive, covering gene

Weaknesses

1. While the proxy tasks appear to be effective, it is still not fully demonstrated "why" these particular tasks are optimal among possible bridging tasks. A brief analysis or comparison with alternative proxy formulations (e.g., temporal reordering tasks, masked key-frame inference) would provide more insights to the readers.
2. The experiments are mainly conducted on Qwen2.5-VL-7B-Instruct. It is encouraged to discuss how generalizable the method is to stronger or smaller video-language backbo

Reviewer 02·Rating 6·Confidence 3

Strengths

- Well-motivated problem statement and intuitive proposals
- SOTA performance on both captioning and QA tasks

Weaknesses

- The ablation in Table 3 omits important rows: those showing the benefit of the proposed tasks together with the caption task, and the row with MixVidQA but without DarkEventInfer
- Each task is based on different data, which makes it difficult to disentangle the benefits of the task from those of the data

Reviewer 03·Rating 4·Confidence 3

Strengths

1. The idea sounds reasonable. Effective data augmentation methods are always needed.
2. The writing is clear.
3. The presented results are promising.

Weaknesses

1. As a data augmentation method, its scalability is questionable.
2. The paper lacks a detailed analysis of why reasoning models perform poorly on general video understanding and captioning tasks, while the model trained with the augmented reasoning data brings clear improvement. Is this a specific kind of overfitting or data leakage?
