TL;DR
This paper introduces VideoThinker, a causal-inspired debiasing framework for lightweight video reasoning models that improves generalization by actively reducing perceptual shortcut biases.
Contribution
It proposes a two-stage debiasing method, Bias Aware Training (which builds a dedicated bias model) followed by Causal Debiasing Policy Optimization (CDPO), to enhance reasoning in lightweight models without extensive fine-tuning.
Findings
VideoThinker-R1 achieves state-of-the-art efficiency in video reasoning.
It surpasses larger models on multiple benchmarks with minimal training data.
The approach effectively reduces perceptual bias and improves generalization.
Abstract
Although reinforcement learning (RL) has significantly advanced reasoning capabilities in multimodal large language models (MLLMs), its efficacy remains limited for the lightweight models essential for edge deployments. To address this issue, we use causal analysis and experiments to reveal the underlying phenomenon of perceptual bias, demonstrating that RL-based fine-tuning compels lightweight models to preferentially adopt perceptual shortcuts induced by data biases rather than develop genuine reasoning abilities. Motivated by this insight, we propose VideoThinker, a causal-inspired framework that cultivates robust reasoning in lightweight models through a two-stage debiasing process. First, the Bias Aware Training stage forges a dedicated "bias model" to embody these shortcut behaviors. Then, the Causal Debiasing Policy Optimization (CDPO) algorithm fine-tunes the primary model,…
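The abstract excerpt ends before it explains how CDPO uses the bias model, so the sketch below is not the paper's algorithm. It only illustrates one generic way a bias model could shape an RL signal: rewards for sampled responses are reduced in proportion to how likely a shortcut-prone bias model finds them, and the result is normalized within the group of samples. The function name `debiased_advantages`, the weight `alpha`, and the linear penalty are all assumptions made for illustration.

```python
import math
from typing import List


def debiased_advantages(
    rewards: List[float],
    bias_logprobs: List[float],
    alpha: float = 0.5,  # hypothetical penalty weight, not from the paper
) -> List[float]:
    """Group-normalized advantages with a penalty for bias-favored responses.

    rewards:        task reward for each sampled response in one group.
    bias_logprobs:  log-probability a (hypothetical) bias model assigns to
                    each response; high values suggest a shortcut answer.
    """
    if len(rewards) != len(bias_logprobs):
        raise ValueError("rewards and bias_logprobs must have equal length")

    # Map bias-model log-probabilities to probabilities in [0, 1].
    bias_probs = [math.exp(lp) for lp in bias_logprobs]

    # Assumed shaping rule: subtract a penalty proportional to how strongly
    # the bias model favors the response.
    shaped = [r - alpha * p for r, p in zip(rewards, bias_probs)]

    # Normalize within the group (zero mean, unit scale).
    mean = sum(shaped) / len(shaped)
    std = math.sqrt(sum((s - mean) ** 2 for s in shaped) / len(shaped))
    return [(s - mean) / (std + 1e-8) for s in shaped]


# Two correct responses (reward 1) and two incorrect ones (reward 0); the
# bias model strongly favors the first and third.
adv = debiased_advantages(
    rewards=[1.0, 1.0, 0.0, 0.0],
    bias_logprobs=[math.log(0.9), math.log(0.1), math.log(0.9), math.log(0.1)],
)
```

Under this assumed rule, a correct response that the bias model finds unlikely receives a larger advantage than an equally correct response that matches the shortcut, which is the qualitative behavior a debiasing objective would aim for.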
