ReasVQA: Advancing VideoQA with Imperfect Reasoning Process
Jianxin Liang, Xiaojun Meng, Huishuai Zhang, Yueqian Wang, Jiansheng Wei, Dongyan Zhao

TL;DR
ReasVQA introduces a reasoning-enhanced approach using multimodal large language models to significantly improve VideoQA performance across multiple benchmarks.
Contribution
The paper presents a novel multi-phase method leveraging reasoning generated by MLLMs to enhance VideoQA accuracy and establish new state-of-the-art results.
Findings
Achieved new state-of-the-art results on three VideoQA benchmarks.
Demonstrated the effectiveness of reasoning supervision in VideoQA.
Validated the robustness of the approach with different backbones and MLLMs.
Abstract
Video Question Answering (VideoQA) is a challenging task that requires understanding complex visual and temporal relationships within videos to answer questions accurately. In this work, we introduce ReasVQA (Reasoning-enhanced Video Question Answering), a novel approach that leverages reasoning processes generated by Multimodal Large Language Models (MLLMs) to improve the performance of VideoQA models. Our approach consists of three phases: reasoning generation, reasoning refinement, and learning from reasoning. First, we generate detailed reasoning processes using additional MLLMs. Second, we refine them via a filtering step to ensure data quality. Finally, we use the reasoning data, which may be imperfect, to guide the VideoQA model, via multi-task learning, in how to interpret and answer questions based on a given video. We evaluate ReasVQA on three popular…
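The refinement and multi-task learning phases described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the word-count filter `keep_reasoning`, the `reasoning_weight` coefficient, and the choice of a weighted sum of an answer loss and a token-level reasoning loss. The abstract does not specify the actual filtering rule or loss formulation.

```python
import math


def keep_reasoning(reasoning, min_words=5):
    """Hypothetical quality filter: keep reasoning with at least min_words words.

    The real refinement criterion is not given in the abstract.
    """
    return len(reasoning.split()) >= min_words


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target_index])


def multitask_loss(answer_probs, answer_index, reasoning_token_probs,
                   reasoning_weight=0.5):
    """Answer loss plus a weighted, mean token-level reasoning loss (assumed form).

    reasoning_token_probs holds the model's probability for each ground-truth
    reasoning token; an empty list means the reasoning was filtered out, so
    only the answer loss is used.
    """
    answer_loss = cross_entropy(answer_probs, answer_index)
    if not reasoning_token_probs:
        return answer_loss
    reasoning_loss = -sum(math.log(p) for p in reasoning_token_probs)
    reasoning_loss /= len(reasoning_token_probs)
    return answer_loss + reasoning_weight * reasoning_loss
```

In this sketch, a sample whose generated reasoning fails the filter still contributes to training through its answer loss alone, which is one plausible way to tolerate imperfect reasoning data.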
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications
