ReasVQA: Advancing VideoQA with Imperfect Reasoning Process

Jianxin Liang; Xiaojun Meng; Huishuai Zhang; Yueqian Wang; Jiansheng Wei; Dongyan Zhao

arXiv:2501.13536·cs.CV·January 24, 2025

TL;DR

ReasVQA introduces a reasoning-enhanced approach using multimodal large language models to significantly improve VideoQA performance across multiple benchmarks.

Contribution

The paper presents a novel multi-phase method leveraging reasoning generated by MLLMs to enhance VideoQA accuracy and establish new state-of-the-art results.

Findings

1. Achieved new state-of-the-art results on three VideoQA benchmarks.
2. Demonstrated the effectiveness of reasoning supervision in VideoQA.
3. Validated the robustness of the approach with different backbones and MLLMs.

Abstract

Video Question Answering (VideoQA) is a challenging task that requires understanding complex visual and temporal relationships within videos to answer questions accurately. In this work, we introduce ReasVQA (Reasoning-enhanced Video Question Answering), a novel approach that leverages reasoning processes generated by Multimodal Large Language Models (MLLMs) to improve the performance of VideoQA models. Our approach consists of three phases: reasoning generation, reasoning refinement, and learning from reasoning. First, we generate detailed reasoning processes using additional MLLMs. Second, we refine them through a filtering step to ensure data quality. Finally, we use the reasoning data, which may be imperfect, to guide the VideoQA model through multi-task learning on how to interpret and answer questions about a given video. We evaluate ReasVQA on three popular…
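
The refinement and multi-task phases described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the filtering criteria or how the losses are combined, so everything here is an assumption made for illustration: the `ReasoningSample` type, the length and answer-mention checks in `refine`, and the weight `alpha` in `multitask_loss` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReasoningSample:
    question: str
    answer: str      # gold answer from the VideoQA dataset
    reasoning: str   # reasoning text produced by an MLLM, possibly imperfect


def refine(samples: List[ReasoningSample], min_words: int = 5) -> List[ReasoningSample]:
    """Hypothetical filter: keep reasoning that is long enough and mentions the gold answer."""
    kept = []
    for s in samples:
        long_enough = len(s.reasoning.split()) >= min_words
        mentions_answer = s.answer.lower() in s.reasoning.lower()
        if long_enough and mentions_answer:
            kept.append(s)
    return kept


def multitask_loss(answer_loss: float, reasoning_loss: float, alpha: float = 0.5) -> float:
    """Assumed combination: QA loss plus a weighted auxiliary reasoning loss."""
    return answer_loss + alpha * reasoning_loss


samples = [
    ReasoningSample("What does the man pick up?", "cup",
                    "The man picks up the cup and then drinks from it"),
    ReasoningSample("What animal appears?", "dog", "Unclear."),
]
clean = refine(samples)
total = multitask_loss(answer_loss=1.0, reasoning_loss=2.0, alpha=0.5)
```

In this sketch the filter only discards clearly unusable reasoning, so the retained text may still contain errors, which matches the abstract's point that the reasoning used for supervision can be imperfect.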

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications