Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries

Roberto Amoroso; Gengyuan Zhang; Rajat Koner; Lorenzo Baraldi; Rita Cucchiara; Volker Tresp

arXiv:2412.19304 · cs.CV · December 30, 2024

Open Access

TL;DR

This paper introduces T-Former, a novel temporal modeling approach that enhances Video QA by creating question-guided temporal bridges, improving the integration of visual and textual reasoning in multimodal large language models.

Contribution

The paper proposes T-Former, a new temporal modeling technique that effectively aligns visual perception with reasoning in Video QA tasks using question-guided temporal modeling.

Findings

01. T-Former outperforms existing temporal models on multiple benchmarks.

02. It effectively leverages question-guided temporal information for better video understanding.

03. It aligns visual perception with reasoning capabilities in multimodal models.

Abstract

Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos, identify the most relevant information based on contextual cues from a given question, and reason accurately to provide answers. Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities. This progress is largely driven by the effective alignment between visual data and the language space of MLLMs. However, for video QA, an additional space-time alignment poses a considerable challenge for extracting question-relevant information across frames. In this work, we investigate diverse temporal modeling techniques to integrate with MLLMs, aiming to achieve question-guided temporal modeling that leverages pre-trained visual and textual alignment in MLLMs. We propose…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Data Compression Techniques · Image Retrieval and Classification Techniques