Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Qirui Chen; Shangzhe Di; Weidi Xie

arXiv:2408.14469·cs.CV·August 27, 2024

TL;DR

This paper introduces a new multi-hop VideoQA task in egocentric videos, develops a dataset and benchmark, and proposes a novel model that improves multi-hop reasoning and grounding in long-form videos.

Contribution

The paper builds a large-scale dataset and a benchmark, MultiHop-EgoQA, for multi-hop VideoQA in egocentric videos, and proposes GeLM, a model that adds a grounding module to strengthen multi-modal reasoning.

Findings

01. Existing systems have limited multi-hop reasoning abilities.

02. GeLM improves multi-hop grounding and reasoning performance.

03. The model achieves state-of-the-art results on ActivityNet-RTL.

Abstract

This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. The task requires not only answering visual questions but also localizing multiple relevant time intervals within the video as visual evidence. We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence, enabling the construction of a large-scale dataset for instruction-tuning. To monitor progress on this new task, we further curate a high-quality benchmark, MultiHop-EgoQA, with careful manual verification and refinement. Experimental results reveal that existing multi-modal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal…
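To make the task format concrete, the following is a minimal sketch of how one question-answer pair with several evidence intervals might be represented, and how a predicted set of intervals could be scored against the annotated set. The names `MultiHopSample` and `multi_interval_iou` are hypothetical, and the set-level IoU shown here is an illustrative choice, not necessarily the metric used in the paper or in the MultiHop-EgoQA repository.

```python
from dataclasses import dataclass
from typing import List, Tuple

# An interval is (start_sec, end_sec) within the video. This layout is an assumption.
Interval = Tuple[float, float]


@dataclass
class MultiHopSample:
    """Hypothetical container for one grounded multi-hop QA pair."""
    question: str
    answer: str
    evidence: List[Interval]  # several scattered time spans, not just one


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into disjoint ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: List[Interval]) -> float:
    """Total duration covered by a set of intervals, counting overlaps once."""
    return sum(end - start for start, end in merge_intervals(intervals))


def multi_interval_iou(pred: List[Interval], gt: List[Interval]) -> float:
    """IoU between two sets of intervals (illustrative metric, an assumption)."""
    union = total_length(list(pred) + list(gt))
    if union == 0:
        return 0.0
    # Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
    intersection = total_length(pred) + total_length(gt) - union
    return intersection / union
```

For instance, scoring the prediction `[(0, 10), (20, 30)]` against the annotation `[(5, 10), (20, 25)]` covers 10 seconds in common out of a 20-second union. Treating the evidence as a set of intervals, rather than a single span, is what distinguishes this setting from single-moment temporal grounding.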

Code & Models

Repositories

qirui-chen/MultiHop-EgoQA
pytorch

Videos

Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos · underline

Taxonomy

Topics: Multimedia Communication and Technology