Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos
Qirui Chen, Shangzhe Di, Weidi Xie

TL;DR
This paper introduces a new multi-hop VideoQA task in egocentric videos, develops a dataset and benchmark, and proposes a novel model that improves multi-hop reasoning and grounding in long-form videos.
Contribution
It creates a large-scale dataset and benchmark for multi-hop VideoQA in egocentric videos and proposes GeLM, a model enhancing multi-modal reasoning with a grounding module.
Findings
Existing multi-modal systems show inadequate multi-hop grounding and reasoning abilities.
GeLM improves multi-hop grounding and reasoning performance.
The model achieves state-of-the-art results on ActivityNet-RTL.
Abstract
This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. The task requires not only answering visual questions but also localizing multiple relevant time intervals within the video as visual evidence. We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence, enabling the construction of a large-scale dataset for instruction-tuning. To monitor progress on this new task, we further curate a high-quality benchmark, MultiHop-EgoQA, with careful manual verification and refinement. Experimental results reveal that existing multi-modal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal…
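To make the task format concrete, here is a minimal sketch of a multi-hop sample: one question, one answer, and several evidence intervals scattered across the video. The class name, field names, the example question, and the union-based IoU score are all illustrative assumptions. They are not the MultiHop-EgoQA schema or the paper's evaluation metric.

```python
from dataclasses import dataclass

# (start_sec, end_sec); hypothetical representation of one evidence span
Interval = tuple[float, float]


@dataclass
class MultiHopSample:
    """Hypothetical record layout, not the actual dataset schema."""
    question: str
    answer: str
    evidence: list[Interval]  # several time spans, possibly far apart


def merge(intervals: list[Interval]) -> list[Interval]:
    """Merge overlapping or touching intervals into disjoint ones."""
    merged: list[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: list[Interval]) -> float:
    """Total duration covered by a set of intervals, counting overlaps once."""
    return sum(end - start for start, end in merge(intervals))


def multi_span_iou(pred: list[Interval], gt: list[Interval]) -> float:
    """IoU between the union of predicted spans and the union of ground-truth
    spans. This is one plausible way to score multi-interval grounding,
    assumed here for illustration only."""
    union = total_length(pred + gt)
    if union == 0:
        return 0.0
    intersection = total_length(pred) + total_length(gt) - union
    return intersection / union


# Made-up sample for illustration; not taken from the benchmark.
sample = MultiHopSample(
    question="Where did I put the knife after each time I used it?",
    answer="On the cutting board, then in the sink.",
    evidence=[(5.0, 10.0), (20.0, 25.0)],
)
predicted = [(0.0, 10.0), (20.0, 30.0)]
score = multi_span_iou(predicted, sample.evidence)
```

Scoring the union of spans, rather than matching spans one by one, keeps the sketch short. A per-span matching scheme would penalize merged or split predictions differently.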
Taxonomy
Topics: Multimedia Communication and Technology
