Infusing Environmental Captions for Long-Form Video Language Grounding

Hyogun Lee; Soyeon Hong; Mujeen Sung; Jinwoo Choi

arXiv:2408.02336 · cs.CV · August 7, 2024

Open Access

TL;DR

This paper introduces EI-VLG, a novel long-form video-language grounding method that uses multi-modal large language models to incorporate richer textual context, improving the accuracy of localizing relevant video segments based on natural language queries.

Contribution

The paper proposes EI-VLG, a new approach that leverages multi-modal large language models to better exclude irrelevant frames in long-form video-language grounding tasks.

Findings

01 EI-VLG outperforms existing methods on the EgoNLQ benchmark.

02 Richer textual information improves the detection of relevant frames.

03 The approach reduces reliance on superficial cues in VLG.

Abstract

In this work, we tackle the problem of long-form video-language grounding (VLG). Given a long-form video and a natural language query, a model should temporally localize the precise moment that answers the query. Humans can easily solve VLG tasks, even with arbitrarily long videos, by discarding irrelevant moments using extensive and robust knowledge gained from experience. Unlike humans, existing VLG methods are prone to relying on superficial cues learned from small-scale datasets, even when those cues appear in irrelevant frames. To overcome this challenge, we propose EI-VLG, a VLG method that leverages richer textual information provided by a Multi-modal Large Language Model (MLLM) as a proxy for human experiences, helping to effectively exclude irrelevant frames. We validate the effectiveness of the proposed method via extensive experiments on the challenging EgoNLQ benchmark.
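The idea of using captions to discard irrelevant moments can be illustrated with a toy sketch. This is not the EI-VLG implementation: it assumes each video segment already has a caption (in the paper, such text would come from an MLLM), and it swaps in a plain token-overlap (Jaccard) score where a real system would use a learned text encoder. The function names, the `(start_sec, end_sec, caption)` tuple layout, the sample captions, and the `threshold` value are all hypothetical choices made for illustration.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_segments(query: str, captions: list, threshold: float = 0.1) -> list:
    """Keep the (start, end) spans whose caption overlaps the query enough.

    captions: list of (start_sec, end_sec, caption_text) tuples.
    Token overlap is a stand-in (assumption) for a learned similarity model.
    """
    q = tokenize(query)
    return [
        (start, end)
        for start, end, text in captions
        if jaccard(q, tokenize(text)) >= threshold
    ]


# Hypothetical captions for three 30-second segments of a long video.
captions = [
    (0.0, 30.0, "A person walks down a hallway."),
    (30.0, 60.0, "A person puts the scissors in a drawer."),
    (60.0, 90.0, "A person washes dishes in the sink."),
]
query = "Where did I put the scissors?"

# Only segments whose caption shares enough words with the query survive.
candidates = filter_segments(query, captions)
print(candidates)
```

A downstream grounding model would then only need to search the surviving segments, which is the intuition behind excluding irrelevant frames before localization. Raw token overlap is sensitive to stopwords and paraphrase, so it serves here only to make the filtering step concrete.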

Taxonomy

Topics: Subtitles and Audiovisual Media · Multimodal Machine Learning Applications · Natural Language Processing Techniques