Infusing Environmental Captions for Long-Form Video Language Grounding
Hyogun Lee, Soyeon Hong, Mujeen Sung, Jinwoo Choi

TL;DR
This paper introduces EI-VLG, a novel long-form video-language grounding method that uses multi-modal large language models to incorporate richer textual context, improving the accuracy of localizing relevant video segments based on natural language queries.
Contribution
The paper proposes EI-VLG, a new approach that leverages multi-modal large language models to better exclude irrelevant frames in long-form video-language grounding tasks.
Findings
EI-VLG outperforms existing methods on the EgoNLQ benchmark.
Using richer textual information improves frame relevance detection.
The approach reduces reliance on superficial cues in VLG.
Abstract
In this work, we tackle the problem of long-form video-language grounding (VLG). Given a long-form video and a natural language query, a model should temporally localize the precise moment that answers the query. Humans can easily solve VLG tasks, even with arbitrarily long videos, by discarding irrelevant moments using extensive and robust knowledge gained from experience. Unlike humans, existing VLG methods tend to latch onto superficial cues learned from small-scale datasets, even when those cues appear in irrelevant frames. To overcome this challenge, we propose EI-VLG, a VLG method that leverages richer textual information provided by a Multi-modal Large Language Model (MLLM) as a proxy for human experiences, helping to effectively exclude irrelevant frames. We validate the effectiveness of the proposed method via extensive experiments on the challenging EgoNLQ benchmark.
Taxonomy
Topics: Subtitles and Audiovisual Media · Multimodal Machine Learning Applications · Natural Language Processing Techniques
