# Localizing Moments in Long Video Via Multimodal Guidance

**Authors:** Wayner Barrios, Mattia Soldan, Alberto Mario Ceballos-Arroyo, Fabian Caba Heilbron, Bernard Ghanem

arXiv: 2302.13372 · 2023-10-17

## TL;DR

This paper introduces a multimodal guidance framework that improves moment localization in long videos by emphasizing describable windows, outperforming state-of-the-art models by 4.1% on MAD and 4.52% on Ego4D (NLQ).

## Contribution

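The paper proposes a guided grounding framework that pairs a Guidance Model with a base grounding model. The Guidance Model, offered in Query-Agnostic and Query-Dependent designs, emphasizes describable windows, while the base grounding model analyzes short temporal windows to determine which segments match the language query.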

## Key findings

- Outperforms state-of-the-art models by 4.1% on MAD
- Outperforms state-of-the-art models by 4.52% on Ego4D (NLQ)
- Effectively identifies describable windows in long videos

## Abstract

The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with interesting findings: current grounding methods alone fail at tackling this challenging task and setup due to their inability to process long video sequences. In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model. The Guidance Model emphasizes describable windows, while the base grounding model analyzes short temporal windows to determine which segments accurately match a given language query. We offer two designs for the Guidance Model: Query-Agnostic and Query-Dependent, which balance efficiency and accuracy. Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% in MAD and 4.52% in Ego4D (NLQ), respectively. Code, data and MAD's audio features necessary to reproduce our experiments are available at: https://github.com/waybarrios/guidance-based-video-grounding.
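The two-stage design described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, purely for illustration, that each window's guidance score is multiplied by the confidence of every proposal the base grounding model produces inside that window, and that the result is used to re-rank proposals across the whole video. The abstract does not state the combination rule, and the names `Proposal`, `sliding_windows`, and `guided_rerank` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Proposal:
    start: float  # moment start, in seconds
    end: float    # moment end, in seconds
    score: float  # confidence from the base grounding model (assumed in [0, 1])


def sliding_windows(duration: float, window: float, stride: float) -> List[Tuple[float, float]]:
    """Split [0, duration] into short windows; the last one may be truncated."""
    windows = []
    t = 0.0
    while t < duration:
        windows.append((t, min(t + window, duration)))
        t += stride
    return windows


def guided_rerank(
    proposals_per_window: Sequence[Sequence[Proposal]],
    guidance_scores: Sequence[float],
    top_k: int = 5,
) -> List[Proposal]:
    """Re-rank proposals by (grounding score * guidance score of their window).

    Multiplication is an assumption made for this sketch; a low guidance
    score down-weights every proposal from a non-describable window.
    """
    if len(proposals_per_window) != len(guidance_scores):
        raise ValueError("need exactly one guidance score per window")
    ranked = []
    for proposals, guidance in zip(proposals_per_window, guidance_scores):
        for p in proposals:
            ranked.append(Proposal(p.start, p.end, p.score * guidance))
    # Highest combined score first.
    ranked.sort(key=lambda p: p.score, reverse=True)
    return ranked[:top_k]
```

In this sketch, a Query-Agnostic guidance model would supply one `guidance_scores` list per video that can be reused for every query, whereas a Query-Dependent one would recompute the list for each query. This mirrors the efficiency and accuracy trade-off mentioned in the abstract, under the assumptions stated above.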

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.13372/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/2302.13372/full.md

## References

46 references — full list in the complete paper: https://tomesphere.com/paper/2302.13372/full.md

---
Source: https://tomesphere.com/paper/2302.13372