# Weakly Supervised Video Moment Retrieval From Text Queries

**Authors:** Niluthpol Chowdhury Mithun, Sujoy Paul, Amit K. Roy-Chowdhury

arXiv: 1904.03282 · 2019-09-06

## TL;DR

This paper introduces a weakly supervised approach to retrieving video moments from text queries. It trains on video-text pairs alone, with no temporal boundary annotations, and reports performance comparable to state-of-the-art fully supervised methods on two benchmark datasets.

## Contribution

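The work presents a joint visual-semantic embedding framework that uses Text-Guided Attention (TGA) to learn which video segments are relevant to a sentence, using only video-level sentence descriptions for training. The same attention is used at test time to retrieve the relevant moments.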

## Key findings

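- On two benchmark datasets, the method achieves performance comparable to state-of-the-art fully supervised approaches.
- Text-Guided Attention exploits the latent alignment between video frames and sentence descriptions.
- Training needs only video-level sentence descriptions; no temporal boundaries are annotated.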

## Abstract

A few recent methods have been proposed for text to video moment retrieval using natural language queries, but they require full supervision during training. However, acquiring a large number of training videos with temporal boundary annotations for each text description is extremely time-consuming and often not scalable. In order to cope with this issue, in this work, we introduce the problem of learning from weak labels for the task of text to video moment retrieval. The supervision is weak because, during training, we only have access to the video-text pairs rather than the temporal extent of the video to which different text descriptions relate. We propose a joint visual-semantic embedding based framework that learns the notion of relevant segments from video using only video-level sentence descriptions. Specifically, our main idea is to utilize latent alignment between video frames and sentence descriptions using Text-Guided Attention (TGA). TGA is then used during the test phase to retrieve relevant moments. Experiments on two benchmark datasets demonstrate that our method achieves comparable performance to state-of-the-art fully supervised approaches.
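The abstract describes TGA only at a high level, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes frame and sentence features already live in a shared embedding space, that attention is a softmax over frame-to-sentence cosine similarities, and that a moment is retrieved as the fixed-length window with the highest total attention. The function names, the toy feature values, and the sliding-window rule are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def text_guided_attention(frame_feats, text_feat):
    """Assumed form of TGA: softmax over frame-to-sentence similarities."""
    sims = [cosine(f, text_feat) for f in frame_feats]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def pooled_video_embedding(frame_feats, weights):
    """Attention-weighted average of frame features (one vector per video)."""
    dim = len(frame_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_feats)) for d in range(dim)]


def retrieve_moment(weights, window):
    """Hypothetical test-time rule: pick the window with the highest total attention.

    Returns (start, end) frame indices, with end exclusive.
    """
    best_start = max(
        range(len(weights) - window + 1),
        key=lambda s: sum(weights[s:s + window]),
    )
    return best_start, best_start + window


# Toy 2-D features, assumed to be already projected into the joint space.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
query = [0.0, 1.0]

attn = text_guided_attention(frames, query)
video_vec = pooled_video_embedding(frames, attn)
start, end = retrieve_moment(attn, window=2)
```

In this toy example the two frames that point in the same direction as the query (indices 2 and 3) receive the largest weights, so the retrieved window is `(2, 4)`. During training, a pooled vector such as `video_vec` could be compared with the sentence embedding using only video-level pairs, which is what makes the supervision weak. The specific training loss is not stated in this summary and is left out here.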

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.03282/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1904.03282/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/1904.03282/full.md

---
Source: https://tomesphere.com/paper/1904.03282