Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

Dongliang He; Xiang Zhao; Jizhou Huang; Fu Li; Xiao Liu; Shilei Wen

arXiv:1901.06829·cs.CV·January 23, 2019·19 cites

TL;DR

This paper introduces a reinforcement learning framework for video grounding: an agent localizes a natural language description in a video by adjusting the temporal boundaries step by step, and achieves state-of-the-art results while observing 10 or fewer clips per video.

Contribution

The paper formulates video grounding as a sequential decision process and trains the agent with reinforcement learning, using multi-task learning on additional supervised boundary information to improve efficiency and accuracy.
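The sequential formulation can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration, not taken from the paper or its repository: the action names, the step size `delta`, the starting window, and the greedy oracle policy, which peeks at the ground truth and stands in for a learned policy that would instead condition on video and sentence features. Temporal IoU is used as the overlap measure.

```python
# Hypothetical action set; the names are illustrative only.
ACTIONS = ("start_fwd", "start_back", "end_fwd", "end_back", "stop")


def temporal_iou(pred, gt):
    """Temporal intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def apply_action(window, action, duration, delta):
    """Move one boundary by delta seconds, clamped to [0, duration]."""
    start, end = window
    if action == "start_fwd":
        start += delta
    elif action == "start_back":
        start -= delta
    elif action == "end_fwd":
        end += delta
    elif action == "end_back":
        end -= delta
    start = min(max(0.0, start), duration)
    end = min(max(0.0, end), duration)
    # Reject moves that would produce an empty or inverted window.
    return (start, end) if end > start else window


def make_greedy_oracle(gt, duration, delta):
    """Stand-in policy (assumption): pick the action that most improves tIoU."""
    def policy(window):
        best_action, best_score = "stop", temporal_iou(window, gt)
        for action in ACTIONS[:-1]:
            candidate = apply_action(window, action, duration, delta)
            score = temporal_iou(candidate, gt)
            if score > best_score:
                best_action, best_score = action, score
        return best_action
    return policy


def run_episode(policy, duration, init_window, max_steps=10, delta=5.0):
    """Adjust the window until the policy stops or the step budget runs out."""
    window = init_window
    trajectory = [window]
    for _ in range(max_steps):
        action = policy(window)
        if action == "stop":
            break
        window = apply_action(window, action, duration, delta)
        trajectory.append(window)
    return window, trajectory


if __name__ == "__main__":
    gt = (30.0, 60.0)  # made-up ground-truth segment, in seconds
    policy = make_greedy_oracle(gt, duration=100.0, delta=5.0)
    final, trajectory = run_episode(policy, 100.0, (25.0, 75.0))
    print(final, len(trajectory) - 1, temporal_iou(final, gt))
```

The step budget `max_steps` plays the role of the limit on observed clips: the episode ends as soon as the policy emits `stop` or the budget is exhausted, so no exhaustive enumeration of candidate windows is needed.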

Findings

01. Achieves state-of-the-art performance on the ActivityNet'18 DenseCaption dataset.

02. Outperforms existing methods while observing only 10 or fewer clips per video.

03. Demonstrates the effectiveness of reinforcement learning for temporal language grounding.

Abstract

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding window over the entire video or exhaustively ranking all possible clip-sentence pairs in a pre-segmented video, which inevitably suffer from exhaustively enumerated candidates. To alleviate this problem, we formulate this task as a problem of sequential decision making by learning an agent which regulates the temporal grounding boundaries progressively based on its policy. Specifically, we propose a reinforcement learning based framework improved by multi-task learning and it shows steady performance gains by considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on ActivityNet'18 DenseCaption dataset and…

Code & Models

Repositories

WuJie1010/Temporally-language-grounding (PyTorch)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization