Tripping through time: Efficient Localization of Activities in Videos

Meera Hahn; Asim Kadav; James M. Rehg; Hans Peter Graf

arXiv:1904.09936 · cs.CV · August 19, 2020 · 41 cites

TL;DR

This paper introduces TripNet, an efficient end-to-end system for localizing activities in videos using language queries, which reduces processing time by selectively skipping parts of long videos while maintaining high accuracy.
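
The efficiency idea can be illustrated with a toy environment in which an agent slides a fixed-length window over the frames and only frames that fall inside the window are ever processed. This is a minimal sketch, not TripNet's implementation: the action names, the step sizes (8 and 64 frames), and the 32-frame window are hypothetical choices, and a hand-written action sequence stands in for the learned policy.

```python
from dataclasses import dataclass, field

# Hypothetical discrete action set for a temporal-window agent.
# Values are frame shifts; the real TripNet actions may differ.
ACTIONS = {
    "FORWARD_SMALL": 8,
    "BACKWARD_SMALL": -8,
    "FORWARD_LARGE": 64,
    "BACKWARD_LARGE": -64,
    "STOP": 0,
}


@dataclass
class WindowEnv:
    num_frames: int
    window: int = 32
    start: int = 0
    seen: set = field(default_factory=set)  # indices of frames ever processed

    def observe(self):
        # Only frames inside the current window are decoded; record them.
        end = min(self.start + self.window, self.num_frames)
        self.seen.update(range(self.start, end))
        return (self.start, end)

    def step(self, action):
        # Returns (current window, done flag).
        if action == "STOP":
            return self.observe(), True
        max_start = self.num_frames - self.window
        # Clamp so the window stays inside the video.
        self.start = max(0, min(self.start + ACTIONS[action], max_start))
        return self.observe(), False

    def fraction_seen(self):
        return len(self.seen) / self.num_frames


env = WindowEnv(num_frames=1000)
env.observe()
for action in ["FORWARD_LARGE", "FORWARD_LARGE", "FORWARD_SMALL", "STOP"]:
    window, done = env.step(action)
    if done:
        break
print(env.fraction_seen())  # 0.104
```

Starting from frame 0 on a 1,000-frame video, this sequence touches 104 distinct frames, so only 10.4% of the video is processed. A trained policy would pick each action from the query and the current window's features rather than from a fixed list.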

Contribution

TripNet is the first system to combine gated attention and reinforcement learning for efficient, accurate activity localization in untrimmed videos.
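
Gated attention can take several forms. One common version, sketched below under that assumption, computes a sigmoid gate from the sentence embedding and multiplies it channel-wise into the visual features, so the query decides which visual channels pass through. The function name, shapes, and the parameters `W` and `b` are illustrative; TripNet's exact encoders and layer sizes are not given here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_attention(text_emb, clip_feats, W, b):
    """Gate visual feature channels with a sentence embedding.

    text_emb:   (d_text,) sentence embedding (assumed, e.g. a final RNN state)
    clip_feats: (T, C) per-frame visual features for the current window
    W, b:       (C, d_text) and (C,) parameters of a hypothetical linear layer
    """
    gate = sigmoid(W @ text_emb + b)   # (C,), each entry in (0, 1)
    return clip_feats * gate[None, :]  # same gate applied to every frame


rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 6))   # 4 frames, 6 channels
query = rng.normal(size=(5,))     # 5-dim sentence embedding
fused = gated_attention(query, feats, rng.normal(size=(6, 5)), rng.normal(size=6))
print(fused.shape)  # (4, 6)
```

Because the gate lies in (0, 1), it can only attenuate channels, which keeps the fused features on the same scale as the visual input.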

Findings

01 TripNet achieves high accuracy on multiple datasets.
02 It looks at only 32-41% of each video, saving processing time.
03 It effectively aligns textual and visual content.

Abstract

Localizing moments in untrimmed videos via language queries is a new and interesting task that requires the ability to accurately ground language into video. Previous works have approached this task by processing the entire video, often more than once, to localize relevant activities. In real-world applications of this approach, such as video surveillance, efficiency is a key system requirement. In this paper, we present TripNet, an end-to-end system that uses a gated attention architecture to model fine-grained textual and visual representations in order to align text and video content. Furthermore, TripNet uses reinforcement learning to efficiently localize relevant activity clips in long videos by learning how to intelligently skip around the video. It extracts visual features for only a few frames to perform activity classification. In our evaluation over Charades-STA, ActivityNet…
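
Moment-localization benchmarks such as Charades-STA are typically scored by temporal intersection-over-union (IoU) between the predicted and ground-truth segments, and a reward for a skipping agent can be built from the same quantity. The sketch below assumes such a design: the per-step reward is the change in IoU minus a small, hypothetical `step_penalty` that discourages long searches. It is not claimed to be TripNet's exact reward.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals, in frames or seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def step_reward(prev_window, new_window, gt, step_penalty=0.01):
    # Hypothetical shaping: reward IoU improvement, minus a small cost per
    # move so that stopping early (processing fewer frames) pays off.
    return temporal_iou(new_window, gt) - temporal_iou(prev_window, gt) - step_penalty


print(temporal_iou((0, 10), (5, 15)))  # 5 / 15 overlap ratio
```

The penalty ties accuracy to efficiency: every extra move costs reward, so a policy that stops once IoU stops improving also processes fewer frames.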

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization