Tripping through time: Efficient Localization of Activities in Videos
Meera Hahn, Asim Kadav, James M. Rehg, Hans Peter Graf

TL;DR
This paper introduces TripNet, an efficient end-to-end system for localizing activities in videos using language queries, which reduces processing time by selectively skipping parts of long videos while maintaining high accuracy.
Contribution
TripNet is the first system to combine gated attention and reinforcement learning for efficient, accurate activity localization in untrimmed videos.
Findings
TripNet achieves high accuracy on multiple datasets.
It processes only 32-41% of each video, saving processing time.
It effectively aligns textual and visual content.
Abstract
Localizing moments in untrimmed videos via language queries is a new and interesting task that requires the ability to accurately ground language into video. Previous works have approached this task by processing the entire video, often more than once, to localize relevant activities. In real-world applications of this approach, such as video surveillance, efficiency is a key system requirement. In this paper, we present TripNet, an end-to-end system that uses a gated attention architecture to model fine-grained textual and visual representations in order to align text and video content. Furthermore, TripNet uses reinforcement learning to efficiently localize relevant activity clips in long videos by learning how to intelligently skip around the video. It extracts visual features for a few frames to perform activity classification. In our evaluation over Charades-STA, ActivityNet…
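The abstract names a gated attention architecture but does not spell out its layers. As a rough illustration only, the sketch below assumes one common formulation: a sigmoid gate is computed from the sentence embedding and multiplied channel-wise into the visual feature. The function name `gated_attention` and the parameters `weights` and `bias` are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention(visual, text, weights, bias):
    """Scale each visual channel by a gate computed from the text.

    visual  : list of C channel activations for one clip (assumed pooled)
    text    : list of D sentence-embedding values
    weights : C x D nested list, a hypothetical learned projection
    bias    : list of C bias terms, also hypothetical
    """
    # One gate per visual channel: sigmoid(w_c . text + b_c)
    gates = [
        sigmoid(sum(w * t for w, t in zip(row, text)) + b)
        for row, b in zip(weights, bias)
    ]
    # Element-wise (Hadamard) product of gates and visual activations
    return [g * v for g, v in zip(gates, visual)]


if __name__ == "__main__":
    visual = [2.0, 4.0]
    text = [1.0]
    # With zero weights and zero bias every gate is exactly 0.5
    print(gated_attention(visual, text, [[0.0], [0.0]], [0.0, 0.0]))
```

In this formulation, channels whose gates approach 0 are suppressed and channels whose gates approach 1 pass through, which is one way a query could emphasize the visual channels relevant to it.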
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
