Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
Minghang Zheng, Yuxin Peng, Benyuan Sun, Yi Yang, Yang Liu

TL;DR
This paper introduces a hierarchical event memory framework for online video temporal grounding, enabling accurate, low-latency event localization in streaming videos by modeling event-level information and preserving long-term memory.
Contribution
The paper proposes a hierarchical event memory and an event-based prediction framework for online video temporal grounding (OnVTG).
Findings
Achieves state-of-the-art results on TACoS, ActivityNet Captions, and MAD datasets.
Effectively models long-term event information in streaming videos.
Enhances real-time event prediction accuracy.
Abstract
In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given text query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. Because online videos are streaming inputs that can go on indefinitely, storing all historical inputs is impractical and inefficient. Existing OnVTG models employ a memory to store recent historical video frame features and predict scores indicating whether the current frame corresponds to the start or end time of the target event. However, these methods lack effective event modeling and cannot retain long-term historical information, leading to low performance. To tackle these challenges, we propose a hierarchical event memory for OnVTG. We propose an event-based OnVTG framework that makes…
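The recent-frame memory that the abstract attributes to existing OnVTG models can be pictured as a bounded first-in, first-out buffer. The sketch below is only an illustration of that idea and of why it loses long-term history; it is not the paper's hierarchical event memory. The names `RecentFrameMemory` and `run_stream` and the `capacity` parameter are hypothetical, and plain integers stand in for frame features.

```python
from collections import deque


class RecentFrameMemory:
    """Hypothetical fixed-capacity buffer holding only the newest frame features."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, feature):
        self.buffer.append(feature)

    def __len__(self):
        return len(self.buffer)


def run_stream(frames, capacity):
    """Feed frames one at a time, as in a stream; no future frame is visible."""
    memory = RecentFrameMemory(capacity)
    sizes = []
    for feature in frames:
        memory.add(feature)
        sizes.append(len(memory))  # memory footprint stays bounded
    return memory, sizes


# Integers 0..9 stand in for per-frame feature vectors (an assumption).
memory, sizes = run_stream(range(10), capacity=4)
print(list(memory.buffer))  # only the four most recent frames remain
print(sizes)
```

With a capacity of 4, frames 0 through 5 are gone by the end of the stream, so an event that started early can no longer be consulted. That is the limitation the abstract points to when it says these methods cannot retain long-term historical information.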
