HiCM$^2$: Hierarchical Compact Memory Modeling for Dense Video Captioning
Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim

TL;DR
This paper introduces HiCM$^2$, a hierarchical compact memory model inspired by human cognition, which enhances dense video captioning by improving memory recall and achieving state-of-the-art results on benchmark datasets.
Contribution
The paper proposes a hierarchical memory structure and a matching hierarchical memory reading module for dense video captioning, inspired by the human memory hierarchy. The memory is kept compact by clustering memory events and summarizing them with large language models.
Findings
Achieves state-of-the-art performance on the YouCook2 dataset.
Improves dense video captioning accuracy through hierarchical memory recall.
Demonstrates effectiveness of memory clustering and summarization methods.
Abstract
With the growing demand for solutions to real-world video challenges, interest in dense video captioning (DVC) has been on the rise. DVC involves the automatic captioning and localization of untrimmed videos. Several studies highlight the challenges of DVC and introduce improved methods utilizing prior knowledge, such as pre-training and external memory. In this research, we propose a model that leverages the prior knowledge of human-oriented hierarchical compact memory inspired by human memory hierarchy and cognition. To mimic human-like memory recall, we construct a hierarchical memory and a hierarchical memory reading module. We build an efficient hierarchical compact memory by employing clustering of memory events and summarization using large language models. Comparative experiments demonstrate that this hierarchical memory recall process improves the performance of DVC by…
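The memory construction and coarse-to-fine reading described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_memory`, `read_memory`), the k-means-style clustering with cosine assignment, the two-level depth, and the toy 2-D embeddings are all assumptions made for illustration, and the LLM summarization step is replaced by a simple string join as a placeholder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, k, iters=10):
    """k-means-style clustering with cosine assignment.
    Deterministic init (first k vectors) keeps the sketch reproducible."""
    centroids = [list(v) for v in vectors[:k]]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, v in enumerate(vectors):
            best = max(range(k), key=lambda c: cosine(v, centroids[c]))
            groups[best].append(i)
        # Recompute each centroid; keep the old one if a cluster is empty.
        centroids = [
            mean([vectors[i] for i in g]) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    return [g for g in groups if g]

def build_memory(events, k):
    """events: list of (caption, embedding). Returns one node per cluster."""
    vectors = [emb for _, emb in events]
    nodes = []
    for group in kmeans(vectors, k):
        members = [events[i] for i in group]
        nodes.append({
            # Placeholder: an LLM-written summary would go here (assumption).
            "summary": " / ".join(caption for caption, _ in members),
            "embedding": mean([emb for _, emb in members]),
            "members": members,
        })
    return nodes

def read_memory(nodes, query, top_clusters=1, top_events=2):
    """Coarse-to-fine read: rank cluster nodes first, then events inside them."""
    ranked = sorted(nodes, key=lambda n: cosine(query, n["embedding"]), reverse=True)
    hits = []
    for node in ranked[:top_clusters]:
        best = sorted(node["members"], key=lambda m: cosine(query, m[1]), reverse=True)
        hits.extend(caption for caption, _ in best[:top_events])
    return hits
```

With four toy events, two about cooking and two about cycling, `build_memory(events, k=2)` groups them into two nodes, and a cooking-like query passed to `read_memory` only scans the members of the best-matching cluster rather than the whole memory. That narrowing is the point of a hierarchical read.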
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
