HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Shang-Hong Lai, Winston H. Hsu

TL;DR
HERMES is a novel framework for long-form video understanding that combines efficient episodic compression and semantic retrieval to improve accuracy and reduce computational costs, outperforming existing models.
Contribution
This paper introduces HERMES, an approach built on two modules that improve long-video understanding by efficiently capturing temporal and semantic information. The modules can run as a standalone system or be integrated into existing video-language models.
Findings
Reduces inference latency by up to 43%
Decreases memory usage by 46%
Achieves state-of-the-art results on multiple benchmarks
Abstract
Long-form video understanding presents unique challenges that extend beyond traditional short-video analysis approaches, particularly in capturing long-range dependencies, processing redundant information efficiently, and extracting high-level semantic concepts. To address these challenges, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, featuring two versatile modules that can enhance existing video-language models or operate as a standalone system. Our Episodic COmpressor (ECO) efficiently aggregates representations from micro to semi-macro levels, reducing computational overhead while preserving temporal dependencies. Our Semantics ReTRiever (SeTR) enriches these representations with semantic information by focusing on broader context, dramatically reducing…
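To make the idea of episodic compression more concrete, here is a minimal sketch and not the authors' implementation. It assumes that frame features stream into a fixed-capacity buffer and that, whenever the buffer overflows, the two most similar adjacent entries are merged by a count-weighted mean, so temporal order is preserved. The function names `cosine` and `compress_episodes`, the `capacity` parameter, and the use of cosine similarity are all hypothetical choices made for illustration.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def compress_episodes(frames: List[Vector], capacity: int) -> List[Vector]:
    """Hypothetical episodic compression over a stream of frame features.

    Assumption (not from the paper): when the buffer exceeds `capacity`,
    the most similar pair of *adjacent* entries is merged into their
    count-weighted mean, which keeps the buffer in temporal order.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    buffer: List[Vector] = []
    counts: List[int] = []  # number of original frames folded into each entry
    for frame in frames:
        buffer.append(list(frame))
        counts.append(1)
        while len(buffer) > capacity:
            # Index of the left element of the most similar adjacent pair.
            best = max(
                range(len(buffer) - 1),
                key=lambda i: cosine(buffer[i], buffer[i + 1]),
            )
            n1, n2 = counts[best], counts[best + 1]
            merged = [
                (n1 * x + n2 * y) / (n1 + n2)
                for x, y in zip(buffer[best], buffer[best + 1])
            ]
            # Replace the pair with the single merged entry.
            buffer[best:best + 2] = [merged]
            counts[best:best + 2] = [n1 + n2]
    return buffer


# Four frames forming two visually distinct "episodes", compressed to two entries.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
episodes = compress_episodes(frames, capacity=2)
```

In this toy input, the two identical leading frames collapse into one entry and the two identical trailing frames into another, so `episodes` holds one vector per episode. A real system would operate on high-dimensional visual tokens rather than 2-D lists.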
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The authors enhance Q-former-based models, such as MA-LMM, by integrating episodic and semantic information. The architecture is clearly defined and appears to be replicable.
- The paper reports impressive performance metrics across various long-video benchmarks, indicating that HERMES effectively addresses the complexities inherent in long-form video content.
- While the use of memory mechanisms and temporal information compression is valuable, the approach does not significantly advance the current state of the art. The paper primarily builds on the MA-LMM framework, with its main innovations centered around the memory components. The concepts introduced in the Episodic COmpressor and Semantics reTRiever share similarities with the memory bank and mechanisms used in MA-LMM. Furthermore, comparable memory structures are present in MovieChat, which di
1. The analysis of episodic and semantic memory makes sense and is a promising way to address long video understanding.
2. The problem definition is clear and easy to follow.
3. The architecture shows significant improvement in inference speed compared to existing memory-augmented video LLMs.
1. The architecture design does not reflect the authors' analysis of episodic and semantic memory. Simply using QFormer or token merging strategies can compress video tokens, but is not sufficient to construct the structural memory.
2. The authors are also expected to show visualizations of the token merging in semantic retrieval to reveal the structure of the semantic memory compression.
3. Results on more recent long-video benchmarks, e.g., VideoMME, MLVU, LVBench, etc., are desired.
- The idea of using "Episodes and Semantics" for understanding videos is novel.
- The technical contribution of a new multi-modal LLM for video understanding is solid.
- The experimental results are convincing and state-of-the-art (SOTA).
- In Section 3, the authors claim that "the core ideas of episodic memory compression (ECO) and semantic knowledge retrieval (SeTR) can be applied to other models," but they do not conduct experiments to support this claim.
- Figure 6 only shows the good cases, while I am curious about the failure cases that cannot be handled by the ECO+SeTR approach.
- Some recent works on video understanding are missing. For example, the paradigm of 2D CNN + temporal modeling was popular between 2019-2022, and
1. The writing in the paper is clear. I strongly resonate with the introduction regarding episodes and semantics.
2. The ablation study is particularly thorough, with Tables 4 and 5 highlighting the significant impact of ECO and SeTR on the model's performance.
1. Figure 1 is misleading and somewhat overstated. From my perspective, the paper primarily focuses on token merging and compression. However, Figure 1 gives the impression that this work can develop an understanding of semantics and episodes at the language level.
2. The distinction between ToMe and the discussed approach is unclear. In fact, Section 3.4 closely resembles ToMe, where ToMe is executed just once.
3. The comparisons made are somewhat outdated. There have been several recent deve
Taxonomy
Topics: Scientific Computing and Data Management · Topic Modeling
