HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
Joungbin An, Kristen Grauman

TL;DR
HieraMamba introduces a hierarchical architecture with Anchor-MambaPooling blocks for precise video temporal grounding, effectively capturing both global context and fine-grained details in long videos.
Contribution
The paper proposes a hierarchical model built from Anchor-MambaPooling (AMP) blocks and trained with two contrastive losses, anchor-conditioned and segment-pooled, to improve temporal localization accuracy in untrimmed videos.
Findings
Sets a new state of the art on the Ego4D-NLQ, MAD, and TACoS datasets.
Effectively preserves temporal fidelity across multiple scales.
Achieves precise localization in long, untrimmed videos.
Abstract
Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or relying on fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which utilize Mamba's selective scanning to produce compact anchor tokens that summarize video content at multiple granularities. Two complementary objectives, anchor-conditioned and segment-pooled contrastive losses, encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state-of-the-art on Ego4D-NLQ, MAD, and…
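The idea of summarizing a video at multiple granularities can be illustrated with a toy temporal pyramid. This is a minimal sketch, not the authors' AMP block: plain mean pooling stands in for Mamba's selective scanning, and the names `mean_pool` and `build_pyramid`, the window size, and the number of levels are all hypothetical choices made for illustration.

```python
from typing import List

Vector = List[float]


def mean_pool(tokens: List[Vector], window: int) -> List[Vector]:
    """Summarize each non-overlapping window of tokens by its mean vector.

    Assumption: mean pooling is a simple stand-in for the learned
    anchor tokens described in the abstract.
    """
    pooled = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)])
    return pooled


def build_pyramid(frames: List[Vector], window: int = 2, levels: int = 3) -> List[List[Vector]]:
    """Return [frames, level-1 anchors, level-2 anchors, ...], coarsening each step."""
    pyramid = [frames]
    for _ in range(levels):
        if len(pyramid[-1]) <= 1:
            break  # nothing left to pool
        pyramid.append(mean_pool(pyramid[-1], window))
    return pyramid


# Eight toy "frame features": the first dimension encodes the frame index.
frames = [[float(t), 1.0] for t in range(8)]
pyramid = build_pyramid(frames)
lengths = [len(level) for level in pyramid]
print(lengths)
```

Each level halves the temporal resolution, so coarse levels carry global context while the finest level keeps per-frame detail. In the paper's setting the summaries are learned rather than averaged, and the contrastive objectives are what encourage them to stay informative.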
