HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling

Joungbin An; Kristen Grauman

arXiv:2510.23043 · cs.CV · April 1, 2026

TL;DR

HieraMamba is a hierarchical architecture built from Anchor-MambaPooling blocks for video temporal grounding. It captures both global context and fine-grained temporal detail in long videos.

Contribution

The paper proposes a hierarchical model built from Anchor-MambaPooling (AMP) blocks and trained with two contrastive losses, anchor-conditioned and segment-pooled, to improve temporal localization accuracy in untrimmed videos.

Findings

01. Sets a new state of the art on the Ego4D-NLQ, MAD, and TACoS datasets.

02. Preserves temporal fidelity across multiple scales.

03. Achieves precise localization in long, untrimmed videos.
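
Localization quality in temporal grounding is commonly scored by the temporal intersection-over-union (tIoU) between a predicted segment and the ground-truth segment. This page does not say which metrics the paper reports, so the following is only a minimal sketch of that standard measure; the function name `temporal_iou` and the example timestamps are illustrative assumptions, not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds.

    Illustrative helper; not taken from the HieraMamba paper.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical prediction and ground truth: 5 s of overlap over a 15 s union.
score = temporal_iou((10.0, 20.0), (15.0, 25.0))
```

A prediction is then usually counted as correct when its tIoU exceeds a chosen threshold.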

Abstract

Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or relying on fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which utilize Mamba's selective scanning to produce compact anchor tokens that summarize video content at multiple granularities. Two complementary objectives, anchor-conditioned and segment-pooled contrastive losses, encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state-of-the-art on Ego4D-NLQ, MAD, and…
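
To make the idea of multi-granularity summaries and a contrastive objective concrete, here is a rough sketch under loudly stated assumptions. It is not the authors' AMP block: plain mean pooling stands in for Mamba's selective scanning, and a generic InfoNCE loss stands in for the paper's anchor-conditioned and segment-pooled losses, whose exact form is not given on this page. The names `build_pyramid` and `info_nce`, the stride, and the temperature are all hypothetical.

```python
import numpy as np


def build_pyramid(frames, num_levels=3, stride=2):
    """Return feature arrays, each `stride` times shorter than the previous one.

    ASSUMPTION: mean pooling is a stand-in; it is NOT the Mamba-based AMP block.
    """
    levels = [frames]
    for _ in range(num_levels - 1):
        x = levels[-1]
        usable = (x.shape[0] // stride) * stride  # drop any ragged tail
        pooled = x[:usable].reshape(-1, stride, x.shape[1]).mean(axis=1)
        levels.append(pooled)
    return levels


def info_nce(anchors, targets, temperature=0.1):
    """Generic InfoNCE: anchors[i] should match targets[i] and no other row.

    ASSUMPTION: a textbook contrastive loss, not the paper's exact objective.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = a @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 8))  # 16 hypothetical frame features of width 8
pyramid = build_pyramid(frames)
shapes = [level.shape for level in pyramid]

# Correctly paired rows should give a lower loss than mismatched rows.
loss_matched = info_nce(pyramid[1], pyramid[1])
loss_shuffled = info_nce(pyramid[1], pyramid[1][::-1])
```

In this toy setup each coarser level halves the temporal length, and the loss is lowest when every summary token is paired with the segment it was pooled from, which is the general intuition behind keeping coarse tokens both locally faithful and globally discriminative.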
