State-Space Hierarchical Compression with Gated Attention and Learnable Sampling for Hour-Long Video Understanding in Large Multimodal Models

Geewook Kim; Minjoon Seo

arXiv:2506.13564 · cs.CV · February 10, 2026

TL;DR

This paper introduces a hierarchical video compression method using state-space models with gated attention and learnable sampling, enabling efficient hour-long video understanding in large multimodal models.

Contribution

The paper presents a state-space hierarchical compression framework that significantly reduces token usage while maintaining performance on long-video tasks.

Findings

01 Achieves competitive results on hour-long video benchmarks.

02 Reduces token budget substantially compared to existing methods.

03 Demonstrates scalability and generality across multiple datasets.

Abstract

We propose an efficient framework to compress massive video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion arising from hour-long videos. Our design leverages a bidirectional state-space model equipped with a gated skip connection and a learnable weighted-average pooling mechanism applied to periodically inserted learned queries. This structure enables hierarchical downsampling across both spatial and temporal dimensions, preserving performance in a cost-effective manner. Across challenging hour-long video understanding tasks, our approach demonstrates competitive results against state-of-the-art models while significantly reducing the overall token budget. Notably, replacing our state-space model with conventional modules results in substantial performance degradation, highlighting the advantages of the proposed state-space…
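The mechanism named in the abstract can be illustrated with a small NumPy sketch. It is one possible reading and not the authors' implementation: a toy diagonal linear recurrence stands in for the state-space model, and every name below (`linear_scan`, `compress`, `period`, `gate_w`, `pool_logits`, `decay`) is a hypothetical choice made for illustration. The sketch inserts learned queries after every `period` frame features, mixes the sequence in both directions, blends the result with the input through a sigmoid gate, and reduces the query outputs of each window to one token with softmax-normalized pooling weights.

```python
import numpy as np

def linear_scan(x, decay):
    # Toy stand-in for an SSM: h_t = decay * h_{t-1} + (1 - decay) * x_t.
    h = np.zeros_like(x)
    state = np.zeros(x.shape[1])
    for t in range(x.shape[0]):
        state = decay * state + (1.0 - decay) * x[t]
        h[t] = state
    return h

def bidirectional_scan(x, decay):
    # Average a left-to-right pass and a right-to-left pass.
    fwd = linear_scan(x, decay)
    bwd = linear_scan(x[::-1], decay)[::-1]
    return 0.5 * (fwd + bwd)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compress(frames, queries, period, gate_w, pool_logits, decay=0.9):
    # frames: (T, D), queries: (n_q, D), gate_w: (D, D), pool_logits: (n_q,)
    # All parameters are assumed to be learned; here they are plain arrays.
    T, D = frames.shape
    n_q = queries.shape[0]
    assert T % period == 0, "sketch assumes T is a multiple of period"
    n_win = T // period

    # Insert the same learned queries after every `period` frame features.
    groups = frames.reshape(n_win, period, D)
    q = np.broadcast_to(queries, (n_win, n_q, D))
    seq = np.concatenate([groups, q], axis=1).reshape(-1, D)

    # Bidirectional mixing followed by a gated skip connection.
    mixed = bidirectional_scan(seq, decay)
    gate = sigmoid(seq @ gate_w)
    out = gate * mixed + (1.0 - gate) * seq

    # Keep only the query positions, then pool them with learnable weights.
    q_out = out.reshape(n_win, period + n_q, D)[:, period:, :]
    w = np.exp(pool_logits - pool_logits.max())
    w = w / w.sum()
    return np.einsum("q,wqd->wd", w, q_out)

rng = np.random.default_rng(0)
T, D, period, n_q = 12, 8, 4, 2
frames = rng.normal(size=(T, D))
queries = rng.normal(size=(n_q, D))
gate_w = 0.1 * rng.normal(size=(D, D))
pool_logits = np.zeros(n_q)

compressed = compress(frames, queries, period, gate_w, pool_logits)
print(compressed.shape)  # (3, 8): 12 frame features reduced to 3 tokens
```

Applying the same step again to the compressed sequence would give a second, coarser level, which is one way to picture the hierarchical downsampling the abstract describes. The compression ratio in this sketch is simply `period`, and nothing here reflects the ratios or layer configuration used in the paper.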

Taxonomy

Topics: Advanced Data Compression Techniques · Video Analysis and Summarization · Music and Audio Processing

Methods: Dropout · Dense Connections · Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Transformer