HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding

Haopeng Jin; Hongzhu Yi; Wenlong Zhao; Jinwen Luo; Shani Ye; Zhenyu Guan; Shiquan Dong; Tiankun Yang; Tao Yu

arXiv:2605.08158·cs.CV·May 12, 2026


TL;DR

HY-Himmel introduces a hierarchical multi-stream encoding framework that improves long video understanding efficiency and motion perception by separating semantic and motion processing and using a lightweight tri-stream adapter.

Contribution

The paper proposes a novel hierarchical framework with a tri-stream adapter for efficient, dense motion encoding in long videos, surpassing dense baselines with fewer tokens.

Findings

01

HY-Himmel outperforms the dense 32-frame baseline by 2.3 percentage points.

02

HY-Himmel uses 3.6x fewer context tokens than dense methods.

03

Full tri-stream encoding is necessary for observed performance gains.

Abstract

Long-video understanding with multimodal language models suffers from three compounding bottlenecks: heavy decode cost to obtain dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. We present HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to the expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter that distils motion evidence from motion-vector maps, residual maps, and I-frame context into aligned motion tokens. These tokens are injected into the LLM via a differentiable placeholder mechanism after a dedicated Stage-1 contrastive alignment that places the motion representation in a geometry…
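The interleaving and placeholder injection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the sentinel id `MOTION_PLACEHOLDER`, the function names, the averaging used in place of the learned tri-stream adapter, and the layout (each anchor I-frame followed by the motion tokens of the interval it opens) are all assumptions made for illustration. The sketch uses plain Python lists, so it shows only the splice itself; a differentiable version would perform the same replacement with tensor operations in an autodiff framework.

```python
from typing import List, Sequence

Vector = List[float]

# Hypothetical sentinel id marking where a motion token will be spliced in.
MOTION_PLACEHOLDER = -1


def fuse_tri_stream(mv: Sequence[float],
                    residual: Sequence[float],
                    iframe_ctx: Sequence[float]) -> Vector:
    """Toy stand-in for the adapter: average three same-length feature
    vectors (motion-vector, residual, I-frame context) into one token.
    The real adapter is learned; averaging is only a placeholder."""
    if not (len(mv) == len(residual) == len(iframe_ctx)):
        raise ValueError("all three streams must have the same feature size")
    return [(a + b + c) / 3.0 for a, b, c in zip(mv, residual, iframe_ctx)]


def build_interleaved_ids(num_anchors: int,
                          tokens_per_anchor: int,
                          motion_tokens_per_interval: int) -> List[int]:
    """Assumed layout: [anchor tokens][motion placeholders], repeated once
    per anchor I-frame. Anchor token ids are consecutive integers."""
    ids: List[int] = []
    next_id = 0
    for _ in range(num_anchors):
        for _ in range(tokens_per_anchor):
            ids.append(next_id)
            next_id += 1
        ids.extend([MOTION_PLACEHOLDER] * motion_tokens_per_interval)
    return ids


def inject_motion_tokens(ids: Sequence[int],
                         embed_table: Sequence[Vector],
                         motion_tokens: Sequence[Vector]) -> List[Vector]:
    """Replace each placeholder, in order, with the next motion token;
    every other id is looked up in the embedding table."""
    motion_iter = iter(motion_tokens)
    out: List[Vector] = []
    for tok in ids:
        if tok == MOTION_PLACEHOLDER:
            try:
                out.append(next(motion_iter))
            except StopIteration:
                raise ValueError("fewer motion tokens than placeholders")
        else:
            out.append(embed_table[tok])
    if next(motion_iter, None) is not None:
        raise ValueError("more motion tokens than placeholders")
    return out


if __name__ == "__main__":
    ids = build_interleaved_ids(num_anchors=2, tokens_per_anchor=2,
                                motion_tokens_per_interval=1)
    table = [[float(i), float(i)] for i in range(4)]
    motion = fuse_tri_stream([3.0, 0.0], [0.0, 3.0], [3.0, 3.0])
    sequence = inject_motion_tokens(ids, table, [motion, motion])
    print(ids)
    print(sequence)
```

In this toy layout the context length grows with the number of anchors times the sum of `tokens_per_anchor` and `motion_tokens_per_interval`, which is the quantity a sparse-anchor design tries to keep small relative to encoding every frame densely.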
