HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding
Haopeng Jin, Hongzhu Yi, Wenlong Zhao, Jinwen Luo, Shani Ye, Zhenyu Guan, Shiquan Dong, Tiankun Yang, Tao Yu

TL;DR
HY-Himmel introduces a hierarchical multi-stream encoding framework that improves long video understanding efficiency and motion perception by separating semantic and motion processing and using a lightweight tri-stream adapter.
Contribution
The paper proposes a novel hierarchical framework with a tri-stream adapter for efficient, dense motion encoding in long videos, surpassing dense baselines with fewer tokens.
Findings
HY-Himmel outperforms the dense 32-frame baseline by 2.3 percentage points.
It uses 3.6x fewer context tokens than dense methods.
The full tri-stream encoding is necessary for the observed performance gains.
Abstract
Long-video understanding with multimodal language models suffers from three compounding bottlenecks: heavy decode cost to obtain dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. We present HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to the expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter that distils motion evidence from motion-vector maps, residual maps, and I-frame context into aligned motion tokens. These tokens are injected into the LLM via a differentiable placeholder mechanism after a dedicated Stage-1 contrastive alignment that places the motion representation in a geometry…
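The split described in the abstract, with sparse anchor I-frames going to the host ViT and the denser inter-frame intervals going to the lightweight adapter, can be illustrated with a minimal routing sketch. Everything here is an assumption made for illustration and not the authors' implementation: the names `plan_streams` and `context_tokens`, the frame-type string input, the uniform anchor subsampling, and the per-frame and per-interval token counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Plan:
    anchors: List[int]                # frame indices sent to the host ViT
    intervals: List[Tuple[int, int]]  # [start, end) spans handled by the adapter


def plan_streams(frame_types: str, max_anchors: int) -> Plan:
    """Split a frame-type string such as "IPPPIPPP" into anchors and intervals.

    Hypothetical helper: frames before the first I-frame are ignored, and
    I-frames that are not chosen as anchors stay inside an interval.
    """
    if max_anchors < 1:
        raise ValueError("max_anchors must be at least 1")
    i_frames = [i for i, t in enumerate(frame_types) if t == "I"]
    if not i_frames:
        raise ValueError("need at least one I-frame")
    # Uniformly subsample I-frames when there are more than the budget allows.
    if len(i_frames) > max_anchors:
        step = len(i_frames) / max_anchors
        i_frames = [i_frames[int(k * step)] for k in range(max_anchors)]
    # Each interval covers the frames strictly between consecutive anchors
    # (or between the last anchor and the end of the video).
    bounds = i_frames + [len(frame_types)]
    intervals = [(s + 1, e) for s, e in zip(bounds, bounds[1:]) if e - s > 1]
    return Plan(i_frames, intervals)


def context_tokens(plan: Plan, vit_tokens_per_frame: int,
                   motion_tokens_per_interval: int) -> int:
    """Context tokens for a plan, under assumed (hypothetical) token costs."""
    return (len(plan.anchors) * vit_tokens_per_frame
            + len(plan.intervals) * motion_tokens_per_interval)


if __name__ == "__main__":
    plan = plan_streams("IPPPIPPPIPPP", max_anchors=2)
    print(plan.anchors, plan.intervals)
    # Illustrative token costs only; these numbers are not taken from the paper.
    print(context_tokens(plan, vit_tokens_per_frame=256,
                         motion_tokens_per_interval=16))
```

With these made-up costs, the token count grows with the number of anchors and intervals and not with the total frame count, which is the intuition behind allocating semantic and motion capacity separately. The token savings reported in the findings come from the paper's own configuration, which this sketch does not attempt to reproduce.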
