HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models
Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu

TL;DR
HoPE is a hybrid position embedding method with dynamic temporal scaling that improves long-context understanding in vision-language models, particularly on long-video tasks.
Contribution
This paper proposes HoPE, a novel hybrid position embedding strategy with dynamic temporal scaling, specifically designed to improve long-context modeling in vision-language models.
Findings
HoPE outperforms existing methods on four long video benchmarks.
The hybrid frequency allocation captures semantic similarity more reliably over long contexts.
Dynamic temporal scaling enhances robustness across diverse context lengths.
Abstract
Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose…
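To make the abstract's phrase "allocate different frequencies within RoPE to encode 3D positional information" concrete, here is a minimal NumPy sketch of vanilla RoPE and of one generic way to split its frequency pairs across a temporal axis and two spatial axes. It is not HoPE's allocation, which the truncated abstract does not describe. The function names, the `sections` parameter, and the contiguous t/x/y split are all assumptions made for illustration.

```python
import numpy as np

def inverse_frequencies(dim, base=10000.0):
    """Standard RoPE frequencies: one per pair of feature dimensions."""
    return base ** (-np.arange(0, dim, 2) / dim)

def rope_angles(positions, dim, base=10000.0):
    """Vanilla 1D RoPE: angle[p, i] = position[p] * inv_freq[i]."""
    return np.outer(positions, inverse_frequencies(dim, base))

def rope_3d_angles(t, x, y, dim, sections, base=10000.0):
    """Illustrative 3D allocation (an assumption, not the paper's scheme).

    `sections` = (n_t, n_x, n_y) says how many frequency pairs are
    assigned to the temporal, horizontal, and vertical index.
    """
    assert sum(sections) == dim // 2
    axis_of_pair = np.repeat(np.arange(3), sections)   # 0 = t, 1 = x, 2 = y
    coords = np.stack([t, x, y], axis=-1).astype(float)  # (n_tokens, 3)
    return coords[:, axis_of_pair] * inverse_frequencies(dim, base)

def rotate(vec, angles):
    """Rotate each pair (vec[2i], vec[2i+1]) by angles[i]."""
    v1, v2 = vec[..., 0::2], vec[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(vec, dtype=float)
    out[..., 0::2] = v1 * cos - v2 * sin
    out[..., 1::2] = v1 * sin + v2 * cos
    return out

# Small demo with random query/key vectors of dimension 8.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

def score_1d(m, n):
    """Attention logit between q at position m and k at position n."""
    a = rope_angles(np.array([m, n]), dim=8)
    return rotate(q, a[0]) @ rotate(k, a[1])

# Two video tokens with (t, x, y) indices; 2 pairs for t, 1 each for x and y.
t, x, y = np.array([0, 5]), np.array([1, 2]), np.array([3, 0])
angles_3d = rope_3d_angles(t, x, y, dim=8, sections=(2, 1, 1))
```

In this sketch the score between two rotated vectors depends only on the difference of their positions (per axis in the 3D case), so `score_1d(3, 1)` and `score_1d(10, 8)` agree up to floating-point error. Which axis gets which frequencies is exactly the allocation choice the abstract says existing methods make heuristically.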
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
