VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, Limin Wang

TL;DR
This paper introduces VideoChat-Flash, a long-context video MLLM built on Hierarchical video token Compression (HiCo). HiCo compresses video tokens to roughly 1/50 of their original count with almost no performance loss, which cuts computation enough for multimodal large language models to process long videos.
Contribution
The paper presents a novel hierarchical video token compression technique, a multi-stage training scheme, a new long-video dataset, and a benchmark, advancing long-context video modeling capabilities.
Findings
HiCo achieves a compression ratio of approximately 1/50 with almost no performance loss.
VideoChat-Flash outperforms existing models on long and short video benchmarks.
VideoChat-Flash reaches 99.1% accuracy on the NIAH benchmark at the 2B model scale.
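To make the token savings concrete, here is a small worked calculation. It uses the two per-frame token counts quoted in the peer reviews (196 tokens/frame for the baseline, 16 tokens/frame with HiCo). The frame count is an assumption chosen for illustration (one hour of video sampled at 1 frame per second), and `context_tokens` is a hypothetical helper, not something from the paper's code.

```python
def context_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens the language model sees for one video."""
    return num_frames * tokens_per_frame


# Assumption for illustration: one hour of video sampled at 1 frame per second.
FRAMES = 3600

baseline = context_tokens(FRAMES, 196)   # per-frame count quoted for the baseline
compressed = context_tokens(FRAMES, 16)  # per-frame count quoted for HiCo

print(baseline)               # 705600
print(compressed)             # 57600
print(baseline / compressed)  # 12.25
```

Under these assumptions the visual context shrinks from about 706k tokens to about 58k. Going from 196 to 16 tokens per frame is a reduction of roughly 12x, so the approximately 1/50 ratio is presumably measured against a larger per-frame token count that this summary does not state.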
Abstract
Long-context video modeling is critical for multimodal large language models (MLLMs), enabling them to process movies, online video streams, and so on. Despite recent advances, handling long videos remains challenging because extremely long video contexts are difficult to understand efficiently. This paper addresses the issue from four aspects: model architecture, training data, training strategy, and evaluation benchmark. First, we propose a novel Hierarchical video token Compression (HiCo) method, which leverages visual redundancy in long videos to compress long video context from Clip-level to Video-level, reducing the computation significantly while preserving essential details, achieving an extreme compression ratio of approximately 1/50 with almost no performance loss. Second, we introduce a multi-stage short-to-long learning scheme, a large-scale dataset of real-world long…
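The two-level idea in the abstract (compress within each clip first, then across the whole video) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's HiCo implementation: the clip-level step averages contiguous groups of tokens, and the video-level step keeps the tokens that differ most from the video mean. Both operators, and all function names and parameters, are hypothetical stand-ins chosen only to show the hierarchy.

```python
from typing import List

Token = List[float]


def mean_token(tokens: List[Token]) -> Token:
    """Element-wise mean of a non-empty list of equal-length tokens."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def compress_clip(clip_tokens: List[Token], keep: int) -> List[Token]:
    """Clip level (assumed operator): average contiguous groups down to `keep` tokens."""
    n = len(clip_tokens)
    if keep >= n:
        return list(clip_tokens)
    out = []
    for g in range(keep):
        start = g * n // keep
        end = (g + 1) * n // keep
        out.append(mean_token(clip_tokens[start:end]))
    return out


def compress_video(tokens: List[Token], keep_ratio: float) -> List[Token]:
    """Video level (assumed operator): keep the tokens farthest from the video mean."""
    k = max(1, int(len(tokens) * keep_ratio))
    center = mean_token(tokens)

    def dist(t: Token) -> float:
        return sum((a - b) ** 2 for a, b in zip(t, center))

    ranked = sorted(range(len(tokens)), key=lambda i: dist(tokens[i]), reverse=True)[:k]
    return [tokens[i] for i in sorted(ranked)]  # preserve temporal order


def hierarchical_compress(clips: List[List[Token]], tokens_per_clip: int,
                          keep_ratio: float) -> List[Token]:
    """Apply clip-level compression to every clip, then video-level compression."""
    clip_level = [tok for clip in clips for tok in compress_clip(clip, tokens_per_clip)]
    return compress_video(clip_level, keep_ratio)


# Toy input: 3 clips, 8 two-dimensional tokens each.
clips = [[[float(c), float(i)] for i in range(8)] for c in range(3)]
compressed = hierarchical_compress(clips, tokens_per_clip=4, keep_ratio=0.5)
print(len(compressed))  # 6
```

Here 24 input tokens become 12 after the clip-level step and 6 after the video-level step. The point of the hierarchy is that the second stage operates on an already shortened sequence, so redundancy across the whole video can be removed at much lower cost than it could on the raw tokens.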
Peer Reviews
Decision: ICLR 2026 Poster
1. The core strength is the HiCo compression framework. Achieving a 1/50 compression ratio (16 tokens/frame) while simultaneously achieving SOTA performance is a breakthrough for practical long-video MLLMs. The design is well-motivated. 2. The introduction of the Multi-Hop NIAH benchmark is a significant contribution. 3. The contributions of the paper are multifold, including model design, data creation, and new evaluation.
1. The biggest concern is the novelty of the proposed method. While the system as a whole is novel and highly effective, its constituent parts are largely clever integrations of existing ideas. 2. The ablation in Table 2, while extensive, presents a slightly confusing narrative for HiCo. The baseline (196 tokens/frame) achieves a 63.7 on MLVU. The "+ HiCo" model (16 tokens/frame) drops to 60.6 on MLVU. This seems to contradict the "almost no performance loss" claim from the abstract. The perform
This is a well-structured and thoroughly executed paper. It presents extensive and insightful experiments and analyses that may also benefit future work in video understanding. The proposed HiCo framework demonstrates strong and consistent performance across multiple benchmarks. Additionally, the authors provide a new training dataset and evaluation benchmark, which will further benefit the community and advance progress in video understanding. The comprehensive ablation studies clearly illustra
1. Some recently released baseline video models [1][2][3][4] are not included in the comparison. It would improve readability if the authors could explicitly indicate which model represents the previous state of the art in Figure 1. 2. It would be beneficial to include more advanced video models as baselines for comparison on the multi-hop NIAH task. Presenting the performance of closed-source models, such as Gemini 2.5 Pro, could also provide valuable context for readers to comprehend the compl
1. The topic is meaningful to the community: we need to find a better way to understand long videos with MLLMs efficiently. 2. The paper is well-organized and easy to follow. The experiments are solid and clear. 3. I really appreciate the authors for the LongVid dataset and benchmark. Open-sourced data is important in this community.
1. The paper claims “almost no performance loss” under extreme 1/50 compression, yet the quantitative analysis (Fig. 6b) provides limited breakdown across different downstream tasks. It would be helpful if they can provide the results on temporal grounding or motion-sensitive tasks besides the three benchmarks. 2. The claimed efficiency benefits (1/50 token reduction) are shown in FLOP counts but not in end-to-end wall-clock latency or GPU memory consumption during real long video inference. It
Code & Models
- 🤗 OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448 · model · 741 dl · ♡ 27
- 🤗 OpenGVLab/VideoChat-Flash-Qwen2-7B_res224 · model · 119 dl · ♡ 7
- 🤗 OpenGVLab/VideoChat-Flash-Qwen2-7B_res448 · model · 1.0k dl · ♡ 13
- 🤗 OpenGVLab/InternVL_2_5_HiCo_R16 · model · 210 dl · ♡ 6
- 🤗 OpenGVLab/InternVL_2_5_HiCo_R64 · model · 85 dl · ♡ 3
- 🤗 OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B · model · 1.8k dl · ♡ 7
- 🤗 OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224 · model · 41 dl · ♡ 2
- 🤗 FriendliAI/InternVL_2_5_HiCo_R16 · model · 6 dl · ♡ 1
- 🤗 MInference/videochat · model · 2 dl
Taxonomy
Topics: Advanced Data Compression Techniques · Video Coding and Compression Technologies · Video Analysis and Summarization
