Long Context Transfer from Language to Vision
Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu

TL;DR
This paper extends the context length of the language backbone of a large multimodal model so that it can process extremely long videos without any video training. It reports state-of-the-art results on Video-MME among 7B-scale models and introduces V-NIAH, a synthetic benchmark for long video understanding.
Contribution
It proposes long context transfer: extrapolating the context length of the language backbone lets the multimodal model handle far more visual tokens. It also develops V-NIAH, a synthetic benchmark for evaluating long video understanding.
Findings
Long Video Assistant (LongVA) processes over 200K visual tokens.
Achieves state-of-the-art on Video-MME with 7B-scale models.
Enables comprehension of 2000 frames without any video training.
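The frame and token counts in these findings imply a per-frame token budget. The sketch below works through that arithmetic. The helper names are hypothetical, and the value of 144 tokens per frame is an assumption chosen for illustration, not a figure stated in this summary.

```python
# Back-of-the-envelope token budget for long-video input.
# ASSUMPTION: 144 visual tokens per frame is an illustrative value,
# not a figure taken from this summary.
ASSUMED_TOKENS_PER_FRAME = 144


def tokens_per_frame(total_visual_tokens: int, num_frames: int) -> float:
    """Average number of visual tokens contributed by each frame."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return total_visual_tokens / num_frames


def max_frames(context_tokens: int, per_frame: int,
               reserved_text_tokens: int = 0) -> int:
    """Largest whole number of frames that fits in the context window.

    reserved_text_tokens is the part of the window set aside for the
    prompt and the answer (hypothetical parameter).
    """
    if per_frame <= 0:
        raise ValueError("per_frame must be positive")
    usable = max(context_tokens - reserved_text_tokens, 0)
    return usable // per_frame


if __name__ == "__main__":
    # 2000 frames at exactly 200K tokens gives the lower bound per frame.
    print(tokens_per_frame(200_000, 2000))
    # Frames that fit in a 288K-token window under the assumed budget.
    print(max_frames(288_000, ASSUMED_TOKENS_PER_FRAME))
```

Since the findings say "over 200K", 100 tokens per frame is a lower bound. The point of the sketch is that the frame count grows linearly with the context window once the per-frame cost is fixed.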
Abstract
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual…
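The needle-in-a-haystack idea behind V-NIAH can be illustrated with a minimal sketch: a single "needle" frame is placed at a chosen relative depth inside a long sequence of unrelated "haystack" frames, and the model is then asked a question only the needle can answer. The code below is not the authors' implementation. The function names, the use of plain strings as stand-ins for frames, and the rounding rule for the insertion index are all assumptions.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def insert_needle(haystack: Sequence[T], needle: T,
                  depth: float) -> Tuple[List[T], int]:
    """Insert `needle` at a relative depth in [0, 1] of `haystack`.

    Returns the new frame list and the index where the needle landed.
    ASSUMPTION: the index is round(depth * len(haystack)); the real
    benchmark may place the needle differently.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(haystack))
    frames = list(haystack)
    frames.insert(index, needle)
    return frames, index


def build_grid(frame_counts: Sequence[int],
               depths: Sequence[float]) -> List[Tuple[int, float]]:
    """Enumerate every (haystack length, needle depth) test case."""
    return [(n, d) for n in frame_counts for d in depths]


if __name__ == "__main__":
    # Strings stand in for video frames in this toy example.
    frames, idx = insert_needle(["hay"] * 10, "NEEDLE", depth=0.5)
    print(idx, len(frames))
    for n, d in build_grid([100, 200], [0.0, 0.5, 1.0]):
        print(f"haystack of {n} frames, needle at depth {d}")
```

Sweeping both the haystack length and the needle depth shows where retrieval starts to fail as the visual context grows, which is what a test of this kind is meant to measure.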
Peer Reviews
Decision · Submitted to ICLR 2025
1. The proposed method achieves the state of the art on multiple video understanding benchmarks. 2. The proposed method is simple and effective. The continuous pretraining on a large context window only needs text data, while the multimodal alignment stage only requires image-text data. The UniRes encoding is a straightforward extension of AnyRes to videos. Nevertheless, the model seems to generalize surprisingly well to videos. 3. The Needle-in-a-Haystack test seems to be novel in the context of l
1. Despite the impressive empirical results, there is not much explanation. In particular, I am interested in understanding why long-context pretraining on text-only data could transfer to long video understanding. 2. It is not clear how the image-text alignment is done (e.g., training data and losses). Most of the content in Sec. 3.2 is about UniRes, which is relatively straightforward. 3. The number of grids during training is limited to 49 (Line 205), which is effectively equ
(1) Explored a method to implement long-context video MLLM from the perspective of long-context text training. The idea is very interesting. (2) The evaluation of long-context text retrieval (NIAH) and video retrieval (V-NIAH) is very complete and well presented. (3) The experimental results are rich. LongVA was evaluated on different multi-modal benchmarks and achieved SOTA performance on Video-MME and MLVU.
(1) The results of LongVA on some benchmarks, such as ActivityNetQA and VideoChatGPT, are not ideal: using more frames does not yield better video understanding. This raises the suspicion that LongVA can only retrieve information from long-context videos but lacks the ability to understand them. (2) The paper does not give detailed training details, such as the training details of the long-context LLM of pure text, the training details
1. Novel Perspective: The paper presents a fresh perspective on handling long video sequences in LMMs by identifying the language model's context length as the primary bottleneck rather than following the conventional approach of reducing visual tokens. This reframing of the problem leads to a novel solution pathway. 2. Technical Innovation: The discovery of the "long context transfer" phenomenon represents a significant contribution to the field. The ability to transfer extended context capabi
1. Visual Encoding Ablations: The paper lacks comprehensive analysis of different visual encoding schemes' impact on performance. Specifically:
- The comparison between UniRes, AnyRes, and Higher-AnyRes is not thoroughly explored.
- The effect of including or excluding base image encoding is not clearly demonstrated.
- The impact of different layout configurations for extended images (1 x N, N x 1, etc.) is not investigated.
These ablation studies would provide valuable insights into th
Taxonomy
Topics: Speech and dialogue systems · Robotics and Automated Systems
