Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Zhaowei Wang, Lishu Luo, Haodong Duan, Weiwei Liu, Sijin Wu, Ji Luo, Shen Yan, Shuai Peng, Sihang Yuan, Chaoyi Huang, Yi Lin, and Yangqiu Song

TL;DR
This paper systematically studies long-context training for vision-language models, revealing key insights and introducing MMProLong, which enhances long-document understanding and generalizes to various multimodal tasks.
Contribution
It provides a practical long-context training recipe and demonstrates how to effectively extend vision-language models beyond 128K context length.
Findings
Balanced data distribution improves long-context generalization.
Retrieval is the main bottleneck in long-context modeling.
Long-document VQA preserves short-context capabilities.
Abstract
Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
