How to Train Your Long-Context Visual Document Model
Austin Veselka

TL;DR
This paper systematically studies training long-context vision-language models up to 344K context length, achieving state-of-the-art results on long-document visual question answering benchmarks.
Contribution
It provides reproducible training recipes, extensive evaluations, and novel insights into long-context training, including data pipelines and transfer capabilities between visual and text domains.
Findings
Training on context lengths that match the evaluation context lengths outperforms training on longer contexts.
Page indices significantly boost long-document understanding.
Synthetic data pipelines enable effective self-improvement.
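The page-index finding can be illustrated with a minimal sketch. The summary here does not say how the indices are formatted, so the "Page N:" label, the content-part dictionaries, and the helper names `interleave_page_indices` and `build_prompt` are all assumptions made for illustration, not the paper's recipe.

```python
from typing import Dict, List


def interleave_page_indices(pages: List[str], start: int = 1) -> List[Dict[str, str]]:
    """Put a text label with the page number before each page image.

    `pages` holds image references (paths or URLs). The "Page N:" wording
    is an assumed format, not one confirmed by the paper.
    """
    parts: List[Dict[str, str]] = []
    for offset, image_ref in enumerate(pages):
        parts.append({"type": "text", "text": f"Page {start + offset}:"})
        parts.append({"type": "image", "image": image_ref})
    return parts


def build_prompt(pages: List[str], question: str) -> List[Dict[str, str]]:
    """Build one user turn: indexed pages followed by the question."""
    parts = interleave_page_indices(pages)
    parts.append({"type": "text", "text": question})
    return parts


if __name__ == "__main__":
    prompt = build_prompt(["p1.png", "p2.png"], "On which page is Table 3?")
    for part in prompt:
        print(part)
```

Since the finding covers both training and evaluation, the same labeling would presumably be applied in both settings, so the model sees identical page markers in each.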
Abstract
We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several strong models of this kind are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. To bridge this gap, we systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context (LC) evaluations and ablations, and achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales. In addition, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to…
