How to Train Your Long-Context Visual Document Model

Austin Veselka

arXiv:2602.15257 · cs.CV · April 1, 2026

TL;DR

This paper systematically studies training long-context vision-language models up to 344K context length, achieving state-of-the-art results on long-document visual question answering benchmarks.

Contribution

It provides reproducible training recipes, extensive evaluations, and novel insights into long-context training, including data pipelines and transfer capabilities between visual and text domains.

Findings

01

Training with context lengths matching evaluation improves performance.

02

Page indices significantly boost long-document understanding.

03

Synthetic data pipelines enable effective self-improvement.
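
Finding 02 can be illustrated with a small sketch of what "page indices" might look like at the input level: a short text label placed before each page image so the model can refer to pages by number. The summary here does not specify the exact label format or message schema the paper uses, so the function name `interleave_page_indices`, the `"Page i of N:"` wording, and the chat-style dictionary layout below are all assumptions made for illustration.

```python
from typing import Dict, List


def interleave_page_indices(page_images: List[str], question: str) -> List[Dict[str, str]]:
    """Build a chat-style content list with a text label before each page image.

    Hypothetical helper: the label format and dict schema are illustrative
    assumptions, not the paper's confirmed input format.
    """
    content: List[Dict[str, str]] = []
    total = len(page_images)
    for i, image_path in enumerate(page_images, start=1):
        # The page index is plain text inserted directly before its page image.
        content.append({"type": "text", "text": f"Page {i} of {total}:"})
        content.append({"type": "image", "image": image_path})
    # The question follows the full document.
    content.append({"type": "text", "text": question})
    return content


if __name__ == "__main__":
    demo = interleave_page_indices(["p1.png", "p2.png"], "On which page is Table 3?")
    for item in demo:
        print(item)
```

Under this assumption, the same labeling would be applied at both training and evaluation time, matching the finding's wording that page indices are used when "training and evaluating".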

Abstract

We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several strong models of this kind are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. To bridge this gap, we systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context (LC) evaluations and ablations, and achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales. Our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to…

Code & Models

Datasets

lightonai/MMLBD-C
dataset · 294 downloads
