Internalized Reasoning for Long-Context Visual Document Understanding

Austin Veselka

arXiv:2604.02371·cs.CV·April 6, 2026

TL;DR

This paper presents a synthetic data pipeline that adds reasoning to visual long-document understanding, improving benchmark performance while reducing the number of output tokens.

Contribution

The paper introduces a synthetic reasoning data pipeline and a method for internalizing the learned reasoning through low-strength model merging, which together improve performance on long-document benchmarks.
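The abstract names low-strength model merging as the internalization step but this page does not give the formula. As a minimal sketch, assuming the merge is a linear interpolation between a base checkpoint and its reasoning-tuned counterpart, it could look like the following. The function name `merge_low_strength`, the default `alpha`, and the choice of which two checkpoints are merged are all illustrative assumptions, not the paper's stated recipe.

```python
def merge_low_strength(base, tuned, alpha=0.25):
    """Linearly interpolate two checkpoints: base + alpha * (tuned - base).

    `base` and `tuned` map parameter names to flat lists of floats.
    `alpha` is a hypothetical merge strength; a small value keeps the
    merged model close to `base`.
    """
    if base.keys() != tuned.keys():
        raise ValueError("checkpoints must share parameter names")
    merged = {}
    for name, base_values in base.items():
        tuned_values = tuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            b + alpha * (t - b) for b, t in zip(base_values, tuned_values)
        ]
    return merged


# Toy checkpoints with one parameter each (assumed values).
base = {"w": [0.0, 2.0]}
tuned = {"w": [4.0, 2.0]}
merged = merge_low_strength(base, tuned, alpha=0.25)
```

With `alpha=0` the result equals the base checkpoint and with `alpha=1` it equals the tuned one, so a low strength moves the weights only part of the way toward the reasoning-tuned model.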

Findings

01. Achieved 58.3 on MMLongBenchDoc with Qwen3 VL 32B, surpassing the 7× larger Qwen3 VL 235B A22B (57.0).

02. Synthetic reasoning outperforms distillation by 3.8 points on MMLBD-C.

03. Internalized reasoning reduces output tokens by a factor of 12.4.

Abstract

Visual long-document understanding is critical for enterprise, legal, and scientific applications, yet the best-performing open recipes have not explored reasoning, a capability that has driven leaps in math and code performance. We introduce a synthetic data pipeline for reasoning in long-document understanding that generates thinking traces by scoring each page for question relevance, extracting textual evidence, and ordering it from most to least relevant. We apply SFT to the resulting traces within <think> tags, gated by a <cot> control token, and the resulting reasoning capability is internalized via low-strength model merging. We study Qwen3 VL 32B and Mistral Small 3.1 24B. With Qwen3 VL, we achieve 58.3 on MMLongBenchDoc, surpassing the 7× larger Qwen3 VL 235B A22B (57.0). With Mistral, we show that synthetic reasoning outperforms distillation from the…
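The trace construction described in the abstract (score each page, extract evidence, order it from most to least relevant, wrap it in <think> tags, gate with <cot>) can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' pipeline: the `PageEvidence` type, the `min_relevance` cutoff, the "Page N:" line format, and the placement of `<cot>` at the start of the prompt are all hypothetical, and the relevance scores are taken as given rather than computed.

```python
from dataclasses import dataclass


@dataclass
class PageEvidence:
    page: int         # 1-based page number
    relevance: float  # assumed relevance score; higher means more relevant
    evidence: str     # text extracted from the page; empty if none


def build_trace(pages, min_relevance=0.5):
    """Keep relevant pages, order by descending relevance, wrap in <think>."""
    kept = [p for p in pages if p.relevance >= min_relevance and p.evidence]
    kept.sort(key=lambda p: p.relevance, reverse=True)
    lines = [f"Page {p.page}: {p.evidence}" for p in kept]
    return "<think>\n" + "\n".join(lines) + "\n</think>"


def build_sft_example(question, pages, answer, use_cot=True):
    """Build one training pair; the <cot> token gates whether a trace is emitted."""
    prompt = ("<cot> " if use_cot else "") + question
    target = (build_trace(pages) + "\n" if use_cot else "") + answer
    return {"prompt": prompt, "target": target}


# Toy document with made-up scores and evidence.
pages = [
    PageEvidence(page=1, relevance=0.2, evidence="Table of contents"),
    PageEvidence(page=2, relevance=0.7, evidence="Costs were 4M"),
    PageEvidence(page=3, relevance=0.9, evidence="Revenue was 10M"),
]
with_cot = build_sft_example("What was the profit?", pages, "6M")
without_cot = build_sft_example("What was the profit?", pages, "6M", use_cot=False)
```

Training on both gated and ungated pairs is one way a single control token could switch the trace on or off at inference time; whether the paper mixes the two in this way is not stated on this page.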

Code & Models

Models

lightonai/OriOn-Qwen-SR1 (Hugging Face model)
