Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding
Peter El Hachem, Ahmed Nassar, A. Said Gurbuz, Christoph Auer, Peter W. J. Staar

TL;DR
This paper makes vision-language document parsers more robust to unfamiliar layouts by resolving layout entities up front with a lightweight RT-DETR detector and injecting the result into the prompt, which substantially improves out-of-distribution performance.
Contribution
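The authors propose a prompt-level layout prior that is serialized in the parser's native DocTags vocabulary, so it shares the decoder's generation space. The full page image stays in view as a fallback, and decoding failures are reduced without altering the base model architecture.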
Findings
Markdown F1 improved from 0.37 to 0.92 on a 10k-page benchmark.
Table TEDS increased from 0.01 to 0.36 on OmniDocBench.
Decoding failures dropped across multiple industrial domains.
Abstract
Vision-Language Models (VLMs) parse documents end-to-end but frequently break down on layouts unlike those seen in training. We attribute this to a two-hop bottleneck: before the decoder can extract content (Hop 2), it must first classify and localize the enclosing layout entity (Hop 1), and when the first hop fails the second collapses into omissions, malformed structure, or autoregressive repetition. We pre-resolve Hop 1 outside the decoder by running a lightweight RT-DETR detector, serializing its outputs in the parser's native DocTags vocabulary, and injecting them into the prompt alongside the full page image. Unlike analyze-then-parse approaches that crop the page, or prior prompt-level priors written in plain text, our prior shares the decoder's generation space and leaves the global image in view as a fallback when detections are noisy. On a 10k-page structural…
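The pre-resolution step described in the abstract (detect layout entities, serialize them in the parser's tag vocabulary, inject them into the prompt) can be illustrated with a minimal sketch. It is not the authors' implementation. The `Detection` type, the 500-bin `<loc_N>` grid, the score threshold, the top-to-bottom ordering, and the `<layout_prior>` wrapper are all assumptions made for illustration, and the detector itself is replaced by hand-written boxes.

```python
from dataclasses import dataclass

GRID = 500  # assumed number of location bins; not taken from the paper


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: class label, pixel box, confidence."""
    label: str
    box: tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels
    score: float


def loc_token(value: float, extent: float) -> str:
    """Quantize one pixel coordinate into a <loc_N> token on the assumed grid."""
    bin_index = int(value * GRID / extent)
    bin_index = max(0, min(GRID - 1, bin_index))  # clamp to valid bins
    return f"<loc_{bin_index}>"


def serialize_layout(detections, width, height, min_score=0.5):
    """Turn detections into a tag string: one empty element per layout entity."""
    kept = [d for d in detections if d.score >= min_score]
    # Assumed reading order: top-to-bottom, then left-to-right.
    kept.sort(key=lambda d: (d.box[1], d.box[0]))
    parts = []
    for d in kept:
        x0, y0, x1, y1 = d.box
        locs = (loc_token(x0, width) + loc_token(y0, height)
                + loc_token(x1, width) + loc_token(y1, height))
        parts.append(f"<{d.label}>{locs}</{d.label}>")
    return "".join(parts)


def build_prompt(instruction: str, layout_prior: str) -> str:
    """Append the serialized prior to the instruction (wrapper tag is hypothetical)."""
    return f"{instruction}\n<layout_prior>{layout_prior}</layout_prior>"


if __name__ == "__main__":
    page_detections = [
        Detection("table", (100, 400, 900, 800), 0.9),
        Detection("text", (100, 100, 900, 300), 0.8),
        Detection("picture", (0, 0, 10, 10), 0.2),  # dropped: below threshold
    ]
    prior = serialize_layout(page_detections, width=1000, height=1000)
    print(build_prompt("Convert this page to DocTags.", prior))
```

In this sketch the page image would still be passed to the model unchanged, alongside the prompt. The prior only names and locates entities, leaving content extraction to the decoder.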
