VDInstruct: Zero-Shot Key Information Extraction via Content-Aware Vision Tokenization
Son Nguyen, Giang Nguyen, Hung Dao, Thao Do, Daeyoung Kim

TL;DR
VDInstruct introduces a content-aware tokenization approach for zero-shot key information extraction from visual documents, achieving state-of-the-art accuracy while significantly reducing computational redundancy.
Contribution
The paper proposes a content-aware vision tokenization method and a three-stage training paradigm, improving both efficiency and accuracy on document understanding tasks.
Findings
Achieves SOTA results on KIE benchmarks.
Reduces image tokens by approximately 3.6x.
Surpasses strong baselines in zero-shot evaluations.
Abstract
Key Information Extraction (KIE) underpins the understanding of visual documents (e.g., receipts and contracts) by extracting precise semantic content and accurately capturing spatial structure. Yet existing multimodal large language models (MLLMs) often perform poorly on dense documents and rely on vision tokenization approaches that scale with image size, leading to redundant computation and memory inefficiency. To address these challenges, we introduce VDInstruct, an MLLM that separates spatial region detection from semantic feature extraction. Central to our model is a content-aware tokenization strategy: rather than fragmenting the entire image uniformly, it generates tokens in proportion to document complexity, preserving critical structure while eliminating wasted tokens. Leveraging a three-stage training paradigm, our model achieves state-of-the-art (SOTA) results on KIE…
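To make the abstract's contrast concrete, the sketch below compares how token counts grow under two schemes: tiling the whole image into fixed-size patches (count depends only on image size) versus emitting tokens per detected content region (count depends on how much content the page holds). This is an illustration only, not the authors' implementation; the `Region` type, both function names, the patch size of 14, and the `tokens_per_region` parameter are all assumptions introduced here.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Region:
    """A detected content region in pixels (hypothetical type)."""
    width: int
    height: int


def uniform_token_count(image_w: int, image_h: int, patch: int = 14) -> int:
    """Tokens produced by tiling the entire image into patch x patch cells.

    The count depends only on image size, not on what the page contains.
    """
    return ceil(image_w / patch) * ceil(image_h / patch)


def content_aware_token_count(regions: list[Region], tokens_per_region: int = 1) -> int:
    """Tokens produced when each detected region yields a fixed token budget.

    Assumption: a constant number of tokens per region. The count grows with
    the number of regions, so a sparse page costs less than a dense one.
    """
    return len(regions) * tokens_per_region


if __name__ == "__main__":
    sparse_page = [Region(200, 30)] * 12    # e.g. a short receipt
    dense_page = [Region(200, 30)] * 300    # e.g. a dense contract page

    # Same image size for both pages, so the uniform count is identical.
    print("uniform:", uniform_token_count(1120, 1120))
    print("sparse, content-aware:", content_aware_token_count(sparse_page))
    print("dense, content-aware:", content_aware_token_count(dense_page))
```

Under these assumptions, the uniform count is the same for the sparse and dense page because both share one image size, while the content-aware count follows the number of regions. The actual savings reported for VDInstruct depend on its detector and feature extractor, which this sketch does not model.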
Taxonomy
Topics: Advanced Text Analysis Techniques
