VDInstruct: Zero-Shot Key Information Extraction via Content-Aware Vision Tokenization

Son Nguyen; Giang Nguyen; Hung Dao; Thao Do; Daeyoung Kim

arXiv:2507.09531 · cs.CV · July 15, 2025

TL;DR

VDInstruct introduces a content-aware tokenization approach for zero-shot key information extraction from visual documents, achieving state-of-the-art accuracy while significantly reducing computational redundancy.

Contribution

The paper proposes a content-aware tokenization method and a three-stage training paradigm that together improve efficiency and accuracy on document understanding tasks.

Findings

01. Achieves SOTA results on KIE benchmarks.
02. Reduces image tokens by approximately 3.6x.
03. Surpasses strong baselines in zero-shot evaluations.

Abstract

Key Information Extraction (KIE) underpins the understanding of visual documents (e.g., receipts and contracts) by extracting precise semantic content and accurately capturing spatial structure. Yet existing multimodal large language models (MLLMs) often perform poorly on dense documents and rely on vision tokenization approaches that scale with image size, leading to redundant computation and memory inefficiency. To address these challenges, we introduce VDInstruct, an MLLM that separates spatial region detection from semantic feature extraction. Central to our model is a content-aware tokenization strategy: rather than fragmenting the entire image uniformly, it generates tokens in proportion to document complexity, preserving critical structure while eliminating wasted tokens. Leveraging a three-stage training paradigm, our model achieves state-of-the-art (SOTA) results on KIE…
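
The contrast the abstract draws between tokenization that scales with image size and tokenization that scales with document complexity can be illustrated with a toy token count. This is a minimal sketch and not the VDInstruct implementation: the `Region` type, both function names, the patch size of 14, and the fixed `tokens_per_region` budget are all hypothetical choices made for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical detected content region, in pixel coordinates."""
    x0: int
    y0: int
    x1: int
    y1: int


def uniform_token_count(width: int, height: int, patch: int = 14) -> int:
    """Fixed patch grid: the count grows with image area, whatever the content."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def content_aware_token_count(regions: list[Region], tokens_per_region: int = 1) -> int:
    """Region-based budget (assumed): the count grows with the number of regions."""
    return len(regions) * tokens_per_region


if __name__ == "__main__":
    # Two pages of identical size: one nearly empty, one densely filled.
    width, height = 1000, 1400
    sparse_page = [Region(50, 50, 400, 80), Region(50, 100, 300, 130)]
    dense_page = [Region(50, 50 + 30 * i, 900, 75 + 30 * i) for i in range(40)]

    # The uniform grid yields the same count for both pages.
    print("uniform grid:", uniform_token_count(width, height))
    # The region-based count differs with how much content each page holds.
    print("sparse page:", content_aware_token_count(sparse_page, tokens_per_region=4))
    print("dense page:", content_aware_token_count(dense_page, tokens_per_region=4))
```

In this toy setting the uniform count depends only on the page dimensions, while the region-based count tracks how much content was detected. How VDInstruct actually detects regions and assigns tokens to them is not specified in the text above.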

Taxonomy

Topics: Advanced Text Analysis Techniques