Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding

Kun Yin; Yunfei Wu; Bing Liu; Zhongpeng Cai; Xiaotian Li; Huang Chen; Xin Li; Haoyu Cao; Yinsong Liu; Deqiang Jiang; Xing Sun; Yunsheng Wu; Qianyu Li; Antai Guo; Yanzhen Liao; Yanqiu Qu; Haodong Lin; Chengxu He; Shuangyin Liu

arXiv:2601.20430 · cs.CV · January 29, 2026

TL;DR

Youtu-Parsing pairs a dynamic-resolution Vision Transformer encoder with a prompt-guided Youtu-LLM-2B language model and adds a high-parallelism decoding strategy, reporting a 5-11x decoding speedup over traditional autoregressive decoding and state-of-the-art results on OmniDocBench and olmOCR-bench.

Contribution

The paper presents a novel high-parallelism decoding strategy for document parsing, combining token and query parallelism with a versatile architecture for improved speed and robustness.

Findings

1. 5-11x faster decoding than traditional autoregressive decoding
2. State-of-the-art performance on OmniDocBench and olmOCR-bench
3. Robust handling of multilingual, handwritten, and rare characters

Abstract

This paper presents Youtu-Parsing, an efficient and versatile document parsing model designed for high-performance content extraction. The architecture employs a native Vision Transformer (ViT) featuring a dynamic-resolution visual encoder to extract shared document features, coupled with a prompt-guided Youtu-LLM-2B language model for layout analysis and region-prompted decoding. Leveraging this decoupled and feature-reusable framework, we introduce a high-parallelism decoding strategy comprising two core components: token parallelism and query parallelism. The token parallelism strategy concurrently generates up to 64 candidate tokens per inference step, which are subsequently validated through a verification mechanism. This approach yields a 5-11x speedup over traditional autoregressive decoding and is particularly well-suited for highly structured scenarios, such as table…
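The draft-then-verify idea behind token parallelism can be illustrated with a small sketch. The abstract does not describe the verification mechanism, so the version below assumes one common scheme: accept the longest prefix of drafted tokens that the verifier agrees with, then take the verifier's token at the first disagreement. The names `propose`, `verify`, `accept_prefix`, `parallel_decode`, and the toy reference sequence are hypothetical and are not taken from the paper; only the 64-candidate budget comes from the abstract.

```python
from typing import Callable, List, Tuple

MAX_CANDIDATES = 64  # per-step candidate budget quoted in the abstract

Proposer = Callable[[List[int], int], List[int]]
Verifier = Callable[[List[int], List[int]], List[int]]


def accept_prefix(candidates: List[int], verified: List[int]) -> List[int]:
    """Keep drafted tokens up to the first disagreement with the verifier.

    At the first mismatch the verifier's token is used instead, so every
    step makes progress by at least one token. (Assumed rule, not the
    paper's documented mechanism.)
    """
    accepted: List[int] = []
    for cand, ref in zip(candidates, verified):
        if cand != ref:
            accepted.append(ref)
            break
        accepted.append(cand)
    return accepted


def parallel_decode(
    propose: Proposer, verify: Verifier, target_len: int
) -> Tuple[List[int], int]:
    """Decode target_len tokens, drafting up to MAX_CANDIDATES per step."""
    output: List[int] = []
    steps = 0
    while len(output) < target_len:
        budget = min(MAX_CANDIDATES, target_len - len(output))
        candidates = propose(output, budget)
        verified = verify(output, candidates)
        accepted = accept_prefix(candidates, verified)
        if not accepted:
            raise RuntimeError("no token accepted; decoding cannot progress")
        output.extend(accepted)
        steps += 1
    return output, steps


# Toy stand-ins for the model: the "true" output is 0..199, and the
# drafter gets every 50th position wrong.
REFERENCE = list(range(200))


def toy_propose(prefix: List[int], budget: int) -> List[int]:
    start = len(prefix)
    draft = REFERENCE[start:start + budget]
    return [-1 if (start + i) % 50 == 49 else tok for i, tok in enumerate(draft)]


def toy_verify(prefix: List[int], candidates: List[int]) -> List[int]:
    start = len(prefix)
    return REFERENCE[start:start + len(candidates)]


tokens, steps = parallel_decode(toy_propose, toy_verify, len(REFERENCE))
print(f"decoded {len(tokens)} tokens in {steps} steps")
```

In this toy setup the loop finishes the sequence in far fewer iterations than one-token-at-a-time decoding would need. The ratio depends entirely on the made-up error pattern and says nothing about the paper's measured speedup.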

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Topic Modeling