Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

Daxiang Dong; Mingming Zheng; Dong Xu; Chunhua Luo; Bairong Zhuang; Yuxuan Li; Ruoyun He; Haoran Wang; Wenyu Zhang; Wenbo Wang; Yicheng Wang; Xue Xiong; Ayong Zheng; Xiaoying Zuo; Ziwei Ou; Jingnan Gu; Quanhao Guo; Jianmin Wu; Dawei Yin; Dou Shen

arXiv:2603.13398 · cs.CV · March 17, 2026

TL;DR

Qianfan-OCR is a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding in a single architecture. It converts document images directly to Markdown and supports diverse prompt-driven document tasks.

Contribution

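It introduces Layout-as-Thought, an optional thinking phase, triggered by special think tokens, that generates structured layout representations (bounding boxes, element types, and reading order) before the final output. This recovers layout grounding in end-to-end OCR models and improves accuracy on complex layouts.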

Findings

01 Ranks first among end-to-end models on OmniDocBench v1.5 with a score of 93.12

02 Ranks first among end-to-end models on OlmOCR Bench with a score of 79.8

03 Outperforms several models on key information extraction benchmarks

Abstract

We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA,…
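
As a rough illustration of the Layout-as-Thought idea described in the abstract, the sketch below separates an optional layout phase from the final Markdown in a model response. The abstract does not specify the actual token names or the serialization of the layout, so the `<think>`/`</think>` delimiters, the `order type [x0,y0,x1,y1]` line format, and the names `LayoutElement` and `split_response` are all assumptions made for this example, not the paper's format.

```python
import re
from dataclasses import dataclass


@dataclass
class LayoutElement:
    order: int                        # position in the reading order
    kind: str                         # element type, e.g. "title" or "table"
    bbox: tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels


# Hypothetical line format: "<order> <type> [x0,y0,x1,y1]"
LINE_RE = re.compile(r"^(\d+)\s+(\w+)\s+\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]$")


def split_response(text: str) -> tuple[list[LayoutElement], str]:
    """Split a response into (layout elements, final Markdown).

    The layout phase is optional: when no think block is present,
    the whole response is treated as the final output.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return [], text.strip()

    elements = []
    for line in match.group(1).strip().splitlines():
        m = LINE_RE.match(line.strip())
        if m:  # silently skip lines that do not fit the assumed format
            bbox = tuple(int(v) for v in m.groups()[2:])
            elements.append(LayoutElement(int(m.group(1)), m.group(2), bbox))

    elements.sort(key=lambda e: e.order)  # restore reading order
    return elements, text[match.end():].strip()


# Made-up response used only to exercise the parser.
sample = (
    "<think>\n"
    "2 table [40,120,560,400]\n"
    "1 title [40,30,560,80]\n"
    "</think>\n"
    "# Report\n\n| a | b |\n|---|---|\n| 1 | 2 |"
)
layout, markdown = split_response(sample)
```

Keeping the layout in a structured form like this is what would let a caller use the bounding boxes for grounding (for example, highlighting a table region) while still consuming the Markdown as the final result.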

Taxonomy

Topics: Handwritten Text Recognition Techniques · Topic Modeling · Image Retrieval and Classification Techniques