Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, Ruoyun He, Haoran Wang, Wenyu Zhang, Wenbo Wang, Yicheng Wang, Xue Xiong, Ayong Zheng, Xiaoying Zuo, Ziwei Ou, Jingnan Gu, Quanhao Guo, Jianmin Wu, Dawei Yin, Dou Shen

TL;DR
Qianfan-OCR is a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding. It converts images directly to Markdown and supports diverse prompt-driven document tasks.
Contribution
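It introduces Layout-as-Thought, an optional thinking phase, triggered by special think tokens, that generates bounding boxes, element types, and reading order before the final output. This recovers layout grounding in end-to-end OCR models and improves accuracy on complex layouts.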
Findings
- Ranks first among end-to-end models on OmniDocBench v1.5 with a score of 93.12
- Ranks first among end-to-end models on OlmOCR Bench with a score of 79.8
- Outperforms several models on key information extraction benchmarks
Abstract
We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA,…
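To make the Layout-as-Thought idea concrete, the sketch below shows how a client might separate an optional layout "thought" from the final Markdown. It is illustrative only: the `<think>`/`</think>` delimiters, the JSON encoding of the layout, the field names `order`, `type`, and `bbox`, and the helper `split_response` are all assumptions, since the abstract says only that special think tokens trigger a phase producing bounding boxes, element types, and reading order.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical delimiters: the abstract only mentions "special think tokens".
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


@dataclass
class LayoutElement:
    order: int   # position in reading order
    kind: str    # element type, e.g. "title" or "table"
    bbox: tuple  # (x0, y0, x1, y1); this coordinate convention is assumed


def split_response(text):
    """Separate the optional layout 'thought' from the final output."""
    match = THINK_RE.search(text)
    if match is None:
        # The thinking phase is optional, so there may be no layout at all.
        return [], text.strip()
    # Assumed encoding: a JSON array of objects inside the think span.
    elements = [
        LayoutElement(item["order"], item["type"], tuple(item["bbox"]))
        for item in json.loads(match.group(1))
    ]
    elements.sort(key=lambda e: e.order)  # restore reading order
    return elements, text[match.end():].strip()


# Made-up response text, not real model output.
sample = (
    '<think>[{"order": 2, "type": "text", "bbox": [40, 120, 560, 300]},'
    ' {"order": 1, "type": "title", "bbox": [40, 30, 560, 90]}]</think>\n'
    "# Report\n\nBody text."
)
layout, markdown = split_response(sample)
```

Here `layout` holds the elements sorted by reading order and `markdown` holds the final output; a response without a think span comes back with an empty layout list, mirroring the optional nature of the phase.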
Code & Models
- 🤗 baidu/Qianfan-OCR · model · 18k downloads · 755 likes
- 🤗 jason1966/Qianfan-OCR-MLX-4bit · model · 373 downloads · 2 likes
- 🤗 Abhiray/Qianfan-OCR-GGUF · model · 565 downloads
- 🤗 singersalt/Qianfan-OCR · model · 9 downloads
- 🤗 FriskyFennec/Qianfan-OCR-8bit · model · 92 downloads
- 🤗 jackjohn001/Qianfan-OCR · model · 9 downloads
- 🤗 zhoude/Qianfan-OCR · model · 8 downloads
- 🤗 beaupi/Qianfan-OCR-oQ6 · model · 11 downloads
Taxonomy
Topics: Handwritten Text Recognition Techniques · Topic Modeling · Image Retrieval and Classification Techniques
