PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

Cheng Cui; Yubo Zhang; Ting Sun; Xueqing Wang; Hongen Liu; Manhui Lin; Yue Zhang; Tingquan Gao; Changda Zhou; Jiaxuan Liu; Zelun Zhang; Jing Zhang; Jun Zhang; Yi Liu

arXiv:2603.24373 · cs.CV · March 26, 2026

TL;DR

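PP-OCRv5 is a lightweight, 5-million-parameter OCR system that performs competitively with billion-parameter vision-language models on standard OCR benchmarks while localizing text more precisely and hallucinating less. The authors attribute this to training-data quality and diversity rather than model size.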

Contribution

This paper introduces PP-OCRv5, demonstrating that a small, optimized OCR model can achieve competitive performance through data-centric strategies rather than architectural scaling.

Findings

01

High-quality, diverse, and accurate training data significantly boost OCR performance.

02

PP-OCRv5 achieves performance competitive with many billion-parameter VLMs on standard OCR benchmarks.

03

Data quality and diversity are more critical than model size for OCR accuracy.

Abstract

The advent of "OCR 2.0" and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propensity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accuracy, this paper introduces PP-OCRv5, a meticulously optimized, lightweight OCR system with merely 5 million parameters. We demonstrate that PP-OCRv5 achieves performance competitive with many billion-parameter VLMs on standard OCR benchmarks, while offering superior localization precision and reduced hallucinations. The cornerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dissect the role of training data by quantifying three critical…

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques