OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Yufeng Zhong; Lei Chen; Xuanle Zhao; Wenkang Han; Liming Zheng; Jing Huang; Deyang Jiang; Yilin Cao; Lin Ma; Zhixiong Zeng

arXiv:2601.21639·cs.CV·February 5, 2026

Open Access

TL;DR

OCRVerse is an end-to-end OCR system that unifies text-centric and vision-centric recognition in a single model, handling diverse visual data such as documents, charts, and web pages, and is trained with a multi-domain approach.

Contribution

The paper introduces OCRVerse, the first holistic OCR framework that combines text-centric and vision-centric recognition in a unified model, supported by a novel multi-domain training approach.

Findings

01. Achieves competitive results on diverse OCR datasets.

02. Effectively handles both text and visual element recognition.

03. Demonstrates robustness across multiple domains.

Abstract

The development of large vision-language models drives demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (vision-centric OCR), such as charts, web pages, and scientific plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, for example in data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic OCR method that unifies text-centric OCR and vision-centric OCR in an end-to-end manner. To this end, we construct comprehensive…

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Topic Modeling