OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

Yuliang Liu; Zhang Li; Mingxin Huang; Biao Yang; Wenwen Yu; Chunyuan Li; Xucheng Yin; Cheng-lin Liu; Lianwen Jin; Xiang Bai

arXiv:2305.07895 · cs.CV · December 17, 2024 · 21 cites

Open Access · 1 Repo · 3 Datasets

TL;DR

This paper introduces OCRBench, a comprehensive benchmark for evaluating OCR capabilities of large multimodal models like GPT4V and Gemini across diverse text-related visual tasks, revealing their strengths and weaknesses.

Contribution

The paper presents OCRBench, which at 29 datasets is described as the most comprehensive OCR evaluation benchmark available, and provides a systematic assessment of the OCR performance of large multimodal models.

Findings

01. Models show strengths in certain text recognition tasks.

02. Weaknesses are identified in multilingual and handwritten text recognition.

03. Baseline results offer a foundation for future improvements.

Abstract

Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these…

Code & Models

Repositories

yuliang-liu/multimodalocr (PyTorch, official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques