OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, Xiang Bai

TL;DR
This paper introduces OCRBench, a comprehensive benchmark for evaluating OCR capabilities of large multimodal models like GPT4V and Gemini across diverse text-related visual tasks, revealing their strengths and weaknesses.
Contribution
The paper presents OCRBench, the most extensive OCR evaluation benchmark with 29 datasets, and provides a systematic assessment of large multimodal models' OCR performance.
Findings
The evaluated models perform well on some text recognition tasks.
They are weaker on multilingual and handwritten text recognition.
Baseline results offer a foundation for future improvements.
Abstract
Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these…
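To make the idea of a per-task evaluation concrete, here is a minimal sketch of how scores could be aggregated across the five task categories named in the abstract. The metric (a case-insensitive substring match), the sample format, and the function names `is_correct` and `score_by_task` are assumptions made for illustration; they are not taken from the paper and may differ from how OCRBench actually scores models.

```python
from collections import defaultdict

# Task names come from the abstract; everything else here is illustrative.
TASKS = (
    "Text Recognition",
    "Scene Text-Centric VQA",
    "Document-Oriented VQA",
    "KIE",
    "HMER",
)


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def is_correct(prediction: str, answer: str) -> bool:
    """Assumed metric: the reference answer appears inside the model output."""
    return normalize(answer) in normalize(prediction)


def score_by_task(samples):
    """Return accuracy per task for (task, prediction, answer) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for task, prediction, answer in samples:
        totals[task] += 1
        hits[task] += is_correct(prediction, answer)
    return {task: hits[task] / totals[task] for task in totals}


if __name__ == "__main__":
    # Hypothetical model outputs, invented for this example.
    demo = [
        ("Text Recognition", "The sign says OPEN", "open"),
        ("Text Recognition", "closed", "open"),
        ("KIE", "Total: 12.50", "12.50"),
    ]
    print(score_by_task(demo))
```

A containment check is lenient toward verbose model outputs, which is one reason it is used here; a stricter exact-match or edit-distance metric would be an equally plausible choice for some of these tasks.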
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques
