TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens
Ya-Qi Yu, Minghui Liao, Jiwen Zhang, Jihao Wu

TL;DR
TextHawk2 is a bilingual vision-language model that achieves state-of-the-art performance in OCR and grounding tasks with 16 times fewer image tokens, enabling resource-efficient deployment and broad task applicability.
Contribution
TextHawk2 introduces a token compression technique, reinforces the visual encoder through co-training, and draws on diversified data sources, significantly improving efficiency and performance over prior LVLMs.
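To make the 16x figure concrete, the following is a minimal arithmetic sketch of how a fixed compression ratio changes the image-token budget. It does not reproduce TextHawk2's actual compression module. The patch size of 14, the 448x448 input resolution, and the helper names `patch_tokens` and `compressed_tokens` are illustrative assumptions; only the ratio of 16 comes from the paper's summary.

```python
import math


def patch_tokens(height: int, width: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT-style encoder emits for one image.

    patch_size=14 is an assumed value for illustration, not taken from the paper.
    """
    return (height // patch_size) * (width // patch_size)


def compressed_tokens(n_tokens: int, ratio: int = 16) -> int:
    """Token count after compressing by a fixed ratio (16x, as stated in the summary)."""
    return math.ceil(n_tokens / ratio)


# Hypothetical 448x448 input: 32 x 32 = 1024 patch tokens before compression.
before = patch_tokens(448, 448)
after = compressed_tokens(before)
print(before, after)
```

Under these assumptions a single 448x448 crop drops from 1024 tokens to 64, which is why a high-resolution document split into many crops can stay within a small context budget.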
Findings
Achieves 78.4% OCRBench accuracy
Attains 81.4% ChartQA accuracy
Reaches 89.6% ANLS on DocVQA
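The DocVQA result above is reported in ANLS (Average Normalized Levenshtein Similarity). As a rough illustration of how that metric is commonly computed, here is a minimal sketch, not the official evaluation script. It assumes the usual threshold of 0.5 and a simple lowercase-and-strip normalization; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute all cost 1)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best similarity against any reference answer.

    predictions: list of predicted answer strings.
    references: list of lists of acceptable answer strings, one list per question.
    Similarity is 1 - normalized edit distance, or 0 when that distance
    reaches the threshold (assumed 0.5 here).
    """
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

The threshold means a near-miss transcription (a few wrong characters) still earns partial credit, while an answer that differs in half or more of its characters scores zero.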
Abstract
Reading dense text and locating objects within images are fundamental abilities for Large Vision-Language Models (LVLMs) tasked with advanced jobs. Previous LVLMs, including superior proprietary models like GPT-4o, have struggled to excel in both tasks simultaneously. Moreover, previous LVLMs with fine-grained perception cost thousands of tokens per image, making them resource-intensive. We present TextHawk2, a bilingual LVLM featuring efficient fine-grained perception and demonstrating cutting-edge performance across general-purpose, OCR, and grounding tasks with 16 times fewer image tokens. Critical improvements include: (1) Token Compression: Building on the efficient architecture of its predecessor, TextHawk2 significantly reduces the number of tokens per image by 16 times, facilitating training and deployment of the TextHawk series with minimal resources. (2) Visual Encoder…
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications
