TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens

Ya-Qi Yu; Minghui Liao; Jiwen Zhang; Jihao Wu

arXiv:2410.05261·cs.CV·October 8, 2024

TL;DR

TextHawk2 is a bilingual vision-language model that achieves state-of-the-art performance in OCR and grounding tasks with 16 times fewer image tokens, enabling resource-efficient deployment and broad task applicability.

Contribution

It introduces a token compression technique, visual encoder reinforcement through co-training, and diversified data sources, significantly improving efficiency and performance over prior LVLMs.

Findings

01. Achieves 78.4% OCRBench accuracy

02. Attains 81.4% ChartQA accuracy

03. Reaches 89.6% ANLS on DocVQA

Abstract

Reading dense text and locating objects within images are fundamental abilities for Large Vision-Language Models (LVLMs) tasked with advanced jobs. Previous LVLMs, including superior proprietary models like GPT-4o, have struggled to excel in both tasks simultaneously. Moreover, previous LVLMs with fine-grained perception cost thousands of tokens per image, making them resource-intensive. We present TextHawk2, a bilingual LVLM featuring efficient fine-grained perception and demonstrating cutting-edge performance across general-purpose, OCR, and grounding tasks with 16 times fewer image tokens. Critical improvements include: (1) Token Compression: Building on the efficient architecture of its predecessor, TextHawk2 reduces the number of tokens per image by a factor of 16, facilitating training and deployment of the TextHawk series with minimal resources. (2) Visual Encoder…

Code & Models

Repositories

yuyq96/texthawk

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications