TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, Wei Zeng

TL;DR
TextHawk is a specialized multimodal large language model designed for document-oriented tasks, featuring novel components for efficient, fine-grained perception and demonstrating superior performance on relevant benchmarks.
Contribution
The paper introduces four dedicated components for efficient fine-grained perception in MLLMs and creates a new instruction-tuning dataset for document tasks, advancing the field.
Findings
Outperforms state-of-the-art methods on document perception benchmarks
Reduces computational cost through ReSampling and ReArrangement (ReSA)
Enhances visual perception with Multi-Level Cross-Attention (MLCA)
Abstract
Multimodal Large Language Models (MLLMs) have shown impressive results on various multimodal tasks. However, most existing MLLMs are not well suited for document-oriented tasks, which require fine-grained image perception and information compression. In this paper, we present TextHawk, a MLLM that is specifically designed for document-oriented tasks, while preserving the general capabilities of MLLMs. TextHawk is aimed to explore efficient fine-grained perception by designing four dedicated components. Firstly, a ReSampling and ReArrangement (ReSA) module is proposed to reduce the redundancy in the document texts and lower the computational cost of the MLLM. We explore encoding the positions of each local feature by presenting Scalable Positional Embeddings (SPEs), which can preserve the scalability of various image sizes. A Query Proposal Network (QPN) is then adopted to initialize the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
