StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond

Pengyuan Lyu; Yulin Li; Hao Zhou; Weihong Ma; Xingyu Wan; Qunyi Xie; Liang Wu; Chengquan Zhang; Kun Yao; Errui Ding; Jingdong Wang

arXiv:2405.21013·cs.CV·June 5, 2024

TL;DR

StrucTexTv3 is an efficient vision-language model designed for understanding text-rich images, combining multi-scale transformers, token sampling, and instruction learning to achieve state-of-the-art performance across perception and comprehension tasks.

Contribution

The paper introduces StrucTexTv3, a novel model that integrates multi-scale transformers, MG-Sampler, and instruction learning for improved text-rich image understanding.

Findings

01 Achieved state-of-the-art results in perception tasks

02 Significantly improved performance in comprehension tasks

03 Demonstrated robustness across diverse scenarios

Abstract

Text-rich images have significant and extensive value, deeply integrated into various aspects of human life. Notably, both visual cues and linguistic symbols in text-rich images play crucial roles in information transmission but are accompanied by diverse challenges. Therefore, the efficient and effective understanding of text-rich images is a crucial litmus test for the capability of Vision-Language Models. We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images. The significant design of StrucTexTv3 is presented in the following aspects: Firstly, we adopt a combination of an effective multi-scale reduced visual transformer and a multi-granularity token sampler (MG-Sampler) as a visual token generator, successfully solving the challenges of high-resolution input and complex representation learning for text-rich…

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: High-resolution input