Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks

Mengzhao Jia; Wenhao Yu; Kaixin Ma; Tianqing Fang; Zhihan Zhang; Siru Ouyang; Hongming Zhang; Dong Yu; Meng Jiang

arXiv:2410.01744·cs.CV·June 9, 2025

TL;DR

Leopard is a new vision-language model designed for understanding and reasoning across multiple text-rich images, addressing data scarcity and resolution challenges, and outperforming existing models on complex multi-image tasks.

Contribution

We introduce Leopard, a multimodal model with a novel high-resolution multi-image encoding module and a large, high-quality instruction dataset for text-rich, multi-image scenarios.

Findings

01 · Outperforms state-of-the-art models like Llama-3.2 and Qwen2-VL on multi-image benchmarks.

02 · Achieves high performance with only 1.2 million training instances.

03 · Open-sourced code and data facilitate further research.

Abstract

Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction tuning datasets for text-rich multi-image scenarios, and (2) the difficulty in balancing image resolution with visual feature sequence length. To address these challenges, we propose Leopard, an MLLM tailored for handling vision-language tasks involving…
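The trade-off named in challenge (2) can be made concrete with a little arithmetic. The sketch below is illustrative only and is not Leopard's implementation: the patch size of 14, the image sizes, the channel width, and the helper names `num_visual_tokens` and `pixel_unshuffle_tokens` are all assumptions chosen for the example. It shows how the number of visual tokens grows with resolution, and how a space-to-depth ("pixel shuffle" style) rearrangement, one common way to shorten the sequence, folds neighbouring patch features into a single wider token.

```python
import numpy as np


def num_visual_tokens(height, width, patch_size):
    """Number of patch tokens a ViT-style encoder emits for one image."""
    return (height // patch_size) * (width // patch_size)


def pixel_unshuffle_tokens(features, factor):
    """Fold each factor x factor block of neighbouring patch features into one token.

    features: array of shape (H, W, C), a grid of patch features.
    Returns an array of shape (H // factor, W // factor, C * factor**2).
    """
    h, w, c = features.shape
    assert h % factor == 0 and w % factor == 0, "grid must divide evenly"
    # Split each spatial axis into (block index, offset inside block).
    x = features.reshape(h // factor, factor, w // factor, factor, c)
    # Bring the two in-block offsets next to the channel axis.
    x = x.transpose(0, 2, 1, 3, 4)
    # Merge offsets and channels: fewer tokens, each one wider.
    return x.reshape(h // factor, w // factor, c * factor * factor)


# Assumed numbers: 14-pixel patches, square inputs.
low_res = num_visual_tokens(336, 336, 14)    # 24 * 24 patches
high_res = num_visual_tokens(672, 672, 14)   # 48 * 48 patches

# Hypothetical feature grid for one 672 x 672 image with 8 channels.
grid = np.arange(48 * 48 * 8, dtype=np.float32).reshape(48, 48, 8)
merged = pixel_unshuffle_tokens(grid, 2)

print(low_res, high_res, merged.shape)
```

Under these assumed numbers, doubling the side length quadruples the token count per image, and several such images multiply it again. Merging 2 x 2 blocks brings the token count of the high-resolution image back to that of the low-resolution one while keeping every feature value, at the cost of a four times wider token.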

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 3

Strengths

- The authors deliver a dedicated text-rich, multi-image instruction-tuning dataset that is highly comprehensive and covers diverse domains (documents, slides, charts, tables, webpages).
- The paper's presentation is clear and informative enough to convey its key ideas.

Weaknesses

- The proposed methodology and model seem to be just a combination of existing works such as llava-interleave and pixel shuffling. The novelty is not clear enough, at least to me, based on the current manuscript. While the adaptive high-resolution encoding is presented as novel, the paper lacks a clear theoretical justification for why it should perform better than fixed allocation strategies. The relationship to and improvements upon prior work need to be more explicitly discussed to establish the t

Reviewer 02 · Rating 3 · Confidence 4

Strengths

Please refer to Questions

Weaknesses

Please refer to Questions

Reviewer 03 · Rating 5 · Confidence 4

Strengths

This paper has two major contributions:
1. LEOPARD-INSTRUCT: The paper creates a large-scale, high-quality multimodal instruction-tuning dataset specifically curated for text-rich, multi-image scenarios. This dataset addresses the scarcity of such resources and provides a robust foundation for training and improving MLLMs on complex, real-world tasks involving multiple interconnected images.
2. Adaptive High-Resolution Multi-Image Encoding: The paper develops an adaptive high-resolution multi-im

Weaknesses

The paper's weaknesses include:
1. Lack of detailed quality control for data generated by GPT-4o, which could impact model performance.
2. Limited comparison with the latest visual encoding strategies like Qwen2-VL's dynamic resolution approach.

Code & Models

Repositories

jill0001/leopard
pytorch · Official

Models

Datasets

wyu1/Leopard-Instruct
dataset· 96k dl

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications