Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Dong Yu, Meng Jiang

TL;DR
Leopard is a new vision-language model designed for understanding and reasoning across multiple text-rich images, addressing data scarcity and resolution challenges, and outperforming existing models on complex multi-image tasks.
Contribution
We introduce Leopard, a multimodal model with a novel high-resolution multi-image encoding module and a large, high-quality instruction dataset for text-rich, multi-image scenarios.
Findings
Outperforms state-of-the-art models like Llama-3.2 and Qwen2-VL on multi-image benchmarks.
Achieves high performance with only 1.2 million training instances.
Open-sourced code and data facilitate further research.
Abstract
Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction tuning datasets for text-rich multi-image scenarios, and (2) the difficulty in balancing image resolution with visual feature sequence length. To address these challenges, we propose Leopard, an MLLM tailored for handling vision-language tasks involving…
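The second challenge named in the abstract, balancing image resolution against visual feature sequence length, can be made concrete with a little arithmetic. The sketch below is illustrative only and is not Leopard's actual configuration: it assumes a ViT-style encoder with a hypothetical 14-pixel patch, and it models pixel shuffling (mentioned in the reviews below) as merging each r x r block of patch features into one token. The function names `visual_tokens` and `total_tokens` are made up for this example.

```python
def visual_tokens(height, width, patch=14, shuffle=1):
    """Count the visual tokens produced for one image.

    patch:   side of a square ViT patch in pixels (assumed value, not from the paper).
    shuffle: pixel-shuffle factor r; each r x r block of patch features is
             merged into a single token, so the count drops by r * r.
    """
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    rows, cols = height // patch, width // patch
    if rows % shuffle or cols % shuffle:
        raise ValueError("patch grid must be divisible by the shuffle factor")
    return (rows // shuffle) * (cols // shuffle)


def total_tokens(image_sizes, patch=14, shuffle=1):
    """Sum the visual tokens over several (height, width) images."""
    return sum(visual_tokens(h, w, patch, shuffle) for h, w in image_sizes)


if __name__ == "__main__":
    four_pages = [(672, 672)] * 4  # four hypothetical high-resolution pages
    print(total_tokens(four_pages))             # tokens without pixel shuffling
    print(total_tokens(four_pages, shuffle=2))  # tokens with a 2 x 2 shuffle
```

Under these assumptions, doubling the side length of an image quadruples its token count, and every additional image adds its full count to the sequence, which is why multi-image, text-rich inputs quickly strain the context budget unless tokens are compressed.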
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
- The authors deliver a specialized text-rich, multi-image instruction-tuning dataset that is highly comprehensive, covering diverse domains (documents, slides, charts, tables, webpages).
- The paper presentation is clear and informative enough to demonstrate their key ideas.
- The proposed methodology and model seem to be just a combination of existing works such as llava-interleave and pixel shuffling. The novelty is not clear enough, at least to me, based on the current manuscript. While the adaptive high-resolution encoding is presented as novel, the paper lacks a clear theoretical justification for why it should perform better than fixed allocation strategies. The relationship to and improvements upon prior work need to be more explicitly discussed to establish the t
Please refer to Questions
This paper has two major contributions:
1. LEOPARD-INSTRUCT: The paper creates a large-scale, high-quality multimodal instruction-tuning dataset specifically curated for text-rich, multi-image scenarios. This dataset addresses the scarcity of such resources and provides a robust foundation for training and improving MLLMs on complex, real-world tasks involving multiple interconnected images.
2. Adaptive High-Resolution Multi-Image Encoding: The paper develops an adaptive high-resolution multi-im
The paper's weaknesses include:
1. Lack of detailed quality control for data generated by GPT-4o, which could impact model performance.
2. Limited comparison with the latest visual encoding strategies like Qwen2-VL's dynamic resolution approach.
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
