TL;DR
This paper introduces SOUBench, a benchmark with accompanying datasets for evaluating and improving how well multimodal large language models (MLLMs) understand small objects in driving, aerial, and underwater scenes.
Contribution
It presents the first dedicated benchmark for small object understanding in MLLMs (SOUBench, with the SOU-VQA evaluation dataset), along with a training dataset (SOU-Train) to enhance their capabilities.
Findings
The 15 state-of-the-art MLLMs evaluated show weak small object understanding.
Fine-tuning on SOU-Train improves MLLMs' small object comprehension.
SOUBench provides a new foundation for future research in small object understanding.
Abstract
Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis, math and physics olympiads. However, Small Object Understanding (SOU) remains unexplored for these models. To fill this gap, we introduce SOUBench, the first comprehensive benchmark for exploring the small object understanding capability of existing MLLMs. Specifically, we first design an effective, automatic visual question-answer (VQA) generation strategy and use it to construct a new evaluation dataset, SOU-VQA, with 18,204 VQA pairs covering six sub-tasks and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation of 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to…
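As an illustration of how results on a dataset organized by scenario and sub-task might be broken down, here is a minimal sketch in Python. It is not the authors' evaluation code. The record layout (VQAItem), the sub-task labels (task_a, task_b), the sample records, and the use of case-insensitive exact-match accuracy are all assumptions made for this example; the abstract does not name the six sub-tasks or specify the metric.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VQAItem:
    """Hypothetical record layout for one evaluated VQA pair."""
    scenario: str    # "Driving", "Aerial", or "Underwater" (named in the abstract)
    sub_task: str    # placeholder label; the abstract does not name the sub-tasks
    answer: str      # ground-truth answer
    prediction: str  # model output


def accuracy_by(items, key):
    """Return {group: exact-match accuracy}, grouping items by the attribute `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        group = getattr(item, key)
        total[group] += 1
        # Assumed metric: case-insensitive exact match after trimming whitespace.
        if item.prediction.strip().lower() == item.answer.strip().lower():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Made-up records for illustration only; these are not results from the paper.
items = [
    VQAItem("Driving", "task_a", "B", "B"),
    VQAItem("Driving", "task_b", "A", "C"),
    VQAItem("Aerial", "task_a", "D", "d"),
    VQAItem("Underwater", "task_a", "A", "B"),
]

by_scenario = accuracy_by(items, "scenario")
by_sub_task = accuracy_by(items, "sub_task")
print(by_scenario)
print(by_sub_task)
```

Grouping by either attribute with the same function keeps the per-scenario and per-sub-task views consistent with each other. A real benchmark may use a different answer-matching rule.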
