Can Multimodal Large Language Models Truly Understand Small Objects?

Fujun Han; Junan Chen; Xintong Zhu; Jingqi Ye; Xuanjie Mao; Tao Chen; Peng Ye

arXiv:2604.22884·cs.CV·April 28, 2026

TL;DR

This paper introduces SOUBench, a benchmark with accompanying evaluation and training datasets for measuring and improving how well multimodal large language models (MLLMs) understand small objects in driving, aerial, and underwater scenes.

Contribution

It presents the first benchmark dedicated to small object understanding in multimodal models, built on the new SOU-VQA evaluation dataset, along with SOU-Train, a training dataset for improving that capability.

Findings

01 The 15 state-of-the-art MLLMs evaluated show weak small object understanding.

02 Fine-tuning on SOU-Train improves MLLMs' small object understanding.

03 SOUBench provides a new foundation for future research in small object understanding.

Abstract

Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis, math and physics olympiads. However, their ability on Small Object Understanding (SOU) tasks remains unexplored. To fill this gap, we introduce SOUBench, the first comprehensive benchmark for exploring the small object understanding capability of existing MLLMs. Specifically, we first design an effective, automatic visual question-answer generation strategy and use it to construct a new SOU-VQA evaluation dataset with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation of 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to…
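
The abstract describes scoring models on VQA pairs grouped by scenario and sub-task. A minimal sketch of that kind of breakdown follows, assuming each record is a dictionary holding a ground-truth answer and a model prediction. The field names (`scenario`, `sub_task`, `answer`, `prediction`), the helper `accuracy_by_group`, the placeholder sub-task names, and the case-insensitive exact-match scoring are all assumptions for illustration; they are not taken from the SOU-VQA release, and the toy records are invented.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction of exact-match answers}, grouped by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r[key]
        total[group] += 1
        # Assumed scoring rule: case-insensitive exact match after trimming.
        # The benchmark's actual scoring may differ.
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}


# Toy records invented for illustration only (not SOU-VQA data).
# "task_a" and "task_b" are placeholder sub-task names.
records = [
    {"scenario": "Driving", "sub_task": "task_a", "answer": "B", "prediction": "B"},
    {"scenario": "Driving", "sub_task": "task_b", "answer": "A", "prediction": "C"},
    {"scenario": "Aerial", "sub_task": "task_a", "answer": "D", "prediction": "d "},
    {"scenario": "Underwater", "sub_task": "task_b", "answer": "A", "prediction": "B"},
]

per_scenario = accuracy_by_group(records, "scenario")
per_sub_task = accuracy_by_group(records, "sub_task")

for name, acc in sorted(per_scenario.items()):
    print(f"{name}: {acc:.2f}")
```

Reporting accuracy per group rather than one overall number makes it easier to see whether a model is weak in one scenario or sub-task in particular.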

Code & Models

Repositories

Hanfj-X/SOU (GitHub)
