Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models

Bin Fu; Qiyang Wan; Jialin Li; Ruiping Wang; Xilin Chen

arXiv:2409.01560·cs.CV·September 4, 2024



TL;DR

This paper introduces ComBo, a new benchmark for evaluating the fundamental categorization abilities of large multimodal models, revealing their strengths and limitations in human-like perception and understanding.

Contribution

The paper presents ComBo, a novel benchmark that dissects the categorization process in LMMs, offering a detailed quantitative evaluation of their learning and usage capabilities.

Findings

01. LMMs show acceptable generalization in learning new categories

02. Gaps remain in fine-grained spatial perception and abstract understanding

03. Benchmark provides insights for improving interpretability and generalization

Abstract

Categorization, a core cognitive ability in humans that organizes objects based on common features, is essential to cognitive science as well as computer vision. To evaluate the categorization ability of visual AI models, various proxy recognition tasks, from fixed datasets to open-world scenarios, have been proposed. Recently developed Large Multimodal Models (LMMs) have demonstrated impressive results in high-level visual tasks, such as visual question answering and video temporal reasoning, by utilizing advanced architectures and large-scale multimodal instruction tuning. Previous researchers have developed holistic benchmarks to measure the high-level visual capability of LMMs, but there is still a lack of pure and in-depth quantitative evaluation of the most fundamental categorization ability. According to research on the human cognitive process, categorization can be seen as…


Taxonomy

Topics: Speech and dialogue systems