PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding
Ansel Blume, Jeonghwan Kim, Hyeonjeong Ha, Elen Chatikyan, Xiaomeng Jin, Khanh Duy Nguyen, Nanyun Peng, Kai-Wei Chang, Derek Hoiem, Heng Ji

TL;DR
This paper introduces PARTONOMY, a challenging benchmark for part-level visual understanding in large multimodal models, and proposes PLUM, a novel segmenting LMM that improves part grounding and reasoning capabilities.
Contribution
The paper presents a new benchmark dataset for part grounding, identifies limitations in current models, and introduces PLUM, a new segmenting LMM with improved architecture and performance.
Findings
State-of-the-art LMMs perform poorly on part grounding (e.g., 5.9% gIoU).
PLUM outperforms existing segmenting LMMs on reasoning and VQA tasks.
Finetuned PLUM is competitive with models trained on more data.
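The gIoU figure quoted above can be made concrete with a small sketch. It assumes gIoU is the mean of per-image intersection-over-union scores, the convention commonly used when evaluating segmenting LMMs; the paper's exact evaluation protocol is not reproduced in this summary. The function names `mask_iou` and `giou` and the nested-list mask format are hypothetical, chosen only for illustration.

```python
from typing import Sequence

# Assumption: a mask is a grid of 0/1 values given as rows of ints.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean of per-image IoU scores (the assumed meaning of gIoU here)."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        raise ValueError("at least one mask pair is required")
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    predictions = [[[1, 1], [0, 0]], [[0, 0], [0, 1]]]
    ground_truth = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
    print(f"gIoU = {giou(predictions, ground_truth):.2%}")
```

Because each image contributes equally to the average, small part masks weigh as much as large ones under this definition, which is one reason part grounding scores can be very low.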
Abstract
Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane), and challenges models to compare objects' parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves…
Taxonomy
Topics: Semantic Web and Ontologies
