PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

Ansel Blume; Jeonghwan Kim; Hyeonjeong Ha; Elen Chatikyan; Xiaomeng Jin; Khanh Duy Nguyen; Nanyun Peng; Kai-Wei Chang; Derek Hoiem; Heng Ji

arXiv:2505.20759·cs.CV·October 28, 2025

TL;DR

This paper introduces PARTONOMY, a challenging benchmark for part-level visual understanding in large multimodal models, and proposes PLUM, a novel segmenting LMM that improves part grounding and reasoning capabilities.

Contribution

The paper presents a new benchmark dataset for part grounding, identifies limitations in current models, and introduces PLUM, a new segmenting LMM with improved architecture and performance.

Findings

01 State-of-the-art LMMs perform poorly on part grounding (e.g., 5.9% gIoU).

02 PLUM outperforms existing segmenting LMMs on reasoning and VQA tasks.

03 Finetuned PLUM is competitive with models trained on more data.
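
The gIoU figure in the first finding is not defined on this page. A minimal sketch follows, assuming the convention common in the segmenting-LMM literature, where gIoU is the mean of per-image intersection-over-union scores between predicted and ground-truth masks. The function names and the handling of empty masks are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Sequence

# A binary mask is a grid of 0/1 values, one inner sequence per image row.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def g_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Mean of per-image IoUs (the assumed definition of gIoU)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        raise ValueError("need at least one mask pair")
    return sum(mask_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# Two toy 2x2 examples: one half-correct prediction, one exact match.
preds = [[[1, 1], [0, 0]], [[0, 0], [0, 1]]]
gts = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
score = g_iou(preds, gts)
```

Under this definition every image contributes equally regardless of part size, so small parts that are missed entirely pull the score down as much as large ones.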

Abstract

Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane), and challenges models to compare objects' parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves…

Code & Models

Models

wjdghks950/plum-13b-pretrained
model · ♡ 1

Datasets

partonomy/partonomy-core
dataset · 30 dl

Videos

PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding · SlidesLive

Taxonomy

Topics: Semantic Web and Ontologies