LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

Yuqian Yuan; Wenqiao Zhang; Juekai Lin; Yu Zhong; Mingjian Gao; Binhe Yu; Yunqi Cao; Wentong Li; Yueting Zhuang; Beng Chin Ooi

arXiv:2604.11789 · cs.CV · April 21, 2026


TL;DR

This paper reviews recent advances in integrating large multimodal models with object-centric vision, focusing on understanding, segmentation, editing, and generation of visual objects.

Contribution

It provides a structured overview of the key paradigms, strategies, and challenges in developing object-centric multimodal vision systems.

Findings

01. Organizes the literature into four major themes: understanding, segmentation, editing, and generation.

02. Summarizes modeling paradigms, learning strategies, and evaluation protocols.

03. Discusses open challenges such as instance permanence and spatial control.

Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision-language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into…
