LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
Yuqian Yuan, Wenqiao Zhang, Juekai Lin, Yu Zhong, Mingjian Gao, Binhe Yu, Yunqi Cao, Wentong Li, Yueting Zhuang, Beng Chin Ooi

TL;DR
This paper reviews recent advances in integrating large multimodal models with object-centric vision, focusing on understanding, segmentation, editing, and generation of visual objects.
Contribution
It provides a structured overview of the key paradigms, strategies, and challenges in developing object-centric multimodal vision systems.
Findings
Organized literature into four major themes: understanding, segmentation, editing, and generation.
Summarized modeling paradigms, learning strategies, and evaluation protocols.
Discussed open challenges such as instance permanence and spatial control.
Abstract
Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision-language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into…
