3D Spatial Understanding in MLLMs: Disambiguation and Evaluation
Chun-Peng Chang, Alain Pagani, Didier Stricker

TL;DR
This paper aims to improve 3D spatial understanding in Multimodal Large Language Models (MLLMs), focusing on localizing and disambiguating objects in complex environments, a capability that is crucial for collaborative robotics.
Contribution
The authors propose simple techniques that improve MLLMs' ability to localize and disambiguate objects in 3D scenes, achieving state-of-the-art results on sentence similarity metrics and stronger 3D visual grounding.
Findings
Achieved state-of-the-art performance on sentence similarity metrics.
Demonstrated improved 3D visual grounding capabilities.
Enhanced model ability to disambiguate objects in complex scenes.
Abstract
Multimodal Large Language Models (MLLMs) have made significant progress in tasks such as image captioning and question answering. However, while these models can generate realistic captions, they often struggle with providing precise instructions, particularly when it comes to localizing and disambiguating objects in complex 3D environments. This capability is critical as MLLMs become more integrated with collaborative robotic systems. In scenarios where a target object is surrounded by similar objects (distractors), robots must deliver clear, spatially-aware instructions to guide humans effectively. We refer to this challenge as contextual object localization and disambiguation, which imposes stricter constraints than conventional 3D dense captioning, especially regarding ensuring target exclusivity. In response, we propose simple yet effective techniques to enhance the model's ability…
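To make the notion of target exclusivity concrete, here is a minimal toy sketch in Python. It is not the authors' method: the `SceneObject` and `Description` classes, the "near" relation, and the distance threshold are all hypothetical names and assumptions introduced only for illustration. The idea it shows is that a description is exclusive when it matches the target object and none of the distractors.

```python
from dataclasses import dataclass
from math import dist
from typing import Optional


@dataclass(frozen=True)
class SceneObject:
    # Hypothetical toy representation of an object in a 3D scene.
    name: str
    category: str
    position: tuple  # (x, y, z), assumed to be in metres


@dataclass(frozen=True)
class Description:
    # Hypothetical structured form of an instruction such as
    # "the chair near the window".
    category: str
    near_category: Optional[str] = None
    max_distance: float = 1.0  # assumed threshold for "near"


def matching_objects(scene, desc):
    """Return every object in the scene that satisfies the description."""
    matches = []
    for obj in scene:
        if obj.category != desc.category:
            continue
        if desc.near_category is not None:
            anchors = [a for a in scene
                       if a.category == desc.near_category and a is not obj]
            if not any(dist(obj.position, a.position) <= desc.max_distance
                       for a in anchors):
                continue
        matches.append(obj)
    return matches


def is_exclusive(scene, target, desc):
    """True only if the description matches the target and nothing else."""
    return matching_objects(scene, desc) == [target]


scene = [
    SceneObject("chair_1", "chair", (0.0, 0.0, 0.0)),   # distractor
    SceneObject("chair_2", "chair", (3.0, 0.0, 0.0)),   # target
    SceneObject("window_1", "window", (3.5, 0.0, 1.0)),
]
target = scene[1]

ambiguous = Description("chair")
exclusive = Description("chair", near_category="window", max_distance=1.5)

print(is_exclusive(scene, target, ambiguous))
print(is_exclusive(scene, target, exclusive))
```

In this toy scene, "the chair" alone matches both chairs and so fails the exclusivity check, while adding the spatial relation to the window singles out the target. A real system would derive such relations from 3D scene data and free-form language rather than from hand-written structures.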
Taxonomy
Topics: Semantic Web and Ontologies
