Integrating Multimodal Large Language Model Knowledge into Amodal Completion
Heecheol Yun, Eunho Yang

TL;DR
This paper introduces AmodalCG, a framework that uses the real-world knowledge of multimodal large language models (MLLMs) to guide amodal completion: reconstructing the occluded parts of people and objects in an image, a task that matters for autonomous vehicles and robotics.
Contribution
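The paper proposes a method that uses real-world knowledge from MLLMs to explicitly guide and refine amodal completion. Previous approaches either rely solely on visual generative models, which lack such knowledge, or use it only during the segmentation stage.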
Findings
The authors report significant improvement over existing methods on real-world image completion tasks.
Selective use of MLLM guidance based on occlusion severity enhances accuracy.
Iterative refinement with visual generative models produces more complete reconstructions.
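The selective guidance and iterative refinement described in the findings can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: the function `amodal_complete`, the callables `ask_mllm`, `inpaint`, and `is_complete`, the `heavy_threshold` of 0.5, and the `max_rounds` limit are all hypothetical names and values chosen for illustration, and the occlusion ratio is assumed to be supplied by some earlier step. Images are represented as plain strings so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CompletionResult:
    image: str        # stand-in for the completed image
    used_mllm: bool   # whether MLLM guidance was requested
    rounds: int       # number of generative refinement passes run


def amodal_complete(
    image: str,
    occlusion_ratio: float,
    ask_mllm: Callable[[str], str],
    inpaint: Callable[[str, Optional[str]], str],
    is_complete: Callable[[str], bool],
    heavy_threshold: float = 0.5,  # assumed cutoff, not from the paper
    max_rounds: int = 3,           # assumed iteration cap, not from the paper
) -> CompletionResult:
    """Gate MLLM guidance on occlusion severity, then refine iteratively."""
    if not 0.0 <= occlusion_ratio <= 1.0:
        raise ValueError("occlusion_ratio must lie in [0, 1]")

    # Selective guidance: query the MLLM only for heavily occluded targets.
    guidance = ask_mllm(image) if occlusion_ratio >= heavy_threshold else None

    # Iterative refinement: re-run the generative model until the result
    # is judged complete or the round budget is exhausted.
    current = image
    rounds = 0
    while rounds < max_rounds:
        current = inpaint(current, guidance)
        rounds += 1
        if is_complete(current):
            break

    return CompletionResult(current, guidance is not None, rounds)
```

In this sketch, a lightly occluded object skips the MLLM call entirely and goes straight to the generative model, which is one plausible reading of why selective invocation could help: guidance is only paid for, and only risks misleading the generator, when the visible evidence is insufficient.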
Abstract
With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion to selectively invoke MLLM guidance only when the target object is heavily…
