3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
Makanjuola Ogunleye, Eman Abdelrahman, Ismini Lourentzou

TL;DR
This paper introduces 3D-VCD, an inference-time contrastive decoding method that reduces hallucinations in 3D embodied agents by contrasting predictions made under the original 3D scene graph with those made under a perturbed one.
Contribution
It is the first to apply visual contrastive decoding at inference time for hallucination mitigation in 3D embodied reasoning, without retraining.
Findings
3D-VCD improves grounded reasoning on the 3D-POPE and HEAL benchmarks.
The method suppresses predictions driven by language priors, which are insensitive to scene grounding.
It improves the reliability of embodied agents without additional training.
Abstract
Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method…
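The two steps the abstract describes, distorting the object-centric scene graph and contrasting predictions under the original and distorted contexts, can be illustrated with a minimal sketch. Everything named here is hypothetical: `SceneObject`, `distort_scene`, `contrastive_logits`, and the parameters `p_swap`, `noise`, `alpha`, and `beta` are illustrative choices, not the authors' code. The combination rule `(1 + alpha) * original - alpha * distorted` with a plausibility cutoff is an assumption borrowed from the usual 2D visual contrastive decoding formulation; the truncated abstract does not confirm that 3D-VCD uses exactly this rule.

```python
import math
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical object-centric node of a 3D scene graph."""
    category: str
    center: tuple  # (x, y, z) coordinates
    extent: tuple  # (width, height, depth)


def distort_scene(objects, categories, p_swap=0.5, noise=0.5, seed=0):
    """Return a distorted copy of the scene (illustrative perturbations only).

    Semantic perturbation: swap the category with probability p_swap.
    Geometric perturbation: add Gaussian noise to the coordinates and
    rescale the extents by a noisy factor.
    """
    rng = random.Random(seed)
    distorted = []
    for obj in objects:
        category = obj.category
        alternatives = [c for c in categories if c != obj.category]
        if alternatives and rng.random() < p_swap:
            category = rng.choice(alternatives)
        center = tuple(v + rng.gauss(0.0, noise) for v in obj.center)
        # Clamp so that corrupted extents stay strictly positive.
        extent = tuple(max(1e-3, v * (1.0 + rng.gauss(0.0, noise)))
                       for v in obj.extent)
        distorted.append(replace(obj, category=category,
                                 center=center, extent=extent))
    return distorted


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(logits_orig, logits_dist, alpha=1.0, beta=0.1):
    """Combine next-token logits from the original and distorted contexts.

    Assumed rule (from 2D visual contrastive decoding, not confirmed for
    3D-VCD): score = (1 + alpha) * original - alpha * distorted.
    Tokens whose original probability is below beta * max probability are
    masked out, so implausible tokens cannot be promoted by the contrast.
    """
    probs = softmax(logits_orig)
    cutoff = beta * max(probs)
    combined = []
    for lo, ld, p in zip(logits_orig, logits_dist, probs):
        if p < cutoff:
            combined.append(float("-inf"))
        else:
            combined.append((1.0 + alpha) * lo - alpha * ld)
    return combined


if __name__ == "__main__":
    scene = [SceneObject("chair", (1.0, 0.0, 2.0), (0.5, 1.0, 0.5))]
    print(distort_scene(scene, ["chair", "table", "lamp"], p_swap=1.0))

    # Toy logits for the answers ["yes", "no"] to "Is there a sofa?".
    # "yes" scores the same with or without a faithful scene, which
    # suggests it comes from a language prior; "no" depends on the scene.
    print(contrastive_logits([2.0, 1.5], [2.0, 0.5]))
```

In the toy example, "yes" has the higher original logit, but because its score does not change when the scene is distorted, the contrast leaves it at 2.0 while "no" rises to 2.5, so the scene-sensitive answer wins. In a real agent the two logit vectors would come from two forward passes of the same 3D-LLM, one per scene graph, with no retraining involved.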
