HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

Lei Yao; Yong Chen; Yuejiao Su; Yi Wang; Moyun Liu; Lap-Pui Chau

arXiv:2603.02329 · cs.CV · March 4, 2026

TL;DR

HAMMER is a framework that uses multimodal large language models (MLLMs) for 3D object affordance grounding, integrating cross-modal information with the interaction intention shown in an image. It reports higher accuracy than existing methods on public datasets and robustness on corrupted benchmarks.

Contribution

The paper introduces HAMMER, a new method that uses cross-modal integration and intention-driven cues from MLLMs for 3D affordance grounding, avoiding explicit attribute descriptions.

Findings

01. Outperforms existing methods on public datasets.

02. Demonstrates robustness on corrupted benchmarks.

03. Effectively integrates multimodal information for accurate localization.

Abstract

Humans commonly identify 3D object affordance through observed interactions in images or videos, and once formed, such knowledge can be generically generalized to novel objects. Inspired by this principle, we advocate for a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding, namely HAMMER. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we alternatively aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly excavates object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement and introduce a…
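The abstract does not give architectural details, so the following is only a generic illustration of the idea of refining 3D point features with cues from an MLLM, not the paper's method. It is a minimal single-level cross-attention sketch in NumPy (the paper describes a hierarchical mechanism). All names, shapes, and the sigmoid scoring head (`cross_attend`, `affordance_heatmap`, `point_feats`, `mllm_tokens`, `contact_embedding`) are assumptions introduced for illustration, and the inputs are random placeholders rather than real features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(point_feats, mllm_tokens):
    """Hypothetical refinement step: each 3D point attends over MLLM tokens.

    point_feats: (N, d) per-point features (queries).
    mllm_tokens: (T, d) token embeddings from the MLLM (keys and values).
    Returns refined per-point features of shape (N, d).
    """
    d = point_feats.shape[-1]
    scores = point_feats @ mllm_tokens.T / np.sqrt(d)  # (N, T) scaled dot products
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return point_feats + weights @ mllm_tokens         # residual update

def affordance_heatmap(point_feats, contact_embedding):
    """Hypothetical scoring head: per-point affordance probability in [0, 1]."""
    logits = point_feats @ contact_embedding           # (N,)
    return 1.0 / (1.0 + np.exp(-logits))               # sigmoid

# Random placeholders standing in for real encoder outputs (assumed shapes).
rng = np.random.default_rng(0)
point_feats = rng.normal(size=(2048, 64))   # 2048 points, 64-dim features
mllm_tokens = rng.normal(size=(16, 64))     # 16 MLLM tokens, 64-dim
contact_embedding = rng.normal(size=64)     # stand-in for a contact-aware embedding

refined = cross_attend(point_feats, mllm_tokens)
heatmap = affordance_heatmap(refined, contact_embedding)
```

Here `heatmap` holds one score per point, which is the usual output form for affordance grounding on point clouds. A hierarchical variant would repeat the attention step at several feature resolutions, but how HAMMER does this is not stated in the abstract.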

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning