AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis
Xiaofei Wu, Yi Zhang, Yumeng Liu, Yuexin Ma, Yujiao Shi, Xuming He

TL;DR
AffordGrasp is a diffusion-based framework that generates physically stable, semantically accurate human grasp poses by jointly reasoning over object geometry, affordances, and textual instructions, improving over existing methods.
Contribution
The paper introduces a diffusion model that operates on affordance-aware representations, together with a distribution adjustment module, to improve both the semantic accuracy and the physical plausibility of synthesized grasps.
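As a rough illustration of what conditioning a diffusion sampler on several modalities can look like, the sketch below runs a standard DDPM-style reverse loop over a pose vector. It is not the authors' architecture: the linear noise schedule, the feature sizes, the `toy_noise_predictor` stand-in, and the names `object_feat`, `affordance_feat`, and `text_feat` are all assumptions made for the example, and the distribution adjustment module is not modeled.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (a common default, assumed here)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_noise_predictor(x_t, t, cond, weights, alpha_bars):
    """Hypothetical stand-in for a learned eps_theta(x_t, t, cond).

    It pretends the clean pose is a fixed linear map of the condition and
    returns the noise that would explain x_t under that guess.
    """
    x0_guess = weights @ cond
    return (x_t - np.sqrt(alpha_bars[t]) * x0_guess) / np.sqrt(1.0 - alpha_bars[t])

def sample_pose(cond, weights, pose_dim=8, num_steps=50, seed=0):
    """Ancestral sampling: start from Gaussian noise, denoise step by step."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(pose_dim)
    for t in range(num_steps - 1, -1, -1):
        eps = toy_noise_predictor(x, t, cond, weights, alpha_bars)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(pose_dim) if t > 0 else np.zeros(pose_dim)
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Hypothetical per-modality features, concatenated into one condition vector.
rng = np.random.default_rng(1)
object_feat = rng.standard_normal(4)      # e.g. pooled object-geometry feature
affordance_feat = rng.standard_normal(3)  # e.g. pooled affordance feature
text_feat = rng.standard_normal(5)        # e.g. instruction embedding
cond = np.concatenate([object_feat, affordance_feat, text_feat])
weights = rng.standard_normal((8, cond.size))

pose = sample_pose(cond, weights)
```

Because the toy predictor always points toward the same condition-dependent target, the sampled `pose` ends up at `weights @ cond`. A trained network would replace that stand-in and produce a distribution of poses instead of a single point.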
Findings
Reports substantial improvements in grasp quality, semantic accuracy, and diversity over prior methods.
Integrates object geometry, affordances, and textual instructions effectively during grasp generation.
Outperforms state-of-the-art methods on four instruction-augmented benchmarks.
Abstract
Generating human grasping poses that accurately reflect both object geometry and user-specified interaction semantics is essential for natural hand-object interactions in AR/VR and embodied AI. However, existing semantic grasping approaches struggle with the large modality gap between 3D object representations and textual instructions, and often lack explicit spatial or semantic constraints, leading to physically invalid or semantically inconsistent grasps. In this work, we present AffordGrasp, a diffusion-based framework that produces physically stable and semantically faithful human grasps with high precision. We first introduce a scalable annotation pipeline that automatically enriches hand-object interaction datasets with fine-grained structured language labels capturing interaction intent. Building upon these annotations, AffordGrasp integrates an affordance-aware latent…
