VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection
Haowen Sun, Shaolong Zhang, Mingyang Li, Chengzhong Ma, Xinzhe Chen, Qiongjie Cui, Xingyu Chen, Zeyang Liu, Xuguang Lan

TL;DR
VoxAfford introduces a multi-scale voxel-token fusion method that enhances open-vocabulary 3D affordance detection by integrating geometric features into language-based segmentation tokens, achieving state-of-the-art results.
Contribution
The paper injects multi-scale geometric features into the MLLM's output tokens after they are generated, improving 3D affordance localization. This sidesteps a limitation of autoregressive generation, which models sequential dependencies rather than spatial neighborhood relations.
Findings
Achieves approximately 8% improvement in mIoU over previous methods.
Demonstrates effective zero-shot transfer to novel objects in robot experiments.
Outperforms existing methods on open-vocabulary 3D affordance detection tasks.
Abstract
Open-vocabulary 3D affordance detection requires localizing interaction regions on point clouds given novel affordance descriptions. Recent methods extend multimodal large language models (MLLMs) with special output tokens that are decoded into segmentation masks. However, these tokens are produced through autoregressive generation, which models sequential dependencies rather than spatial neighborhood relations, leaving them semantically rich but spatially impoverished for 3D localization. We propose Voxel-enhanced Affordance detection (VoxAfford), which bypasses this bottleneck by injecting multi-scale geometric features from a frozen pre-trained 3D VQVAE encoder into the output tokens after generation. Each output token uses its affordance semantics as a query to retrieve relevant geometric patterns from its paired voxel scale via cross-attention, with a learned compatibility gate…
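The token-to-voxel retrieval described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract is truncated before the gate is specified, so the gate here (a scalar sigmoid over the concatenated token and retrieved feature, applied to a residual update) is purely an assumption, as are the single attention head and every name (`gated_voxel_fusion`, `Wq`, `Wk`, `Wv`, `w_gate`, `b_gate`).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_voxel_fusion(token, voxel_feats, Wq, Wk, Wv, w_gate, b_gate=0.0):
    """Fuse one output token with the voxel features of its paired scale.

    token:       (d,)      output-token embedding, used as the query
    voxel_feats: (n, d_v)  features of the n voxels at one scale
    Wq: (d, d_k), Wk: (d_v, d_k), Wv: (d_v, d)  hypothetical projections
    w_gate: (2 * d,), b_gate: float             hypothetical gate parameters
    """
    q = token @ Wq                       # (d_k,) query from affordance semantics
    k = voxel_feats @ Wk                 # (n, d_k) keys from geometry
    v = voxel_feats @ Wv                 # (n, d) values from geometry
    attn = softmax(k @ q / np.sqrt(q.shape[0]))  # (n,) weights over voxels
    retrieved = attn @ v                 # (d,) attention-weighted geometric feature
    # ASSUMPTION: scalar sigmoid gate deciding how much geometry to admit.
    logit = np.concatenate([token, retrieved]) @ w_gate + b_gate
    gate = 1.0 / (1.0 + np.exp(-logit))
    # ASSUMPTION: residual injection, so the token's semantics are preserved.
    return token + gate * retrieved, attn, gate

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, d_v, d_k = 8, 6, 4
    # One token per voxel scale; the grid sizes are illustrative only.
    for n_voxels in (8, 64, 512):
        token = rng.normal(size=d)
        voxels = rng.normal(size=(n_voxels, d_v))
        Wq = rng.normal(size=(d, d_k))
        Wk = rng.normal(size=(d_v, d_k))
        Wv = rng.normal(size=(d_v, d))
        w_gate = rng.normal(size=2 * d)
        fused, attn, gate = gated_voxel_fusion(token, voxels, Wq, Wk, Wv, w_gate)
        print(n_voxels, fused.shape, float(gate))
```

In this sketch a gate near zero leaves the token unchanged, while a gate near one adds the full retrieved geometric feature. Each token attends only to its own scale, mirroring the "paired voxel scale" wording in the abstract.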
