VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection

Haowen Sun; Shaolong Zhang; Mingyang Li; Chengzhong Ma; Xinzhe Chen; Qiongjie Cui; Xingyu Chen; Zeyang Liu; Xuguang Lan

arXiv:2605.01365·cs.CV·May 5, 2026

TL;DR

VoxAfford is a multi-scale voxel-token fusion method for open-vocabulary 3D affordance detection. It injects geometric features into language-based segmentation tokens and reports state-of-the-art results, with approximately 8% higher mIoU than previous methods.

Contribution

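The paper injects multi-scale geometric features from a frozen pre-trained 3D VQVAE encoder into the MLLM's output tokens after they are generated. This improves 3D affordance localization by sidestepping a limitation of autoregressive token generation, which models sequential dependencies rather than spatial neighborhood relations.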

Findings

01. Achieves approximately 8% improvement in mIoU over previous methods.

02. Demonstrates effective zero-shot transfer to novel objects in robot experiments.

03. Outperforms existing methods on open-vocabulary 3D affordance detection tasks.

Abstract

Open-vocabulary 3D affordance detection requires localizing interaction regions on point clouds given novel affordance descriptions. Recent methods extend multimodal large language models (MLLMs) with special output tokens that are decoded into segmentation masks. However, these tokens are produced through autoregressive generation, which models sequential dependencies rather than spatial neighborhood relations, leaving them semantically rich but spatially impoverished for 3D localization. We propose Voxel-enhanced Affordance detection (VoxAfford), which bypasses this bottleneck by injecting multi-scale geometric features from a frozen pre-trained 3D VQVAE encoder into the output tokens after generation. Each output token uses its affordance semantics as a query to retrieve relevant geometric patterns from its paired voxel scale via cross-attention, with a learned compatibility gate…
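
The fusion step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the function names `gated_cross_attention` and `fuse_multi_scale` are hypothetical, the attention uses a single head with identity projections, the gate is taken to be a sigmoid over a linear function of the token and the retrieved features, and the gated result is added residually to the token. The abstract is cut off before it specifies the gate, so its actual form may differ.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_cross_attention(token, voxel_feats, gate_w, gate_b):
    """Fuse one output token with the voxel features of its paired scale.

    token:       list of d floats (the query, carrying affordance semantics)
    voxel_feats: list of n vectors of d floats (keys and values, geometry)
    gate_w:      list of 2*d floats, gate_b: float (hypothetical gate parameters)
    """
    d = len(token)
    # Scaled dot-product attention: the token queries the voxel features.
    scores = [dot(token, v) / math.sqrt(d) for v in voxel_feats]
    weights = softmax(scores)
    retrieved = [
        sum(w * v[i] for w, v in zip(weights, voxel_feats)) for i in range(d)
    ]
    # Assumed gate: sigmoid of a linear map over [token; retrieved].
    gate = 1.0 / (1.0 + math.exp(-(dot(gate_w, token + retrieved) + gate_b)))
    # Assumed residual update: the gate scales how much geometry is injected.
    fused = [t + gate * r for t, r in zip(token, retrieved)]
    return fused, gate, weights


def fuse_multi_scale(tokens, voxel_scales, gate_params):
    """Pair the i-th output token with the i-th voxel scale and fuse each pair."""
    return [
        gated_cross_attention(tok, feats, w, b)[0]
        for tok, feats, (w, b) in zip(tokens, voxel_scales, gate_params)
    ]
```

In this sketch a gate close to 0 leaves the token unchanged, while a gate close to 1 adds the full retrieved geometry, so the token's language semantics are kept and geometry is mixed in only to the degree the gate allows.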
