VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model

Hanqing Wang; Mingyu Liu; Xiaoyu Chen; Chengwei MA; Yiming Zhong; Wenti Yin; Yuhao Liu; Zhiqing Cui; Jiahao Yuan; Lu Dai; Zhiyuan Ma; Hui Xiong

arXiv:2602.09638 · cs.CV · February 11, 2026

Open Access

TL;DR

This paper introduces VideoAfford, a multimodal large language model framework that leverages a new large-scale video dataset to improve 3D affordance grounding by incorporating dynamic interaction and spatial reasoning.

Contribution

It presents a novel video-based 3D affordance dataset and a multimodal model that integrates dynamic interaction priors and spatial-aware learning for enhanced affordance understanding.

Findings

01. Outperforms existing methods in 3D affordance grounding tasks.

02. Demonstrates strong generalization to open-world scenarios.

03. Effectively reasons about affordances using dynamic and spatial cues.

Abstract

3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. Previous research has primarily focused on learning affordance knowledge from static cues such as language and images, which struggle to provide sufficient dynamic interaction context that can reveal temporal and causal cues. To address this limitation, we collect a comprehensive video-based 3D affordance dataset, VIDA, which contains 38K human-object-interaction videos covering 16 affordance types, 38 object categories, and 22K point clouds. Based on VIDA, we propose a strong baseline, VideoAfford, which equips multimodal large language models with additional affordance segmentation capabilities, enabling both world knowledge reasoning and fine-grained affordance grounding within a unified framework. To enhance action understanding capability,…
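
To make the task in the abstract concrete, here is a minimal sketch of what "highlighting actionable regions" on a point cloud can look like. It assumes a common representation in which a model assigns each point a score in [0, 1] for a given affordance. The abstract does not specify VideoAfford's output format, so the function names, the 0.5 threshold, and the toy data are all illustrative assumptions, not the authors' method.

```python
from typing import List, Sequence


def affordance_mask(scores: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Turn per-point affordance scores into a binary mask (hypothetical helper)."""
    return [s >= threshold for s in scores]


def mask_iou(pred: Sequence[bool], target: Sequence[bool]) -> float:
    """Intersection-over-union between two per-point masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty masks agree perfectly, so define their IoU as 1.0.
    return inter / union if union else 1.0


# Toy point cloud: four (x, y, z) points, e.g. sampled along a mug handle.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.1), (0.3, 0.1, 0.1)]

# Made-up scores for a "grasp" affordance, one per point.
scores = [0.9, 0.7, 0.2, 0.1]

# Made-up annotation: only the first point is truly graspable.
ground_truth = [True, False, False, False]

pred = affordance_mask(scores)
iou = mask_iou(pred, ground_truth)
print(pred, iou)
```

The per-point mask is what would be overlaid on the object to show where an interaction can happen; IoU against an annotated mask is one simple way to score such a prediction.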

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition