Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping
Sang Min Kim, Hyeongjun Heo, Junho Kim, Yonghyeon Lee, Young Min Kim

TL;DR
Point2Act introduces an efficient method for zero-shot, context-aware 3D action localization in robotics by leveraging multimodal large language models and multi-view aggregation, enabling rapid and precise physical actions.
Contribution
It presents a novel 3D relevancy field approach that bypasses high-dimensional features, improving localization accuracy for robotic grasping tasks in unseen environments.
Findings
Achieves scene understanding and action localization in under 20 seconds.
Effectively compensates for occlusion and semantic uncertainties.
Enables precise 3D action point retrieval for manipulation tasks.
Abstract
We propose Point2Act, which directly retrieves the 3D action point relevant to a contextually described task, leveraging Multimodal Large Language Models (MLLMs). Foundation models have opened the possibility of generalist robots that can perform zero-shot tasks following natural language descriptions in unseen environments. While the semantics obtained from large-scale image and language datasets provide contextual understanding in 2D images, the rich yet nuanced features yield blurry 2D regions and struggle to find precise 3D locations for actions. Our proposed 3D relevancy fields bypass the high-dimensional features and instead efficiently incorporate lightweight 2D point-level guidance tailored to the task-specific action. The multi-view aggregation effectively compensates for misalignments due to geometric ambiguities, such as occlusion, or semantic uncertainties inherent in the…
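To make the multi-view aggregation idea concrete, here is a minimal sketch of how per-view 2D points could be lifted to 3D and fused so that a view corrupted by occlusion is outvoted. This is not the authors' 3D relevancy field. It assumes a pinhole camera, known camera-to-world poses, and a metric depth value at each 2D point, none of which the abstract specifies. The function names `backproject` and `aggregate` and the `inlier_radius` parameter are hypothetical.

```python
from statistics import median


def backproject(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with metric depth into world coordinates.

    Assumes a pinhole camera (fx, fy, cx, cy) and a camera-to-world
    pose given as a 3x3 rotation R (nested lists) and translation t.
    """
    # Point in the camera frame.
    pc = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)
    # world = R @ pc + t
    return tuple(sum(R[i][j] * pc[j] for j in range(3)) + t[i] for i in range(3))


def aggregate(points, inlier_radius=0.05):
    """Fuse per-view 3D candidates into one point (illustrative only).

    Takes the coordinate-wise median as a robust center, then averages
    the candidates within inlier_radius (in metres) of that center.
    """
    center = tuple(median(p[i] for p in points) for i in range(3))
    inliers = [
        p for p in points
        if sum((p[i] - center[i]) ** 2 for i in range(3)) ** 0.5 <= inlier_radius
    ]
    if not inliers:
        return center
    n = len(inliers)
    return tuple(sum(p[i] for p in inliers) / n for i in range(3))


# Hypothetical example: three views of the same target point.
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
K = dict(fx=100.0, fy=100.0, cx=50.0, cy=50.0)
candidates = [
    backproject(50, 50, 1.0, R=I3, t=(0.0, 0.0, 0.0), **K),  # clean view
    backproject(40, 50, 1.0, R=I3, t=(0.1, 0.0, 0.0), **K),  # shifted camera
    backproject(50, 50, 0.5, R=I3, t=(0.0, 0.0, 0.0), **K),  # occluder in front
]
fused = aggregate(candidates)
```

The third view hits an occluder at half the depth, so its candidate falls outside the inlier radius and the fused point is determined by the two consistent views. A learned relevancy field, as described in the abstract, would replace this hard inlier test with a continuous score over 3D space.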
