GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation

Abhay Deshpande; Yuquan Deng; Arijit Ray; Jordi Salvador; Winson Han; Jiafei Duan; Kuo-Hao Zeng; Yuke Zhu; Ranjay Krishna; Rose Hendrix

arXiv:2505.13441 · cs.RO · September 16, 2025

Open Access 1 Models 1 Datasets

TL;DR

GraspMolmo is a novel model for task-oriented robotic grasping that leverages a large synthetic dataset to generalize across diverse instructions and objects, achieving state-of-the-art results in complex real-world tasks.

Contribution

The paper introduces GraspMolmo, a new model trained on PRISM, a large-scale synthetic dataset, enabling generalizable, open-vocabulary, task-oriented grasping in cluttered environments.

Findings

01. Achieves 70% prediction success on complex real-world tasks.

02. Outperforms previous methods, which reach a 35% success rate on the same evaluation.

03. Demonstrates zero-shot semantic grasping capabilities.

Abstract

We present GraspMolmo, a generalizable open-vocabulary task-oriented grasping (TOG) model. GraspMolmo predicts semantically appropriate, stable grasps conditioned on a natural language instruction and a single RGB-D frame. For instance, given "pour me some tea", GraspMolmo selects a grasp on a teapot handle rather than its body. Unlike prior TOG methods, which are limited by small datasets, simplistic language, and uncluttered scenes, GraspMolmo learns from PRISM, a novel large-scale synthetic dataset of 379k samples featuring cluttered environments and diverse, realistic task descriptions. We fine-tune the Molmo visual-language model on this data, enabling GraspMolmo to generalize to novel open-vocabulary instructions and objects. In challenging real-world evaluations, GraspMolmo achieves state-of-the-art results, with a 70% prediction success on complex tasks, compared to the 35%…
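
The teapot example can be made concrete with a small sketch. The abstract does not say how the model's output is represented, so the code below rests on stated assumptions: an instruction-conditioned model returns a 2D image point, candidate grasps come from a separate grasp generator, and the chosen grasp is the candidate whose pinhole projection lies closest to that point. The names `Grasp`, `project_to_pixel`, and `select_grasp`, and the camera intrinsics, are hypothetical and are not taken from the GraspMolmo code or the PRISM dataset.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Grasp:
    """Hypothetical candidate grasp; position is (x, y, z) in the camera frame, in metres."""
    name: str
    position: tuple


def project_to_pixel(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to (u, v) pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


def select_grasp(candidates, target_pixel, fx, fy, cx, cy):
    """Return the candidate whose projected position is nearest to target_pixel.

    Assumption: target_pixel stands in for the output of an
    instruction-conditioned model; here it is simply passed in.
    """
    if not candidates:
        raise ValueError("at least one candidate grasp is required")

    def pixel_distance(grasp):
        u, v = project_to_pixel(grasp.position, fx, fy, cx, cy)
        return hypot(u - target_pixel[0], v - target_pixel[1])

    return min(candidates, key=pixel_distance)


# Illustrative scene: two candidate grasps on a teapot, 0.5 m from the camera.
candidates = [
    Grasp("body", (0.00, 0.0, 0.5)),
    Grasp("handle", (0.10, 0.0, 0.5)),
]
# Made-up intrinsics and a made-up predicted point for "pour me some tea".
chosen = select_grasp(candidates, target_pixel=(415.0, 238.0),
                      fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(chosen.name)
```

Matching in image space keeps the language-conditioned step separate from grasp stability, which in this sketch is left entirely to whatever produces the candidates.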

Code & Models

Models

allenai/GraspMolmo
model· 171 dl· ♡ 10

Datasets

allenai/PRISM
dataset· 63 dl

Taxonomy

Topics: Robot Manipulation and Learning · Motor Control and Adaptation · Action Observation and Synchronization