A Rapid Deployment Pipeline for Autonomous Humanoid Grasping Based on Foundation Models

Yifei Yan; Yankai Liao; Linqi Ye

arXiv:2604.17258 · cs.RO · April 21, 2026

TL;DR

This paper introduces an integrated foundation-model pipeline for humanoid robot grasping that cuts the time to onboard a new object from one to two days to about 30 minutes while keeping detection and pose tracking accurate.

Contribution

The authors present an end-to-end pipeline that combines foundation models for automatic annotation, 3D reconstruction, and zero-shot pose tracking to enable rapid object manipulation deployment.

Findings

01

Detection accuracy of mAP@0.5 = 0.995

02

Pose tracking precision of σ < 1.05 mm

03

Successful grasping at five workspace positions
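
As an illustration of how a precision figure like σ < 1.05 mm could be checked, here is a minimal sketch. It assumes σ means the per-axis sample standard deviation of repeated position estimates for a stationary object; the summary does not say how the authors computed it. The function name and the sample values are hypothetical.

```python
import statistics

def per_axis_sigma_mm(positions_m):
    """Per-axis sample standard deviation, in millimetres.

    positions_m: sequence of (x, y, z) position estimates in metres,
    assumed to come from a tracker observing a stationary object.
    """
    if len(positions_m) < 2:
        raise ValueError("need at least two samples to estimate sigma")
    # Transpose [(x, y, z), ...] into three per-axis sequences.
    axes = zip(*positions_m)
    return tuple(statistics.stdev(axis) * 1000.0 for axis in axes)

# Hypothetical samples: only x jitters, by +/- 2 mm around 0.5 m.
samples = [(0.500, 0.1, 0.3), (0.502, 0.1, 0.3), (0.498, 0.1, 0.3)]
sigma_x, sigma_y, sigma_z = per_axis_sigma_mm(samples)
```

With real tracker output, each component of the result would be compared against the reported 1.05 mm bound.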

Abstract

Deploying a humanoid robot to manipulate a new object has traditionally required one to two days of effort: data collection, manual annotation, 3D model acquisition, and model training. This paper presents an end-to-end rapid deployment pipeline that integrates three foundation-model components to shorten the onboarding cycle for a new object to approximately 30 minutes: (i) Roboflow-based automatic annotation to assist in training a YOLOv8 object detector; (ii) 3D reconstruction based on Meta SAM 3D, which eliminates the need for a dedicated laser scanner; and (iii) zero-shot 6-DoF pose tracking based on FoundationPose, using the SAM 3D-generated mesh directly as the template. The estimated pose drives a Unity-based inverse kinematics planner, whose joint commands are streamed via UDP to a Unitree G1 humanoid and executed through the Unitree SDK. We demonstrate detection accuracy of…
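
The abstract mentions that joint commands are streamed over UDP but does not describe the wire format. The sketch below shows one way such a datagram could be packed and sent. It is written in Python rather than Unity, and the layout is entirely an assumption: a little-endian uint32 sequence number, a uint16 joint count, then float32 joint angles in radians. All function names are hypothetical.

```python
import socket
import struct

# Hypothetical header: uint32 sequence number, uint16 joint count (little-endian).
HEADER = struct.Struct("<IH")

def pack_joint_command(seq, joint_angles_rad):
    """Serialize one joint command into a single UDP payload."""
    n = len(joint_angles_rad)
    return HEADER.pack(seq, n) + struct.pack(f"<{n}f", *joint_angles_rad)

def unpack_joint_command(payload):
    """Inverse of pack_joint_command; returns (seq, [angles])."""
    seq, n = HEADER.unpack_from(payload, 0)
    angles = struct.unpack_from(f"<{n}f", payload, HEADER.size)
    return seq, list(angles)

def send_joint_command(sock, addr, seq, joint_angles_rad):
    """Send one command datagram; UDP gives no delivery or ordering guarantee."""
    sock.sendto(pack_joint_command(seq, joint_angles_rad), addr)
```

A sender would create the socket with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` and call `send_joint_command` once per control tick. Because UDP can drop or reorder packets, the sequence number lets a receiver discard stale commands.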
