A Rapid Deployment Pipeline for Autonomous Humanoid Grasping Based on Foundation Models
Yifei Yan, Yankai Liao, Linqi Ye

TL;DR
This paper introduces an integrated foundation-model pipeline for humanoid robot grasping that cuts the time to onboard a new object from one to two days to about 30 minutes, while reporting high detection accuracy and millimeter-level pose tracking precision.
Contribution
The authors present an end-to-end pipeline that combines foundation models for automatic annotation, 3D reconstruction, and zero-shot pose tracking to enable rapid object manipulation deployment.
Findings
Detection accuracy of mAP@0.5 = 0.995
Pose tracking precision of σ < 1.05 mm
Successful grasping at five workspace positions
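The summary does not say how the pose tracking precision σ was measured. One common reading is the per-axis standard deviation of the estimated object position while the object is held stationary. The sketch below illustrates that reading only; the function name `translation_jitter_mm` and the sample values are hypothetical and are not taken from the paper.

```python
import statistics

def translation_jitter_mm(positions_m):
    """Per-axis sample standard deviation, in millimeters, of tracked positions.

    positions_m: sequence of (x, y, z) position estimates in meters, recorded
    while the object is stationary. Assumed protocol, not the authors' method.
    """
    axes = zip(*positions_m)  # regroup samples as (all x), (all y), (all z)
    return tuple(statistics.stdev(axis) * 1000.0 for axis in axes)

# Made-up samples for illustration: only the x estimate varies between frames.
samples = [(0.000, 0.0, 0.5), (0.002, 0.0, 0.5), (0.001, 0.0, 0.5)]
sigma_x, sigma_y, sigma_z = translation_jitter_mm(samples)
```

Under this reading, a result such as σ < 1.05 mm would mean that each axis of the position estimate stays within roughly a millimeter of its mean over the recording.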
Abstract
Deploying a humanoid robot to manipulate a new object has traditionally required one to two days of effort: data collection, manual annotation, 3D model acquisition, and model training. This paper presents an end-to-end rapid deployment pipeline that integrates three foundation-model components to shorten the onboarding cycle for a new object to approximately 30 minutes: (i) Roboflow-based automatic annotation to assist in training a YOLOv8 object detector; (ii) 3D reconstruction based on Meta SAM 3D, which eliminates the need for a dedicated laser scanner; and (iii) zero-shot 6-DoF pose tracking based on FoundationPose, using the SAM 3D-generated mesh directly as the template. The estimated pose drives a Unity-based inverse kinematics planner, whose joint commands are streamed via UDP to a Unitree G1 humanoid and executed through the Unitree SDK. We demonstrate detection accuracy of…
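The abstract states that joint commands are streamed over UDP from the planner to the robot but gives no wire format. As a minimal sketch of what such a link could look like, the Python below packs a sequence number and a list of joint angles into one datagram and decodes it again. The packet layout (a little-endian uint32 followed by float32 angles in radians) and all function names are assumptions made for illustration; they are not the authors' protocol and not part of the Unitree SDK.

```python
import socket
import struct

def pack_joint_command(seq, joint_angles):
    """Encode a sequence number and joint angles (radians) as one datagram.

    Hypothetical layout: little-endian uint32 sequence number, then one
    float32 per joint.
    """
    return struct.pack(f"<I{len(joint_angles)}f", seq, *joint_angles)

def unpack_joint_command(payload):
    """Decode a datagram produced by pack_joint_command."""
    n_joints = (len(payload) - 4) // 4  # 4-byte header, 4 bytes per angle
    seq, *angles = struct.unpack(f"<I{n_joints}f", payload)
    return seq, angles

def send_joint_command(sock, addr, seq, joint_angles):
    """Send one command; UDP gives no delivery or ordering guarantee."""
    sock.sendto(pack_joint_command(seq, joint_angles), addr)

def loopback_demo():
    """Send one command to a local socket and return what was received."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))  # let the OS choose a free port
    receiver.settimeout(1.0)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        send_joint_command(sender, receiver.getsockname(), 1, [0.0, 0.5, -0.25])
        payload, _ = receiver.recvfrom(1024)
        return unpack_joint_command(payload)
    finally:
        sender.close()
        receiver.close()
```

Calling `loopback_demo()` exercises the round trip on a single machine. The sequence number is included because UDP may drop or reorder datagrams, so a receiver can discard commands that arrive out of date.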
