Reflective VLM Planning for Dual-Arm Desktop Cleaning: Bridging Open-Vocabulary Perception and Precise Manipulation
Yufan Liu, Yi Wu, Gweneth Ge, Haoliang Cheng, and Rui Liu

TL;DR
This paper introduces a hierarchical dual-arm cleaning system that combines open-vocabulary perception with precise manipulation, reaching 87.2% task completion in simulated desktop cleaning scenarios.
Contribution
It presents a novel framework integrating reflective VLM planning with structured scene memory and dual-arm execution for open-vocabulary and precise cleaning tasks.
Findings
Achieved 87.2% task completion rate in simulations.
Improved task completion by 28.8% over a static VLM baseline and by 36.2% over single-arm baselines.
Enhanced robustness and generalization with structured memory.
Abstract
Desktop cleaning demands open-vocabulary recognition and precise manipulation of heterogeneous debris. We propose a hierarchical framework integrating reflective Vision-Language Model (VLM) planning with dual-arm execution via structured scene representation. Grounded-SAM2 facilitates open-vocabulary detection, while a memory-augmented VLM generates, critiques, and revises manipulation sequences. These sequences are converted into parametric trajectories for five primitives executed by coordinated Franka arms. Evaluated in simulated scenarios, our system achieves 87.2% task completion, a 28.8% improvement over a static VLM and 36.2% over single-arm baselines. Structured memory integration proves crucial for robust, generalizable manipulation while maintaining real-time control performance.
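The generate, critique, and revise cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the primitive names, the `Step` and `SceneMemory` types, the rule-based `critique` function, and the `stub_planner` standing in for the VLM are all hypothetical, chosen only to show how critiques stored in a structured memory might feed back into the next planning round.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Placeholder primitive names (assumption: the abstract says there are five
# primitives but does not name them).
PRIMITIVES = {"pick", "place", "sweep", "wipe", "push"}


@dataclass
class Step:
    primitive: str  # which manipulation primitive to run
    target: str     # label of a detected object
    arm: str        # "left" or "right"


@dataclass
class SceneMemory:
    """Hypothetical structured memory: detected labels plus past critiques."""
    objects: List[str]
    critiques: List[str] = field(default_factory=list)


Planner = Callable[[SceneMemory], List[Step]]


def critique(plan: List[Step], memory: SceneMemory) -> List[str]:
    """Return a list of problems; an empty list means the plan is accepted."""
    issues = []
    for i, step in enumerate(plan):
        if step.primitive not in PRIMITIVES:
            issues.append(f"step {i}: unknown primitive '{step.primitive}'")
        if step.target not in memory.objects:
            issues.append(f"step {i}: target '{step.target}' not in scene")
        if step.arm not in ("left", "right"):
            issues.append(f"step {i}: unknown arm '{step.arm}'")
    return issues


def reflective_plan(planner: Planner, memory: SceneMemory,
                    max_rounds: int = 3) -> Tuple[List[Step], bool]:
    """Generate a plan, critique it, and re-plan until it passes or rounds run out."""
    plan: List[Step] = []
    for _ in range(max_rounds):
        plan = planner(memory)
        issues = critique(plan, memory)
        if not issues:
            return plan, True
        # Store the critique so the next planning round can take it into account.
        memory.critiques.extend(issues)
    return plan, False


def stub_planner(memory: SceneMemory) -> List[Step]:
    """Stand-in for the VLM: its first proposal references an object not in the scene."""
    if not memory.critiques:
        return [Step("pick", "coffee cup", "left")]
    return [Step("pick", memory.objects[0], "left"),
            Step("wipe", memory.objects[1], "right")]


memory = SceneMemory(objects=["crumpled paper", "coffee stain"])
plan, accepted = reflective_plan(stub_planner, memory)
print(accepted, [(s.primitive, s.target, s.arm) for s in plan])
```

In this sketch the first proposal is rejected because its target was never detected, the critique is written to memory, and the second proposal passes. A real system would replace `stub_planner` with a VLM call and `critique` with the model's own reflection step, then map each accepted `Step` to a parametric trajectory.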
