GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

Mingleyang Li; Yuran Wang; Yue Chen; Tianxing Chen; Jiaqi Liang; Zishun Shen; Haoran Lu; Ruihai Wu; Hao Dong

arXiv:2603.04158·cs.RO·March 5, 2026

Open Access

TL;DR

This paper introduces GarmentPile++, a retrieval pipeline that combines vision-language reasoning with visual affordance perception to pull one garment at a time out of a cluttered pile for home-assistant robotics, supported by segmentation and dual-arm cooperation.

Contribution

The paper presents a retrieval pipeline that pairs the high-level reasoning and planning of VLMs with visual affordance perception for low-level actions, supported by segmentation and a dual-arm framework for complex garment handling.

Findings

01. Effective garment retrieval from cluttered piles is demonstrated in both real-world and simulation environments.

02. Robust segmentation and reasoning improve retrieval accuracy and safety.

03. Dual-arm cooperation handles large or sagging garments efficiently.

Abstract

Garment manipulation has attracted increasing attention due to its critical role in home-assistant robotics. However, the majority of existing garment manipulation works assume an initial state consisting of only one garment, while piled garments are far more common in real-world settings. To bridge this gap, we propose a novel garment retrieval pipeline that can not only follow language instruction to execute safe and clean retrieval but also guarantee exactly one garment is retrieved per attempt, establishing a robust foundation for the execution of downstream tasks (e.g., folding, hanging, wearing). Our pipeline seamlessly integrates vision-language reasoning with visual affordance perception, fully leveraging the high-level reasoning and planning capabilities of VLMs alongside the generalization power of visual affordance for low-level actions. To enhance the VLM's comprehensive…

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis