HSC-VLA: Hierarchical Scene-Clearing for Robust Bimanual Manipulation in Dense Clutter
Zhen Liu, Xinyu Ning, Zhe Hu, XinXin Xie, Yitong Liu, and Zhongzhu Pu

TL;DR
HSC-VLA introduces a hierarchical approach to dense-clutter manipulation that decouples reasoning from execution, significantly improving success rates in complex, cluttered environments.
Contribution
The paper presents HSC-VLA, a hierarchical framework that improves bimanual manipulation in cluttered scenes through scene clearing and task decomposition, and that outperforms monolithic models.
Findings
Achieves 86.7% success in dense clutter, surpassing the baseline by 52.4%.
Demonstrates strong performance on long-horizon tasks like sorting and restocking.
Exhibits robustness and effective failure recovery in complex environments.
Abstract
Modern Vision-Language-Action models often suffer from critical instruction-following failures in high-density manipulation environments, where task-irrelevant visual clutter dilutes attention, corrupts grounding, and substantially degrades performance in complex long-horizon scenarios. To overcome the representation bottleneck of monolithic end-to-end architectures, we propose HSC-VLA, a hierarchical framework that decouples high-level visual-semantic reasoning from low-level, high-frequency sensorimotor execution through an explicit scene-clearing abstraction. HSC-VLA employs a high-level Brain to decompose long-horizon tasks and to generate task-specific scene masks that preserve task-relevant geometry while suppressing distractors. The filtered observations are then passed to a low-level Cerebellum, a diffusion-based policy that performs bimanual manipulation using only…
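To make the scene-clearing idea concrete, here is a minimal sketch of how a task-specific mask could filter an observation before it reaches a low-level policy. It is an illustration under stated assumptions, not the authors' implementation: the function name apply_scene_mask, the boolean mask format, and the choice to blank suppressed pixels with a constant fill color are all hypothetical, and plain Python lists stand in for image arrays to keep the example dependency-free.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # one RGB pixel
Image = List[List[Pixel]]      # rows of pixels
Mask = List[List[bool]]        # True = task-relevant, keep this pixel


def apply_scene_mask(image: Image, mask: Mask, fill: Pixel = (0, 0, 0)) -> Image:
    """Keep pixels where the mask is True and replace all others with `fill`.

    Hypothetical helper: the abstract only says the mask preserves
    task-relevant geometry and suppresses distractors; constant-color
    filling is an assumption made for this sketch.
    """
    if len(image) != len(mask) or any(
        len(img_row) != len(mask_row) for img_row, mask_row in zip(image, mask)
    ):
        raise ValueError("mask must have the same height and width as the image")
    return [
        [pixel if keep else fill for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Toy 3x3 observation: the red channel encodes the pixel position.
image = [[(row * 10 + col, 0, 0) for col in range(3)] for row in range(3)]

# Assumed mask for one subtask: only two pixels in the middle row are relevant.
mask = [
    [False, False, False],
    [False, True, True],
    [False, False, False],
]

filtered = apply_scene_mask(image, mask)
```

In this toy example the two masked-in pixels pass through unchanged and every other pixel is blanked, so a downstream policy would see only the region marked as relevant for the current subtask.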
Taxonomy
Topics: Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis · Advanced Memory and Neural Computing
