AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement
Zhaofeng Hu, Sifan Zhou, Qinbo Zhang, Rongtao Xu, Qi Su, Ci-Jyun Liang

TL;DR
AnySlot introduces a hierarchical framework that converts language instructions into explicit visual goals for precise, zero-shot slot-level object placement, improving robustness and accuracy in robotic manipulation.
Contribution
The paper presents AnySlot, a novel hierarchical approach built on explicit visual goals, and introduces SlotBench, a new benchmark for evaluating precise spatial reasoning in robotic tasks.
Findings
AnySlot outperforms flat VLA baselines in zero-shot slot placement.
Explicit visual goals improve spatial robustness and semantic accuracy.
SlotBench provides a comprehensive evaluation platform for structured spatial reasoning.
Abstract
Vision-Language-Action (VLA) policies have emerged as a versatile paradigm for generalist robotic manipulation. However, precise object placement under compositional language instructions remains a major challenge for modern monolithic VLA policies. Slot-level tasks require both reliable slot grounding and sub-centimeter execution accuracy. To this end, we propose AnySlot, a framework that reduces compositional complexity by introducing an explicit spatial visual goal as an intermediate representation between language grounding and control. AnySlot turns language into an explicit visual goal by generating a scene marker, then executes this goal with a goal-conditioned VLA policy. This hierarchical design effectively decouples high-level slot selection from low-level execution, ensuring both semantic accuracy and spatial robustness. Furthermore, recognizing the lack of existing…
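The two-stage decoupling described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the grid "image", the keyword-based `ground_slot`, the `draw_marker` helper, and the stub `goal_conditioned_policy` are all hypothetical stand-ins chosen only to show how language is consumed by the high-level stage while the low-level stage sees nothing but the marked image.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Toy stand-in for a camera frame: a grid of grayscale values (assumption).
Image = List[List[int]]
MARKER = 255  # hypothetical pixel value used as the visual goal marker


@dataclass(frozen=True)
class Slot:
    name: str  # hypothetical slot label, e.g. "top-right"
    row: int
    col: int


def ground_slot(instruction: str, slots: List[Slot]) -> Slot:
    """High-level stage (toy): pick the single slot named in the instruction."""
    text = instruction.lower()
    matches = [s for s in slots if s.name in text]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one matching slot, got {len(matches)}")
    return matches[0]


def draw_marker(image: Image, slot: Slot) -> Image:
    """Turn the selected slot into an explicit visual goal on a copy of the image."""
    marked = [row[:] for row in image]  # leave the original frame untouched
    marked[slot.row][slot.col] = MARKER
    return marked


def goal_conditioned_policy(marked_image: Image) -> Tuple[int, int]:
    """Low-level stage (toy): receives only the marked image, never the language."""
    for r, row in enumerate(marked_image):
        for c, value in enumerate(row):
            if value == MARKER:
                return (r, c)  # target cell for the placement
    raise ValueError("no visual goal marker found in the image")


def place(instruction: str, image: Image, slots: List[Slot]) -> Tuple[int, int]:
    """Hierarchical pipeline: language -> visual goal -> goal-conditioned execution."""
    slot = ground_slot(instruction, slots)
    goal_image = draw_marker(image, slot)
    return goal_conditioned_policy(goal_image)


if __name__ == "__main__":
    slots = [Slot("top-left", 0, 0), Slot("top-right", 0, 3), Slot("bottom-left", 2, 0)]
    frame = [[0] * 4 for _ in range(3)]
    print(place("Put the cube in the top-right slot", frame, slots))
```

In this sketch, swapping the grounding stage for a different one would not require touching the policy, which is the kind of separation between slot selection and execution that the abstract attributes to the hierarchical design.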
