AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

Zhaofeng Hu; Sifan Zhou; Qinbo Zhang; Rongtao Xu; Qi Su; Ci-Jyun Liang

arXiv:2604.10432 · cs.RO · April 15, 2026

TL;DR

AnySlot introduces a hierarchical framework that converts language instructions into explicit visual goals for precise, zero-shot slot-level object placement, improving robustness and accuracy in robotic manipulation.

Contribution

The paper presents AnySlot, a hierarchical approach built around explicit visual goals, and introduces SlotBench, a new benchmark for evaluating precise spatial reasoning in robotic tasks.

Findings

01. AnySlot outperforms flat VLA baselines in zero-shot slot placement.

02. Explicit visual goals improve spatial robustness and semantic accuracy.

03. SlotBench provides a comprehensive evaluation platform for structured spatial reasoning.

Abstract

Vision-Language-Action (VLA) policies have emerged as a versatile paradigm for generalist robotic manipulation. However, precise object placement under compositional language instructions remains a major challenge for modern monolithic VLA policies. Slot-level tasks require both reliable slot grounding and sub-centimeter execution accuracy. To this end, we propose AnySlot, a framework that reduces compositional complexity by introducing an explicit spatial visual goal as an intermediate representation between language grounding and control. AnySlot turns language into an explicit visual goal by generating a scene marker, then executes this goal with a goal-conditioned VLA policy. This hierarchical design effectively decouples high-level slot selection from low-level execution, ensuring both semantic accuracy and spatial robustness. Furthermore, recognizing the lack of existing…
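
The two-stage decoupling described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the slot names, the substring-based `select_slot`, the single-pixel marker, and the grid-stepping `policy_step` are all hypothetical stand-ins chosen only to show that the low-level stage reads the marked image and never the instruction.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int]      # (row, col)
Image = List[List[int]]      # toy grayscale image as nested lists
MARKER = 255                 # hypothetical marker intensity


@dataclass
class Slot:
    name: str
    center: Pixel


def select_slot(instruction: str, slots: List[Slot]) -> Slot:
    """Hypothetical high-level stage: pick the slot named in the instruction."""
    for slot in slots:
        if slot.name in instruction:
            return slot
    raise ValueError("no slot matches the instruction")


def draw_marker(image: Image, center: Pixel) -> Image:
    """Return a copy of the image with a one-pixel marker at the slot center."""
    marked = [row[:] for row in image]
    marked[center[0]][center[1]] = MARKER
    return marked


def find_marker(marked: Image) -> Pixel:
    """Locate the marker pixel in the marked image."""
    for r, row in enumerate(marked):
        for c, px in enumerate(row):
            if px == MARKER:
                return (r, c)
    raise ValueError("no marker in image")


def policy_step(marked: Image, gripper: Pixel) -> Pixel:
    """Toy goal-conditioned stage: move at most one pixel per axis toward the marker."""
    goal = find_marker(marked)
    dr = max(-1, min(1, goal[0] - gripper[0]))
    dc = max(-1, min(1, goal[1] - gripper[1]))
    return (dr, dc)


def place(instruction: str, image: Image, slots: List[Slot],
          gripper: Pixel, max_steps: int = 50) -> Pixel:
    """Language -> visual goal -> goal-conditioned execution."""
    marked = draw_marker(image, select_slot(instruction, slots).center)
    for _ in range(max_steps):
        dr, dc = policy_step(marked, gripper)  # the instruction is not passed in
        if (dr, dc) == (0, 0):
            break
        gripper = (gripper[0] + dr, gripper[1] + dc)
    return gripper


image = [[0] * 5 for _ in range(4)]
slots = [Slot("A1", (0, 1)), Slot("B2", (2, 3))]
final = place("put the cube in slot B2", image, slots, gripper=(0, 0))
```

In this sketch, swapping `select_slot` for a different grounding method would leave `policy_step` untouched, which is the kind of separation between slot selection and execution that the abstract attributes to the hierarchical design.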
