CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models
Catherine Glossop, William Chen, Arjun Bhorkar, Dhruv Shah, Sergey Levine

TL;DR
This paper introduces a method to enhance vision-language-action models for robots by using counterfactual labels to increase language grounding and task diversity, significantly improving instruction-following performance.
Contribution
The authors propose a novel counterfactual relabeling technique that augments existing datasets, boosting the fine-grained instruction-following ability of VLA models without extra data collection.
Findings
Success rate increased by 27% on navigation tasks.
Counterfactual relabeling improves instruction-following performance.
Method is competitive with state-of-the-art approaches without additional data collection.
Abstract
Generalist robots should be able to understand and follow user instructions, but current vision-language-action (VLA) models struggle with following fine-grained commands despite providing a powerful architecture for mapping open-vocabulary natural language instructions to robot actions. One cause for this is a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method to augment existing robot datasets by leveraging vision language models to create counterfactual labels. Our method improves the language-following capabilities of VLAs by increasing the diversity and granularity of language grounding for robot datasets by generating counterfactual language and actions. We evaluate the resulting model's ability to follow language instructions,…
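The abstract does not spell out the relabeling pipeline, so the following is only an illustrative sketch of the general idea: for one observation, pair alternative instructions with matching actions so the dataset has several distinct tasks for the same view. Every name here (`Sample`, `add_counterfactuals`, `toy_propose`, `toy_synthesize`) is hypothetical, the two callables are toy stand-ins for a vision-language model and an action generator, and the `(linear, angular)` action format is an assumption rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Action = Tuple[float, float]  # assumed (linear velocity, angular velocity)


@dataclass(frozen=True)
class Sample:
    observation: str              # stand-in for a camera image (an ID here)
    instruction: str
    actions: Tuple[Action, ...]
    counterfactual: bool = False  # True if the label was generated, not recorded


def add_counterfactuals(
    dataset: Sequence[Sample],
    propose: Callable[[str, str], List[str]],
    synthesize: Callable[[str, str], List[Action]],
) -> List[Sample]:
    """Return the original samples plus generated counterfactual samples."""
    augmented = list(dataset)
    for sample in dataset:
        for instruction in propose(sample.observation, sample.instruction):
            if instruction == sample.instruction:
                continue  # same as the recorded label, so not a counterfactual
            actions = tuple(synthesize(sample.observation, instruction))
            augmented.append(Sample(sample.observation, instruction, actions, True))
    return augmented


# Toy stand-ins so the sketch runs without a real VLM or action generator.
def toy_propose(observation: str, original: str) -> List[str]:
    return ["go forward", "turn left", "turn right"]


def toy_synthesize(observation: str, instruction: str) -> List[Action]:
    angular = {"turn left": 0.5, "turn right": -0.5}.get(instruction, 0.0)
    return [(0.2, angular)] * 3


recorded = [Sample("frame_000", "go forward", ((0.2, 0.0),) * 3)]
augmented = add_counterfactuals(recorded, toy_propose, toy_synthesize)
for s in augmented:
    print(s.instruction, s.actions[0], s.counterfactual)
```

In this toy run, a single recorded sample becomes three samples that share one observation but differ in instruction and action, which is the kind of fine-grained task diversity for similar observations that the abstract says existing datasets lack.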
