CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models

Catherine Glossop; William Chen; Arjun Bhorkar; Dhruv Shah; Sergey Levine

arXiv:2508.13446 · cs.RO · August 20, 2025

TL;DR

This paper introduces a method to enhance vision-language-action models for robots by using counterfactual labels to increase language grounding and task diversity, significantly improving instruction-following performance.

Contribution

The authors propose a novel counterfactual relabeling technique that augments existing datasets, boosting the fine-grained instruction-following ability of VLA models without extra data collection.
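To make the idea of counterfactual relabeling concrete, here is a minimal sketch, not the authors' implementation. It assumes each dataset sample pairs one observation with one instruction and one action, and that some proposer (in practice a vision-language model; here a hard-coded stand-in) can suggest other instruction and action pairs that would be plausible for the same observation. The names `Sample`, `Proposer`, `augment`, and `toy_proposer` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical action format: (linear velocity, angular velocity).
Action = Tuple[float, float]


@dataclass(frozen=True)
class Sample:
    observation: str          # stand-in for an image; a real dataset would hold pixels
    instruction: str
    action: Action
    counterfactual: bool = False


# Hypothetical proposer signature: given an observation and the logged
# instruction, return alternative (instruction, action) pairs for that scene.
Proposer = Callable[[str, str], Sequence[Tuple[str, Action]]]


def augment(dataset: Sequence[Sample], propose: Proposer) -> List[Sample]:
    """Keep every original sample and add counterfactual ones that share its observation."""
    out: List[Sample] = []
    for sample in dataset:
        out.append(sample)
        for instruction, action in propose(sample.observation, sample.instruction):
            # Skip proposals that merely repeat the logged instruction.
            if instruction != sample.instruction:
                out.append(
                    Sample(sample.observation, instruction, action, counterfactual=True)
                )
    return out


def toy_proposer(observation: str, instruction: str) -> Sequence[Tuple[str, Action]]:
    # Assumption: a fixed menu of alternatives replaces a real model query.
    return [
        ("turn left", (0.3, 0.5)),
        ("turn right", (0.3, -0.5)),
        ("go forward", (0.5, 0.0)),
    ]


original = [Sample("hallway_frame_0", "go forward", (0.5, 0.0))]
augmented = augment(original, toy_proposer)
```

After augmentation, the same observation appears with several different instructions, each tied to a different action. A model trained on such data cannot predict the action from the image alone and has to attend to the language, which is the kind of fine-grained task diversity the paper argues existing datasets lack.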

Findings

01. Success rate increased by 27% on navigation tasks.

02. Counterfactual relabeling improves instruction-following performance.

03. Method achieves state-of-the-art results without additional data collection.

Abstract

Generalist robots should be able to understand and follow user instructions, but current vision-language-action (VLA) models struggle with following fine-grained commands despite providing a powerful architecture for mapping open-vocabulary natural language instructions to robot actions. One cause for this is a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method to augment existing robot datasets by leveraging vision language models to create counterfactual labels. Our method improves the language-following capabilities of VLAs by increasing the diversity and granularity of language grounding for robot datasets by generating counterfactual language and actions. We evaluate the resulting model's ability to follow language instructions,…

Code & Models

Datasets

catglossop/CAST-dataset
dataset · 59 downloads