Limited Linguistic Diversity in Embodied AI Datasets
Selma Wanna, Agnes Luhtaru, Jonathan Salfity, Ryan Barron, Juston Moore, Cynthia Matuszek, Mitch Pryor

TL;DR
This paper systematically audits popular Vision-Language-Action (VLA) datasets and finds that they contain repetitive, template-like instructions with limited linguistic diversity, which narrows the language signal available for model training and evaluation.
Contribution
It provides a detailed analysis of the linguistic characteristics of VLA datasets, highlighting their lack of diversity and suggesting improvements for dataset design.
Findings
Datasets rely on highly repetitive commands
Limited structural variation in instructions
Narrow distribution of instruction forms
Abstract
Language plays a critical role in Vision-Language-Action (VLA) models, yet the linguistic characteristics of the datasets used to train and evaluate these systems remain poorly documented. In this work, we present a systematic dataset audit of several widely used VLA corpora, aiming to characterize what kinds of instructions these datasets actually contain and how much linguistic variety they provide. We quantify instruction language along complementary dimensions, including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity. Our analysis shows that many datasets rely on highly repetitive, template-like commands with limited structural variation, yielding a narrow distribution of instruction forms. We position these findings as descriptive documentation of the language signal available in current VLA training and evaluation data, intended to support…
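To make two of the audit dimensions named in the abstract concrete (lexical variety and duplication), here is a minimal Python sketch. It is an illustration only, not the authors' pipeline: the function name `lexical_stats`, the normalization (lowercasing and whitespace tokenization), the choice of type-token ratio as the lexical-variety measure, and the sample instructions are all assumptions made for this example.

```python
from collections import Counter


def lexical_stats(instructions):
    """Return simple diversity statistics for a list of instruction strings.

    Hypothetical helper for illustration; the paper's actual metrics may differ.
    """
    # Normalize case and collapse whitespace so trivially different
    # strings count as the same instruction.
    normalized = [" ".join(s.lower().split()) for s in instructions]
    tokens = [tok for s in normalized for tok in s.split()]
    counts = Counter(normalized)
    total = len(normalized)
    return {
        # Distinct words divided by total words (type-token ratio).
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        # Share of instructions that are distinct after normalization.
        "unique_fraction": len(counts) / total if total else 0.0,
        # Share of instructions that repeat an earlier one.
        "duplicate_fraction": (total - len(counts)) / total if total else 0.0,
    }


# Made-up sample instructions, not taken from any real dataset.
sample = [
    "pick up the red block",
    "pick up the blue block",
    "Pick up the red block",
    "open the drawer",
]
stats = lexical_stats(sample)
print(stats)
```

A low type-token ratio together with a high duplicate fraction would be consistent with the template-like commands the abstract describes. Semantic similarity and syntactic complexity would need embedding models and parsers, which this sketch does not cover.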
