LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models
Chanyoung Kim, Minwoo Kim, Minseok Kang, Hyunwoo Kim, Dahuin Jung

TL;DR
LIBERO-Para is a benchmark for evaluating how robust vision-language-action (VLA) models are to paraphrased instructions. It reveals large performance drops, driven mainly by object-level lexical variation, with most failures arising at the planning level rather than during execution.
Contribution
The paper introduces LIBERO-Para, a controlled benchmark, together with a new metric, PRIDE, to analyze and quantify paraphrase robustness in VLA models, highlighting their reliance on surface-level cues.
Findings
Across seven VLA configurations (0.6B-7.5B), performance drops by 22-52 percentage points under paraphrasing.
Object-level lexical variation causes large performance drops, even under simple synonym substitution.
Most failures (80-96%) are due to planning-level divergence, not execution errors.
Abstract
Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of…
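To make the "pp" unit in the degradation figures concrete, here is a minimal sketch of how a percentage-point drop between original and paraphrased instructions could be computed from per-episode outcomes. It is not the paper's evaluation code and does not implement PRIDE; the function names and the episode outcomes below are hypothetical and purely illustrative.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful episodes, in [0, 1]."""
    if not outcomes:
        raise ValueError("need at least one episode outcome")
    return sum(outcomes) / len(outcomes)


def percentage_point_drop(original: Sequence[bool], paraphrased: Sequence[bool]) -> float:
    """Absolute drop in success rate, expressed in percentage points (pp).

    A fall from 90% to 50% success is a 40 pp drop, regardless of the
    starting level; it is not the same as a 40% relative drop.
    """
    return (success_rate(original) - success_rate(paraphrased)) * 100.0


# Hypothetical outcomes for illustration only (not results from the paper).
original_runs = [True] * 9 + [False] * 1      # 9 of 10 episodes succeed
paraphrased_runs = [True] * 5 + [False] * 5   # 5 of 10 episodes succeed

drop_pp = percentage_point_drop(original_runs, paraphrased_runs)
print(f"drop: {drop_pp:.1f} pp")
```

Reporting the absolute difference in percentage points keeps drops comparable across models whose baseline success rates differ.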
