LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

Chanyoung Kim; Minwoo Kim; Minseok Kang; Hyunwoo Kim; Dahuin Jung

arXiv:2603.28301 · cs.LG · March 31, 2026

TL;DR

LIBERO-Para is a benchmark for evaluating how robust vision-language-action (VLA) models are to paraphrased instructions. It reveals large performance drops, driven mainly by object-level lexical variation, with most failures arising at the planning level rather than during execution.

Contribution

The paper introduces LIBERO-Para, a controlled benchmark, together with a new metric, PRIDE, to analyze and quantify paraphrase robustness in VLA models. The results highlight the models' reliance on surface-level cues.

Findings

01

Models show a performance drop of 22-52 percentage points under paraphrasing.

02

Object-level lexical variation causes large performance drops, even with simple synonym substitutions.

03

Most failures (80-96%) are due to planning-level divergence, not execution errors.
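
The drops above are reported in percentage points (pp), which is an absolute difference between two success rates and not a relative change. A minimal sketch of the distinction follows; the function names and the example success rates are hypothetical and are not taken from the paper.

```python
def pp_drop(original_sr: float, paraphrased_sr: float) -> float:
    """Absolute drop in success rate, in percentage points."""
    return original_sr - paraphrased_sr


def relative_drop(original_sr: float, paraphrased_sr: float) -> float:
    """Drop expressed as a percentage of the original success rate."""
    if original_sr == 0:
        raise ValueError("original success rate must be non-zero")
    return 100.0 * (original_sr - paraphrased_sr) / original_sr


# Illustrative values only (NOT results from the paper):
# success rates in percent on original vs. paraphrased instructions.
original_sr = 90.0
paraphrased_sr = 60.0

print(f"{pp_drop(original_sr, paraphrased_sr):.1f} pp")              # 30.0 pp
print(f"{relative_drop(original_sr, paraphrased_sr):.1f}% relative")  # 33.3% relative
```

With these made-up numbers, the same degradation reads as 30 pp but about 33% in relative terms, which is why the unit matters when comparing models with different baseline success rates.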

Abstract

Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of…

Code & Models

Repositories

cau-hai-lab/LIBERO-Para
github

Datasets

HAI-Lab/LIBERO-Para
dataset · 135 dl
