CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models

Zhi Liu

arXiv:2605.21854·cs.CV·May 22, 2026

TL;DR

CrossVLA is an empirical study of post-training for vision-language-action (VLA) models across the two dominant paradigms, discrete-token autoregression and continuous-action flow matching. It covers preference alignment with DPO and inference latency.

Contribution

The paper presents a surrogate flow-matching log-probability estimator for DPO, compares LoRA and DoRA as the parameter-efficient layer, and analyzes inference bottlenecks. All code is openly available.
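To make the LoRA/DoRA comparison concrete, here is a minimal NumPy sketch of how the two adapters are usually defined in the literature. It is not taken from the paper or its repository; the function names, shapes, and the `scale` and `eps` parameters are assumptions chosen for illustration.

```python
import numpy as np

def lora_weight(W0, A, B, scale=1.0):
    """LoRA: frozen weight plus a low-rank update B @ A."""
    return W0 + scale * (B @ A)

def dora_weight(W0, A, B, m, scale=1.0, eps=1e-8):
    """DoRA: a learned per-column magnitude m times the normalised direction."""
    V = W0 + scale * (B @ A)                             # direction, same update as LoRA
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)  # column-wise L2 norm
    return m * V / (col_norm + eps)

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 4, 2                              # toy sizes (assumed)
W0 = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((rank, d_in))
B = np.zeros((d_out, rank))                              # usual zero init for B
m = np.linalg.norm(W0, axis=0, keepdims=True)            # magnitude init from W0

# At initialisation both adapters leave the pretrained weight unchanged.
print(np.allclose(lora_weight(W0, A, B), W0))
print(np.allclose(dora_weight(W0, A, B, m), W0))
```

The practical difference is that DoRA trains the magnitude `m` separately from the direction, so the low-rank update only has to change where each column points.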

Findings

01

DoRA improves over OpenVLA SFT by a mean +10.4 pp across the LIBERO 4-suite (600 trials, 3 seeds)

02

The inference denoise loop accounts for 78.6% of latency

03

Pretraining on LIBERO frames yields 99.5% recall@1
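As a rough reading aid for the figures above, the sketch below applies Amdahl's law to the 78.6% denoise-loop share and recomputes the mean of the per-suite DoRA gains quoted in the abstract. It assumes the 78.6% is a share of end-to-end latency and that the rest of the pipeline is unaffected by any speedup; the local speedup factors are hypothetical, not results from the paper.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: end-to-end speedup when only `fraction` of the time is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

DENOISE_SHARE = 0.786  # share of latency attributed to the denoise loop (Finding 02)

# Hypothetical speedups of the denoise loop alone.
for s in (2.0, 4.0, 10.0):
    print(f"denoise loop {s:.0f}x faster -> end-to-end {overall_speedup(DENOISE_SHARE, s):.2f}x")

# Upper bound if the denoise loop cost nothing at all.
max_speedup = 1.0 / (1.0 - DENOISE_SHARE)
print(f"upper bound: {max_speedup:.2f}x")

# Unweighted mean of the per-suite DoRA gains quoted in the abstract (pp).
per_suite_gain = {"Object": 20.0, "Long-horizon": 11.0, "Goal": 8.0, "Spatial": 2.7}
mean_gain = sum(per_suite_gain.values()) / len(per_suite_gain)
print(f"mean gain: {mean_gain:.3f} pp")
```

Under these assumptions, even an arbitrarily fast denoise loop caps the end-to-end gain at a little under 5x, and the unweighted per-suite mean is consistent with the reported +10.4 pp given that the per-suite values are themselves rounded.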

Abstract

Vision-Language-Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g. OpenVLA) and continuous-action flow-matching (e.g. pi-0.5). Yet preference alignment via Direct Preference Optimisation (DPO) -- the de-facto post-training step in language models -- has been studied almost exclusively on autoregressive VLAs. We present CrossVLA, an empirical study of cross-paradigm VLA post-training. Three contributions: (i) a surrogate flow-matching log-probability estimator that lets DPO operate on continuous-action backbones without probability-flow ODE integration; (ii) a head-to-head comparison of LoRA and DoRA as the parameter-efficient layer for VLA DPO, finding DoRA improves over OpenVLA SFT by a mean +10.4 pp across LIBERO 4-suite (600 trials, 3 seeds) -- per-suite +20.0 Object, +11.0 Long-horizon, +8.0 Goal, +2.7 Spatial --…
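The abstract does not spell out the form of the surrogate estimator, so the following is only an illustrative sketch of the general idea: use the negative flow-matching error as a stand-in for log-probability (as is commonly done in diffusion-style DPO) and feed it into the standard DPO loss. The linear interpolation path, the function names, and the `n_samples` and `beta` values are all assumptions, not the paper's method.

```python
import numpy as np

def fm_surrogate_logp(velocity_fn, action, obs, rng, n_samples=8):
    """Hypothetical surrogate: negative mean flow-matching error (no ODE integration)."""
    errs = []
    for _ in range(n_samples):
        t = rng.uniform()                        # time in [0, 1)
        noise = rng.standard_normal(action.shape)
        x_t = (1.0 - t) * noise + t * action     # assumed linear path from noise to action
        target = action - noise                  # conditional velocity along that path
        pred = velocity_fn(x_t, t, obs)
        errs.append(np.sum((pred - target) ** 2))
    return -float(np.mean(errs))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss, -log sigmoid(beta * margin), for one preference pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.logaddexp(0.0, -margin))     # numerically stable form

# Toy usage with a zero-velocity placeholder policy (hypothetical).
rng = np.random.default_rng(0)
zero_policy = lambda x_t, t, obs: np.zeros_like(x_t)
preferred, rejected = np.ones(7), -np.ones(7)    # 7-D action vectors (assumed)
lp_w = fm_surrogate_logp(zero_policy, preferred, None, rng)
lp_l = fm_surrogate_logp(zero_policy, rejected, None, rng)
print(dpo_loss(lp_w, lp_l, ref_logp_w=lp_w, ref_logp_l=lp_l))
```

In practice the policy and the frozen reference would typically be evaluated on the same sampled `t` and `noise` so that the margin reflects the models rather than sampling variance.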

Code & Models

Repositories

lz-googlefycy/vla-lab (GitHub)
