TL;DR
VLA-GSE is a parameter-efficient fine-tuning framework for vision-language-action models that improves adaptation to robotic control while preserving pre-trained knowledge. It updates only 2.51% of parameters and reaches 81.2% average zero-shot success on LIBERO-Plus.
Contribution
The paper proposes a spectral-decomposition-based expert routing method that improves the adaptation capacity of VLA models under a fixed parameter budget.
Findings
VLA-GSE updates only 2.51% of parameters and outperforms full fine-tuning (FFT) and PEFT baselines.
It achieves 81.2% average zero-shot success on LIBERO-Plus.
It preserves pre-trained VLM capabilities at a level comparable to LoRA.
Abstract
Vision-language-action (VLA) models inherit rich visual-semantic priors from pre-trained vision-language backbones, but adapting them to robotic control remains challenging. Full fine-tuning (FFT) is prone to overfitting on downstream robotic data and catastrophic forgetting of pre-trained vision-language capabilities. Parameter-efficient fine-tuning (PEFT) better preserves pre-trained knowledge, yet existing PEFT methods still struggle to adapt effectively to robot control tasks. To address this gap, we propose VLA-GSE, a parameter-efficient VLA fine-tuning framework that improves control adaptation while retaining PEFT's knowledge preservation advantage. Specifically, VLA-GSE (Generalized and Specialized Experts) is initialized by spectrally decomposing the frozen backbone, assigning leading singular components to generalized experts (shared experts) and disjoint residual components to…
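The spectral initialization described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `split_spectral_experts`, its parameters, and the choice to hand each non-shared expert a contiguous block of the remaining singular components are all assumptions made for illustration. The abstract excerpt is cut off before it says how the residual components are used, and routing between experts is not covered here.

```python
import numpy as np


def split_spectral_experts(weight, shared_rank, num_other, other_rank):
    """Split a frozen weight matrix into low-rank factor pairs via SVD.

    Hypothetical helper: the leading `shared_rank` singular components go
    to one shared (generalized) expert; the next components are cut into
    `num_other` disjoint blocks of `other_rank` each (an assumed grouping).
    Each expert is returned as (B, A) with B @ A equal to its spectral slice.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    needed = shared_rank + num_other * other_rank
    if needed > s.shape[0]:
        raise ValueError("requested ranks exceed the number of singular values")

    def factors(idx):
        # Split each singular value evenly between the two factors.
        root = np.sqrt(s[idx])
        b = u[:, idx] * root           # shape (out_dim, r)
        a = root[:, None] * vt[idx]    # shape (r, in_dim)
        return b, a

    shared = factors(np.arange(shared_rank))
    others = []
    for e in range(num_other):
        start = shared_rank + e * other_rank
        others.append(factors(np.arange(start, start + other_rank)))
    return shared, others
```

Because every expert draws on a different set of singular vectors, the experts start in mutually orthogonal subspaces, and when the requested ranks cover all singular values the expert products sum back to the original weight.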
