Adaptive Capacity Allocation for Vision Language Action Fine-tuning
Donghoon Kim, Minji Bae, Unghui Nam, Gyeonghun Kim, Suyun Lee, Kyuhong Shim, Byonghyo Shim

TL;DR
This paper introduces LoRA-SP, a rank-adaptive fine-tuning method for vision language action models that dynamically allocates capacity, leading to improved generalization and efficiency in robotic manipulation tasks.
Contribution
LoRA-SP adaptively allocates capacity during fine-tuning using an energy-based selection, outperforming fixed-rank methods in robotic vision language models.
Findings
LoRA-SP matches or exceeds full fine-tuning performance with fewer parameters.
It improves multi-task success rates by up to 31.6%.
The method is robust to rank choice and reduces cross-task interference.
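To put "fewer parameters" in perspective, the arithmetic below compares the trainable-parameter count of a plain fixed-rank LoRA update with that of full fine-tuning for a single weight matrix. The layer size and the two ranks are hypothetical values chosen for illustration, not figures from the paper, and the count ignores any extra router parameters LoRA-SP itself would add.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # A rank-r LoRA update B @ A has A of shape (r, d_in) and B of shape (d_out, r).
    return r * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    # Full fine-tuning updates every entry of the (d_out, d_in) weight matrix.
    return d_in * d_out


# Hypothetical 4096 x 4096 projection layer (an assumption, not from the paper).
d = 4096
for r in (8, 128):
    share = lora_params(d, d, r) / full_params(d, d)
    print(f"rank {r}: {lora_params(d, d, r)} trainable params, {share:.4%} of full")
```

Even a fairly large rank stays well below the full-matrix count, which is why a method that needs higher effective rank can still remain parameter-efficient.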
Abstract
Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob, the rank, does not transfer uniformly: robotics transfer exhibits a higher and task-varying intrinsic rank than language fine-tuning. Small ranks suffice for LLMs, while spectral analyses indicate VLAs may require much larger ranks or near-full rank, a mismatch that worsens in multi-task settings. We present LoRA-SP (Select-Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular…
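The abstract is truncated here, so the following is only a minimal sketch of what an SVD-style, router-gated update with energy-based selection could look like, not the authors' implementation. It assumes the update has the form U diag(s(x)) Vᵀ x, that the router is a single linear map followed by softplus to keep scores nonnegative, and that selection keeps the fewest components whose squared scores reach a fraction `eta` of the total. The names `W_r`, `eta`, `energy_mask`, and `adaptive_update` are all hypothetical.

```python
import numpy as np


def softplus(z):
    # Numerically stable softplus; outputs are strictly positive.
    return np.logaddexp(0.0, z)


def energy_mask(scores, eta=0.9):
    """Keep the fewest components whose squared scores reach a
    fraction eta of the total energy (assumed selection rule)."""
    energy = scores ** 2
    order = np.argsort(-energy)              # largest energy first
    cum = np.cumsum(energy[order])
    k = int(np.searchsorted(cum, eta * cum[-1])) + 1
    k = min(k, scores.size)
    mask = np.zeros_like(scores)
    mask[order[:k]] = 1.0
    return mask


def adaptive_update(x, U, V, W_r, eta=0.9):
    """Return (delta, mask) where delta = U diag(mask * s) V^T x.

    x: (d_in,), U: (d_out, r), V: (d_in, r), W_r: (r, d_in).
    """
    s = softplus(W_r @ x)                    # input-dependent nonnegative scores
    mask = energy_mask(s, eta)               # active directions for this input
    z = V.T @ x                              # project input onto r directions
    return U @ (mask * s * z), mask
```

Under these assumptions the number of active directions, `mask.sum()`, varies per input and per layer, which is one way to read "input- and layer-wise capacity": a large nominal rank `r` can be provisioned while each input uses only the directions that carry most of the score energy.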
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
