Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov

TL;DR
This paper investigates how fine-tuning Vision-Language-Action (VLA) models affects their visual representations, showing that naive action fine-tuning degrades them and proposing alignment strategies to improve out-of-distribution (OOD) generalization.
Contribution
It systematically analyzes representation retention during VLA fine-tuning and introduces an effective alignment method to mitigate visual degradation and enhance OOD performance.
Findings
Naive fine-tuning degrades visual representations.
Alignment strategies can recover VL capabilities.
Proposed method improves OOD generalization.
Abstract
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA's hidden representations and analyze attention maps. Further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action…
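The probing idea mentioned in the abstract can be illustrated with a generic linear probe: freeze the features, fit a linear classifier on top, and compare held-out accuracy between two feature sets. The sketch below is not the authors' protocol. It uses synthetic features in place of real VLM or VLA hidden states, models "degradation" simply as added noise, and all names (`fit_linear_probe`, `probe_accuracy`, `make_synthetic_features`, `vlm_like`, `degraded_like`) are hypothetical.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes, l2=1e-3):
    """Fit a ridge-regression probe onto one-hot labels (closed form)."""
    # Append a constant column so the probe learns a bias term.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    Y = np.eye(num_classes)[labels]
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)


def probe_accuracy(weights, features, labels):
    """Fraction of examples whose highest probe score matches the label."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return float(np.mean(np.argmax(X @ weights, axis=1) == labels))


def make_synthetic_features(rng, labels, class_means, noise_std):
    """Assumption: stand-in for frozen hidden states, class mean plus noise."""
    noise = rng.standard_normal((labels.size, class_means.shape[1]))
    return class_means[labels] + noise_std * noise


rng = np.random.default_rng(0)
num_classes, dim = 4, 16
class_means = rng.standard_normal((num_classes, dim))
train_labels = rng.integers(0, num_classes, size=400)
test_labels = rng.integers(0, num_classes, size=200)

# Two hypothetical feature sets: one clean, one with heavy noise that
# mimics representations carrying less class information.
results = {}
for name, noise_std in [("vlm_like", 0.5), ("degraded_like", 5.0)]:
    train_x = make_synthetic_features(rng, train_labels, class_means, noise_std)
    test_x = make_synthetic_features(rng, test_labels, class_means, noise_std)
    weights = fit_linear_probe(train_x, train_labels, num_classes)
    results[name] = probe_accuracy(weights, test_x, test_labels)

print(results)
```

In a real study the synthetic arrays would be replaced by hidden states extracted from the same layer of the VLM and its fine-tuned VLA counterpart; a drop in probe accuracy would then suggest that the probed information has become less linearly accessible.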
