Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization

Nikita Kachaev; Mikhail Kolosov; Daniil Zelezetsky; Alexey K. Kovalev; Aleksandr I. Panov

arXiv:2510.25616·cs.LG·October 30, 2025

TL;DR

This paper investigates how fine-tuning Vision-Language-Action models affects their visual representations, revealing degradation issues and proposing alignment strategies to improve out-of-distribution generalization.

Contribution

It systematically analyzes representation retention during VLA fine-tuning and introduces an effective alignment method to mitigate visual degradation and enhance OOD performance.

Findings

01 Naive fine-tuning degrades visual representations.

02 Alignment strategies can recover VL capabilities.

03 Proposed method improves OOD generalization.

Abstract

The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA's hidden representations and analyze attention maps. Further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action…
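
The abstract mentions probing hidden representations to measure how much visual information survives fine-tuning. As a rough illustration of what a linear probe does in general, the sketch below fits a ridge-regression classifier on frozen features and compares held-out accuracy for two feature sets. Everything here is an assumption made for illustration: the features are synthetic, the names `fit_linear_probe` and `probe_accuracy` are hypothetical, and this is not the paper's probing protocol.

```python
import numpy as np

def fit_linear_probe(feats, labels, num_classes, l2=1e-3):
    """Fit a ridge-regression probe from frozen features to one-hot labels."""
    X = np.hstack([feats, np.ones((feats.shape[0], 1))])  # append a bias column
    Y = np.eye(num_classes)[labels]                       # one-hot targets
    A = X.T @ X + l2 * np.eye(X.shape[1])                 # regularized Gram matrix
    return np.linalg.solve(A, X.T @ Y)

def probe_accuracy(W, feats, labels):
    """Held-out accuracy of the probe: argmax over the linear scores."""
    X = np.hstack([feats, np.ones((feats.shape[0], 1))])
    return float(np.mean(np.argmax(X @ W, axis=1) == labels))

# Synthetic stand-ins for hidden states (assumption: real features would be
# extracted from a chosen layer of the VLM and of its fine-tuned VLA).
rng = np.random.default_rng(0)
num_classes, dim, n = 4, 16, 400
means = 3.0 * rng.normal(size=(num_classes, dim))
labels = rng.integers(0, num_classes, size=2 * n)

# "clean" features keep class structure; "degraded" features mostly lose it.
clean = means[labels] + 0.5 * rng.normal(size=(2 * n, dim))
degraded = 0.05 * means[labels] + 3.0 * rng.normal(size=(2 * n, dim))

train, test = slice(0, n), slice(n, 2 * n)
acc_clean = probe_accuracy(
    fit_linear_probe(clean[train], labels[train], num_classes),
    clean[test], labels[test])
acc_degraded = probe_accuracy(
    fit_linear_probe(degraded[train], labels[train], num_classes),
    degraded[test], labels[test])

print(f"probe accuracy, clean features:    {acc_clean:.2f}")
print(f"probe accuracy, degraded features: {acc_degraded:.2f}")
```

Because the probe is linear and the backbone stays frozen, a drop in probe accuracy between two feature sets suggests that the information is less linearly accessible in the second one, which is the kind of comparison a VLM-versus-VLA study could make.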

Code & Models

Models

tttonyalpha/openvla-7b-warmup-checkpoint_lora_002000 (model, 3 downloads)

Datasets

tttonyalpha/openvla_1k-dataset (dataset, 90 downloads)