VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models

Christos Ziakas; Alessandra Russo

arXiv:2506.10085 · cs.CV · March 3, 2026

TL;DR

VITA is a test-time adaptation method for vision-language models that improves zero-shot value estimation and temporal reasoning in robotic tasks. It outperforms state-of-the-art zero-shot methods and enables reward shaping for multi-task policies.

Contribution

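VITA is a test-time adaptation approach that improves the zero-shot value functions of vision-language models. It updates a lightweight adaptation module sequentially over a trajectory, which addresses the limited temporal reasoning and generalization of frozen pre-trained representations.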

Findings

01. VITA outperforms state-of-the-art zero-shot methods in robotic manipulation tasks.

02. It enables reward shaping for offline reinforcement learning, improving multi-task policy performance.

03. VITA generalizes across diverse environments and embodiments from a single training environment.
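
Finding 02 mentions reward shaping with a learned value function. The summary does not say which shaping rule the paper uses, so the sketch below assumes the standard potential-based form, r'_t = r_t + gamma * V(s_{t+1}) - V(s_t), with the value estimates standing in as the potential. The function name `shape_rewards` and the discount `gamma` are illustrative, not taken from the paper.

```python
from typing import List


def shape_rewards(rewards: List[float], values: List[float], gamma: float = 0.99) -> List[float]:
    """Potential-based shaping (an assumed form, not confirmed by the source).

    rewards[t] is the environment reward for the transition s_t -> s_{t+1};
    values[t] is a value estimate for state s_t, so one extra value is needed
    for the final state.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("expected len(values) == len(rewards) + 1")
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


if __name__ == "__main__":
    # Sparse task reward: only the last transition is rewarded.
    sparse = [0.0, 0.0, 1.0]
    # Hypothetical value estimates that rise as the task progresses.
    estimates = [0.0, 0.5, 0.9, 1.0]
    print(shape_rewards(sparse, estimates, gamma=1.0))
```

With rising value estimates, the shaped rewards give a dense signal on transitions where the sparse reward is zero, which is the usual motivation for shaping in offline data.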

Abstract

Vision-Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their frozen pre-trained representations limit generalization and temporal reasoning. We introduce VITA, a zero-shot value function learning method that enhances both capabilities via test-time adaptation. At inference, a lightweight adaptation module is updated via a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. By updating sequentially over a trajectory, VITA encodes history into its parameters, addressing the temporal reasoning limitations. To mitigate shortcut learning, we propose a dissimilarity-based sampling strategy that selects semantically diverse segments of the trajectory during training. In real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution…
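
The test-time loop described in the abstract can be sketched roughly as follows. This is a toy illustration under several assumptions, not the authors' implementation: frame and goal embeddings are plain lists standing in for frozen VLM features, the adaptation module is a single matrix `W`, the self-supervised loss is a fixed masked-reconstruction loss (the paper's loss is meta-learned), and the value is the cosine similarity between adapted frame and goal embeddings. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def identity(d: int) -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / (norm + 1e-8)


def mask(x: Vector) -> Vector:
    # Assumed corruption: zero out every odd-indexed feature.
    return [v if i % 2 == 0 else 0.0 for i, v in enumerate(x)]


def ssl_loss(W: Matrix, x: Vector) -> float:
    # Stand-in self-supervised loss: reconstruct x from its masked copy.
    return sum((p - t) ** 2 for p, t in zip(matvec(W, mask(x)), x))


def ssl_grad_step(W: Matrix, x: Vector, lr: float) -> Matrix:
    # One gradient step on ssl_loss; dL/dW[i][j] = 2 * err[i] * x_masked[j].
    x_masked = mask(x)
    err = [p - t for p, t in zip(matvec(W, x_masked), x)]
    return [[w - lr * 2.0 * e * xm for w, xm in zip(row, x_masked)]
            for row, e in zip(W, err)]


def estimate_values(frames: List[Vector], goal: Vector, lr: float = 0.05) -> List[float]:
    W = identity(len(goal))
    values = []
    for x in frames:
        # Adapt on the current frame, then predict. W is never reset, so it
        # carries information from earlier frames in the trajectory.
        W = ssl_grad_step(W, x, lr)
        values.append(cosine(matvec(W, x), matvec(W, goal)))
    return values


if __name__ == "__main__":
    trajectory = [[1.0, 0.0, 0.2, 0.1], [0.6, 0.4, 0.3, 0.2], [0.1, 0.9, 0.4, 0.8]]
    goal_embedding = [0.0, 1.0, 0.5, 1.0]
    print(estimate_values(trajectory, goal_embedding))
```

Because `W` persists across frames, the estimate at step t depends on every earlier frame, which is the sense in which sequential updates encode history into the parameters.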
