VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models
Christos Ziakas, Alessandra Russo

TL;DR
VITA introduces a test-time adaptation method for vision-language models to improve zero-shot value estimation and temporal reasoning in robotic tasks, outperforming existing methods and enabling reward shaping for multi-task policies.
Contribution
VITA is a test-time adaptation approach that improves the zero-shot value functions of vision-language models through sequential updates over a trajectory, addressing the limited temporal reasoning and generalization of frozen pre-trained representations.
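The sequential update can be illustrated with a toy sketch. This is not the authors' implementation: the frozen VLM features are replaced by small hand-written vectors, the meta-learned self-supervised loss described in the abstract is swapped for a plain reconstruction loss, the value is assumed to be a dot product with a goal embedding, and the names `adapt_step`, `ssl_loss`, `value`, and `rollout_values` are hypothetical.

```python
# Toy sketch of sequential test-time adaptation (illustrative assumptions only).
# Stand-ins: plain lists for frozen VLM features, a linear adapter W,
# and a reconstruction loss in place of the paper's meta-learned loss.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ssl_loss(W, x):
    """Stand-in self-supervised loss: squared reconstruction error ||W x - x||^2."""
    return sum((p - xi) ** 2 for p, xi in zip(matvec(W, x), x))


def adapt_step(W, x, lr):
    """Return W after one gradient step on ssl_loss for a single frame x."""
    err = [p - xi for p, xi in zip(matvec(W, x), x)]
    # Gradient of ||W x - x||^2 with respect to W[i][j] is 2 * err[i] * x[j].
    return [[w - lr * 2.0 * e * xj for w, xj in zip(row, x)]
            for row, e in zip(W, err)]


def value(W, x, goal):
    """Assumed value head: dot product of adapted features with the goal embedding."""
    return sum(zi * gi for zi, gi in zip(matvec(W, x), goal))


def rollout_values(frames, goal, lr=0.1):
    """Adapt on each frame in order, then score that frame with the updated adapter."""
    dim = len(goal)
    W = [[0.0] * dim for _ in range(dim)]  # adapter starts at zero
    values = []
    for x in frames:
        W = adapt_step(W, x, lr)   # one test-time update per frame
        values.append(value(W, x, goal))
    return values, W


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]  # toy features drifting toward the goal
    goal = [0.0, 1.0]
    vals, _ = rollout_values(frames, goal)
    print(vals)
```

Because `W` is carried from one frame to the next rather than reset, the estimate for a given frame depends on every earlier frame in the trajectory. In this toy setting that is how the adapter parameters encode history.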
Findings
VITA outperforms state-of-the-art zero-shot methods in robotic manipulation tasks.
It enables reward shaping for offline reinforcement learning, improving multi-task policy performance.
VITA generalizes across diverse environments and embodiments from a single training environment.
Abstract
Vision-Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their frozen pre-trained representations limit generalization and temporal reasoning. We introduce VITA, a zero-shot value function learning method that enhances both capabilities via test-time adaptation. At inference, a lightweight adaptation module is updated via a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. By updating sequentially over a trajectory, VITA encodes history into its parameters, addressing the temporal reasoning limitations. To mitigate shortcut learning, we propose a dissimilarity-based sampling strategy that selects semantically diverse segments of the trajectory during training. In real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution…
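One way to picture dissimilarity-based sampling is greedy farthest-point selection over segment embeddings. The sketch below is an assumption, not the procedure from the paper: the text above does not specify the exact selection rule, so cosine dissimilarity, the greedy loop, and the names `cosine_dissimilarity` and `select_diverse` are all hypothetical choices made for illustration.

```python
import math

# Illustrative sketch only: greedy farthest-point selection is a stand-in
# for the paper's dissimilarity-based sampling, whose details are not given here.

def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select_diverse(embeddings, k, start=0):
    """Pick up to k indices, each time adding the segment least similar to those chosen."""
    chosen = [start]
    while len(chosen) < min(k, len(embeddings)):
        best, best_score = None, -1.0
        for i, emb in enumerate(embeddings):
            if i in chosen:
                continue
            # Score a candidate by its distance to the nearest already-chosen segment.
            score = min(cosine_dissimilarity(emb, embeddings[j]) for j in chosen)
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return sorted(chosen)  # keep temporal order of the selected segments


if __name__ == "__main__":
    # Hypothetical segment embeddings; indices 0 and 1 are near-duplicates.
    segments = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.7, 0.7]]
    print(select_diverse(segments, k=3))
```

With these toy embeddings the near-duplicate segment is skipped in favour of segments that differ more from the ones already selected, which matches the stated goal of training on semantically diverse parts of the trajectory.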
