TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning

Soumya Shamarao Jahagirdar; Edson Araujo; Anna Kukleva; M. Jehanzeb Mirza; Saurabhchand Bhati; Samuel Thomas; Brian Kingsbury; Rogerio Feris; James R. Glass; Hilde Kuehne

arXiv:2604.00696·cs.CV·April 2, 2026

TL;DR

TTA-Vid introduces a test-time reinforcement learning approach for video reasoning that adapts a pretrained model on incoming video data without labels, improving performance across tasks.

Contribution

It proposes a test-time adaptation method that combines step-by-step reasoning over multiple frame subsets with reward-based model updates, so that a pretrained model can generalize across datasets without labeled training data.

Findings

01. Outperforms state-of-the-art methods on multiple video reasoning tasks.

02. Requires no ground-truth annotations during adaptation.

03. Generalizes effectively across different datasets.

Abstract

Recent video reasoning models have shown strong results on temporal and multimodal understanding, yet they depend on large-scale supervised data and multi-stage training pipelines, making them costly to train and difficult to adapt to new domains. In this work, we apply Test-Time Reinforcement Learning to video-language data, adapting a pretrained model to incoming video samples at test time without explicit labels. The proposed test-time adaptation approach for video (TTA-Vid) combines two components that work simultaneously: (1) a test-time adaptation that performs step-by-step reasoning at inference time on multiple frame subsets. We then use a batch-aware frequency-based reward computed across different frame subsets as pseudo ground truth to update the model. We show that the resulting model trained on a single batch or even a single sample from a…
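The abstract does not give the exact reward formula, so the following is only a minimal sketch of how a frequency-based reward over frame subsets could work. It assumes the model produces one answer per frame subset, that the most frequent answer serves as the pseudo ground truth, and that each answer's reward is its relative frequency. The function name `frequency_rewards` and the example answers are hypothetical, and the batch-aware part of the reward is not modeled here.

```python
from collections import Counter


def frequency_rewards(answers):
    """Hypothetical helper: score answers sampled from different frame subsets.

    Assumption (not from the paper): the reward of an answer is the fraction
    of frame subsets that produced the same answer.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    total = len(answers)
    # The most frequent answer acts as the pseudo ground truth.
    # Ties resolve to the answer that was seen first.
    pseudo_label = counts.most_common(1)[0][0]
    # Each answer is rewarded by how often it occurs across subsets.
    rewards = [counts[a] / total for a in answers]
    return pseudo_label, rewards


# Illustrative answers from five frame subsets of the same video question.
answers = ["B", "B", "A", "B", "C"]
label, rewards = frequency_rewards(answers)
print(label, rewards)
```

In a reinforcement-learning update, rewards of this kind would stand in for the missing labels: answers that agree with the consensus across frame subsets are reinforced, and outliers are not.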
