Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation

Runhao Zeng; Qi Deng; Ronghao Zhang; Shuaicheng Niu; Jian Chen; Xiping Hu; Victor C. M. Leung

arXiv:2506.12481 · cs.CV · June 17, 2025

TL;DR

This paper introduces a novel audio-assisted test-time adaptation method for video models that leverages audio cues to generate pseudo-labels, improving model robustness across various datasets and corruption types.

Contribution

The paper combines a pre-trained audio classifier with a large language model to produce audio-assisted pseudo-labels for video test-time adaptation, and tailors the adaptation cycle to each sample.

Findings

01. Consistently improves performance across multiple datasets.

02. Effective integration of audio cues enhances model robustness.

03. Outperforms existing TTA methods on corrupted video datasets.

Abstract

Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels, a new concept in the context of video TTA. Specifically, we propose an audio-to-video label mapping method by first employing pre-trained audio models to classify audio signals extracted from videos and then mapping the audio-based predictions to video label spaces through large language models, thereby establishing a connection between the audio categories and video labels.…
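
The audio-to-video label mapping described in the abstract can be illustrated with a minimal sketch. It assumes the audio classifier has already produced class probabilities and that the LLM-derived mapping is available as a plain dictionary. The names `AUDIO_TO_VIDEO` and `audio_pseudo_label`, the example classes, and the even split of probability across mapped labels are all hypothetical choices made for illustration, not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical audio-to-video label mapping. In the paper this link comes
# from a large language model; it is hard-coded here for illustration only.
AUDIO_TO_VIDEO = {
    "guitar": ["playing guitar"],
    "applause": ["clapping", "giving a speech"],
    "water": ["swimming"],
}


def audio_pseudo_label(audio_probs, mapping):
    """Project audio-class probabilities onto the video label space."""
    scores = defaultdict(float)
    for audio_class, prob in audio_probs.items():
        targets = mapping.get(audio_class, [])
        for video_label in targets:
            # Assumption: split an audio class's probability evenly
            # over the video labels it maps to.
            scores[video_label] += prob / len(targets)
    total = sum(scores.values())
    if total == 0.0:
        # No audio class maps to any video label: no pseudo-label.
        return {}
    # Renormalize so the pseudo-label is a distribution over video labels.
    return {label: score / total for label, score in scores.items()}


# Hypothetical output of a pre-trained audio classifier for one clip.
audio_probs = {"guitar": 0.7, "applause": 0.2, "speech": 0.1}
pseudo = audio_pseudo_label(audio_probs, AUDIO_TO_VIDEO)
best = max(pseudo, key=pseudo.get)
print(best)
```

In this sketch, audio classes with no mapped video label (here `"speech"`) contribute nothing, and a clip whose audio maps to no video label yields an empty result that a caller could treat as "skip the audio-assisted signal for this sample".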

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image Processing Techniques · Video Coding and Compression Technologies