$\texttt{AVROBUSTBENCH}$: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time

Sarthak Kumar Maharana; Saksham Singh Kushwaha; Baoming Zhang; Adrian Rodriguez; Songtao Wei; Yapeng Tian; Yunhui Guo

arXiv:2506.00358 · cs.SD · October 28, 2025

TL;DR

This paper introduces AVROBUSTBENCH, a comprehensive benchmark for evaluating the robustness of audio-visual recognition models under simultaneous corruptions in both modalities, revealing current models' vulnerabilities and proposing a simple test-time adaptation method.

Contribution

It presents AVROBUSTBENCH, the first benchmark assessing bimodal corruptions, and proposes AV2C, a novel test-time adaptation method for improved robustness.

Findings

01

State-of-the-art models' robustness declines with corruption severity.

02

Existing TTA methods offer minimal improvements under bimodal corruptions.

03

AV2C improves performance by penalizing high-entropy samples.
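The idea of penalizing high-entropy samples can be illustrated with a minimal sketch. This is not the AV2C implementation: the linear weighting rule and the function names `entropy_weights` and `weighted_entropy_loss` are assumptions chosen for illustration. It only shows how uncertain predictions can be down-weighted in an entropy-based test-time objective.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one sample's class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weights(batch_logits):
    # Hypothetical rule: weight = 1 - H / H_max, so a uniform (maximally
    # uncertain) prediction gets weight 0 and a confident one gets ~1.
    # Assumes at least two classes per sample.
    weights = []
    for logits in batch_logits:
        h = entropy(softmax(logits))
        h_max = math.log(len(logits))
        weights.append(max(0.0, 1.0 - h / h_max))
    return weights

def weighted_entropy_loss(batch_logits):
    # Mean of per-sample entropies, each scaled by its weight, so
    # high-entropy samples contribute less to the adaptation signal.
    weights = entropy_weights(batch_logits)
    entropies = [entropy(softmax(l)) for l in batch_logits]
    return sum(w * h for w, h in zip(weights, entropies)) / len(batch_logits)

batch = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]
print(entropy_weights(batch))
print(weighted_entropy_loss(batch))
```

In this sketch the first sample (uniform logits) receives zero weight, while the second (one dominant logit) is kept almost in full. A real test-time adaptation method would backpropagate such a loss through selected model parameters.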

Abstract

While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time remains not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur simultaneously in both audio and visual modalities, we introduce AVROBUSTBENCH, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. AVROBUSTBENCH comprises four audio-visual benchmark datasets, AUDIOSET-2C, VGGSOUND-2C, KINETICS-2C, and EPICKITCHENS-2C, each incorporating 75 bimodal audio-visual corruptions that are co-occurring and correlated. Through extensive…
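To make "co-occurring" corruptions concrete, the following is a rough sketch of corrupting both modalities of one clip at a shared severity level. It is not the benchmark's corruption code: the Gaussian-noise choice, the `SIGMA_BY_SEVERITY` table, and the function `corrupt_pair` are hypothetical, and the audio and frame are represented as flat lists of floats for simplicity.

```python
import random

# Hypothetical noise scales per severity level; not taken from the paper.
SIGMA_BY_SEVERITY = {1: 0.02, 2: 0.05, 3: 0.1, 4: 0.2, 5: 0.4}

def clip(x, lo, hi):
    # Keep a value inside the valid range of its modality.
    return max(lo, min(hi, x))

def corrupt_pair(audio, frame, severity, seed=0):
    # Add Gaussian noise to BOTH the audio samples (range [-1, 1]) and
    # the frame pixels (range [0, 1]) of the same clip, using the same
    # severity, so the shift appears in the two modalities together.
    rng = random.Random(seed)
    sigma = SIGMA_BY_SEVERITY[severity]
    noisy_audio = [clip(s + rng.gauss(0.0, sigma), -1.0, 1.0) for s in audio]
    noisy_frame = [clip(p + rng.gauss(0.0, sigma), 0.0, 1.0) for p in frame]
    return noisy_audio, noisy_frame

audio = [0.0] * 100   # silent waveform
frame = [0.5] * 100   # mid-gray pixels
mild_audio, mild_frame = corrupt_pair(audio, frame, severity=1)
harsh_audio, harsh_frame = corrupt_pair(audio, frame, severity=5)
print(sum(abs(a) for a in mild_audio), sum(abs(a) for a in harsh_audio))
```

Evaluating a model on such pairs at increasing severity is one way to observe how accuracy changes as the bimodal shift grows.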

Code & Models

Repositories

sarthaxxxxx/av-c-robustness-benchmark
PyTorch · Official

Datasets

sakshamsingh1/av_robust_data
dataset · 138 downloads

Videos

$\texttt{AVROBUSTBENCH}$: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis