$\texttt{AVROBUSTBENCH}$: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time
Sarthak Kumar Maharana, Saksham Singh Kushwaha, Baoming Zhang, Adrian Rodriguez, Songtao Wei, Yapeng Tian, Yunhui Guo

TL;DR
This paper introduces AVROBUSTBENCH, a comprehensive benchmark for evaluating the robustness of audio-visual recognition models under simultaneous corruptions in both modalities, revealing current models' vulnerabilities and proposing a simple test-time adaptation method.
Contribution
It presents AVROBUSTBENCH, the first benchmark assessing bimodal corruptions, and proposes AV2C, a novel test-time adaptation method for improved robustness.
Findings
The robustness of state-of-the-art models declines as corruption severity increases.
Existing TTA methods offer minimal improvements under bimodal corruptions.
AV2C improves performance by penalizing high-entropy samples.
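The last finding can be illustrated with a generic sketch of entropy-based sample weighting, a common ingredient of test-time adaptation objectives. This summary does not give the exact AV2C objective, so the code below is not the authors' method. The function names, the hard threshold, and the `exp(-H)` weighting are all assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sample_weight(h, threshold):
    """Hypothetical weighting rule: drop samples whose entropy is at or
    above the threshold, and down-weight the rest by exp(-h)."""
    if h >= threshold:
        return 0.0
    return math.exp(-h)

def entropy_weighted_loss(batch_logits, threshold):
    """Batch-averaged entropy in which high-entropy (unreliable)
    predictions contribute less, or nothing, to the adaptation signal."""
    total = 0.0
    for logits in batch_logits:
        h = entropy(softmax(logits))
        total += sample_weight(h, threshold) * h
    return total / len(batch_logits)

# Illustrative batch: one confident prediction, one maximally uncertain one.
num_classes = 4
threshold = 0.4 * math.log(num_classes)  # assumed cutoff, not from the paper
batch = [[10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = entropy_weighted_loss(batch, threshold)
```

In this sketch the uniform prediction is excluded entirely, so only the confident sample drives the loss. In a real setting the logits would come from the audio-visual model's fused prediction and the loss would be minimized with respect to a small set of adaptable parameters.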
Abstract
While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time remains not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur in both audio and visual modalities, we introduce AVROBUSTBENCH, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. AVROBUSTBENCH comprises four audio-visual benchmark datasets, each incorporating 75 bimodal audio-visual corruptions. Through extensive…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
