Multimodal Confidence Modeling in Audio-Visual Quality Assessment
Mayesha Maliha R. Mithila, Mylene C.Q. Farias

TL;DR
This paper introduces MCM-AVQA, a confidence-aware framework for audio-visual quality assessment that improves fusion by estimating and utilizing modality-specific confidence scores, especially under asymmetric distortions.
Contribution
The paper presents a multimodal confidence modeling approach that explicitly estimates a confidence score for each modality and feeds it into a dedicated audio-visual mixer to improve AVQA performance.
Findings
Improves correlation with human opinion scores on AVQA benchmarks.
Enhances interpretability under real-world asymmetric distortions.
Effectively suppresses unreliable modality signals during fusion.
Abstract
Audio-visual quality assessment (AVQA) is essential for streaming, teleconferencing, and immersive media. In realistic streaming scenarios, distortions are often asymmetric, where one modality may be severely degraded while the other remains clean. Still, most contemporary AVQA metrics treat audio and video as equally reliable, causing confidence-unaware fusion to emphasize unreliable signals. This paper proposes MCM-AVQA, a multimodal confidence-aware AVQA framework that explicitly estimates modality-specific confidence and injects it into a dedicated audio-visual mixer for cross-modal attention. The Audio-Visual Mixer utilizes frame-level, confidence-guided channel attention to gate fusion, modulating feature interaction between modalities so that high-confidence streams dominate while unreliable inputs are suppressed, preserving temporal degradation patterns. A multi-head visual…
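The idea of letting the more reliable modality dominate fusion can be illustrated with a small sketch. This is not the authors' Audio-Visual Mixer: the learned, confidence-guided channel attention is replaced here by a fixed per-frame normalized weighting, and the function name `confidence_weighted_fusion`, the feature shapes, and the confidence inputs are all assumptions made for illustration.

```python
import numpy as np


def confidence_weighted_fusion(audio, video, conf_audio, conf_video, eps=1e-8):
    """Fuse frame-level features using per-frame modality confidences.

    Hypothetical simplification: a fixed normalized weighting stands in for
    the learned confidence-guided attention described in the abstract.

    audio, video: arrays of shape (T, C), one feature vector per frame.
    conf_audio, conf_video: arrays of shape (T,), non-negative confidences.
    Returns the fused features (T, C) and the normalized weights (2, T).
    """
    conf = np.stack([conf_audio, conf_video], axis=0)  # (2, T)
    # Normalize so the two modality weights sum to 1 on each frame.
    # If both confidences are zero on a frame, both weights are zero.
    weights = conf / (conf.sum(axis=0, keepdims=True) + eps)
    w_audio = weights[0][:, None]  # (T, 1), broadcast over channels
    w_video = weights[1][:, None]
    # The low-confidence modality contributes proportionally less.
    fused = w_audio * audio + w_video * video
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, C = 4, 3
    audio = rng.normal(size=(T, C))
    video = rng.normal(size=(T, C))
    # Asymmetric distortion: video is unreliable on the last two frames.
    conf_audio = np.array([0.9, 0.9, 0.9, 0.9])
    conf_video = np.array([0.9, 0.9, 0.1, 0.1])
    fused, weights = confidence_weighted_fusion(audio, video, conf_audio, conf_video)
    print(weights)
```

Because the weights are computed per frame, a modality that degrades only for part of the clip is down-weighted only on those frames, which loosely mirrors the frame-level behavior the abstract describes.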
