Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality
Kyu Ri Park, Hong Joo Lee, Jung Uk Kim

TL;DR
This paper introduces a robust audio-visual question answering (AVQA) framework that handles a missing audio or visual modality by using a relation-aware generator to recall the missing information and a diffusion model to enhance the resulting audio-visual features.
Contribution
The paper presents a novel relation-aware missing modal generator and an audio-visual diffusion model that together improve AVQA performance when a modality is missing, a common problem in real-world scenarios.
Findings
Enhanced AVQA accuracy with missing modalities
Effective recall of missing modal information
Improved multi-modal feature enhancement
Abstract
Recent Audio-Visual Question Answering (AVQA) methods rely on complete visual and audio input to answer questions accurately. However, in real-world scenarios, issues such as device malfunctions and data transmission errors frequently result in missing audio or visual modality. In such cases, existing AVQA methods suffer significant performance degradation. In this paper, we propose a framework that ensures robust AVQA performance even when a modality is missing. First, we propose a Relation-aware Missing Modal (RMM) generator with Relation-aware Missing Modal Recalling (RMMR) loss to enhance the ability of the generator to recall missing modal information by understanding the relationships and context among the available modalities. Second, we design an Audio-Visual Relation-aware (AVR) diffusion model with Audio-Visual Enhancing (AVE) loss to further enhance audio-visual features by…
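The general idea of recalling a missing modality from the available ones can be illustrated with a toy sketch. This is not the paper's RMM generator or RMMR loss, whose details are not given here; the linear map, the mean-squared-error objective, the feature dimension, and every function name below are assumptions chosen only for illustration.

```python
import random

def init_linear(in_dim, out_dim, rng):
    """Create small random weights and zero biases for y = W x + b."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return w, b

def linear(x, w, b):
    """Apply y = W x + b to a feature vector stored as a plain list."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def recall_missing(available_feats, w, b):
    """Hypothetical generator: concatenate the available modality features
    and map them to a pseudo feature for the missing modality."""
    joint = [v for feat in available_feats for v in feat]
    return linear(joint, w, b)

def recall_loss(pseudo, real):
    """Mean squared error between recalled and real features.
    An assumed stand-in, not the RMMR loss from the paper."""
    return sum((p - r) ** 2 for p, r in zip(pseudo, real)) / len(real)

rng = random.Random(0)
dim = 4  # illustrative feature size

# Toy features; in practice these would come from pretrained encoders.
visual = [rng.random() for _ in range(dim)]
question = [rng.random() for _ in range(dim)]
audio = [rng.random() for _ in range(dim)]  # real audio, available only during training

# Generator maps (visual, question) -> pseudo audio feature.
w, b = init_linear(2 * dim, dim, rng)
pseudo_audio = recall_missing([visual, question], w, b)

# During training, the recalled feature is pulled toward the real one.
loss = recall_loss(pseudo_audio, audio)
```

At inference time with audio missing, `pseudo_audio` would take the place of the real audio feature in the downstream question-answering model; a symmetric generator would cover the missing-visual case.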
Taxonomy
Topics: Subtitles and Audiovisual Media · Speech and Audio Processing · Music and Audio Processing
Methods: Diffusion
