Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality

Kyu Ri Park; Hong Joo Lee; Jung Uk Kim

arXiv:2407.16171 · cs.CV · July 25, 2024 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces an audio-visual question answering (AVQA) framework that stays robust when the audio or visual modality is missing: a relation-aware generator recalls the missing modality from the available ones, and a diffusion model further enhances the audio-visual features.

Contribution

The paper presents a relation-aware missing modal generator and an audio-visual diffusion model that improve AVQA performance when a modality is missing, a situation that arises in real-world scenarios.

Findings

01. Enhanced AVQA accuracy with missing modalities
02. Effective recall of missing modal information
03. Improved multi-modal feature enhancement

Abstract

Recent Audio-Visual Question Answering (AVQA) methods rely on complete visual and audio input to answer questions accurately. However, in real-world scenarios, issues such as device malfunctions and data transmission errors frequently result in a missing audio or visual modality. In such cases, existing AVQA methods suffer significant performance degradation. In this paper, we propose a framework that ensures robust AVQA performance even when a modality is missing. First, we propose a Relation-aware Missing Modal (RMM) generator with Relation-aware Missing Modal Recalling (RMMR) loss to enhance the ability of the generator to recall missing modal information by understanding the relationships and context among the available modalities. Second, we design an Audio-Visual Relation-aware (AVR) diffusion model with Audio-Visual Enhancing (AVE) loss to further enhance audio-visual features by…
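To make the missing-modality setting concrete, the toy sketch below shows only the input/output contract described in the abstract: one of the audio or visual features is absent, and a pseudo feature is produced from the modalities that remain. It is not the paper's RMM generator. The function names, the dictionary layout, and the element-wise mean used as a stand-in for a learned generator are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

Vector = List[float]
MODALITIES = ("audio", "visual", "question")


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all feature vectors must share one dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def fill_missing_modality(sample: Dict[str, Optional[Vector]]) -> Dict[str, Vector]:
    """Return a copy of `sample` in which a missing (None) audio or visual
    feature is replaced by a pseudo feature built from the available ones.

    ASSUMPTION: the mean is a hypothetical placeholder for a learned,
    relation-aware generator; it only illustrates the data flow.
    """
    if sample.get("question") is None:
        raise ValueError("the question is assumed to be always available")
    missing = [m for m in ("audio", "visual") if sample.get(m) is None]
    if len(missing) > 1:
        raise ValueError("at most one of audio/visual may be missing")
    # Copy the features that are present so the input is not mutated.
    filled = {m: list(v) for m, v in sample.items() if v is not None}
    if missing:
        available = [filled[m] for m in MODALITIES if m in filled]
        filled[missing[0]] = mean_vector(available)
    return filled


# Audio is missing; visual and question features are available.
sample = {"audio": None, "visual": [1.0, 3.0], "question": [3.0, 5.0]}
filled = fill_missing_modality(sample)
print(filled["audio"])  # [2.0, 4.0]
```

In a real system the placeholder would be replaced by a trained model, and the recalled feature would then be passed on to the downstream question-answering network together with the available modalities.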

Code & Models

Repositories

visualaikhu/missing-avqa (PyTorch, official)

Taxonomy

Topics: Subtitles and Audiovisual Media · Speech and Audio Processing · Music and Audio Processing

Methods: Diffusion