Multi-Modal Scene Graph with Kolmogorov-Arnold Experts for Audio-Visual Question Answering
Zijian Fu, Changsheng Lv, Mengshi Qi, Huadong Ma

TL;DR
This paper introduces a novel multi-modal scene graph and Kolmogorov-Arnold Expert Network to improve audio-visual question answering by better modeling complex interactions and relationships in video content.
Contribution
It presents the first multi-modal scene graph for audio-visual scenes and integrates a Kolmogorov-Arnold Network-based Mixture of Experts for enhanced temporal reasoning.
Findings
Achieves state-of-the-art results on MUSIC-AVQA benchmarks.
Effectively models fine-grained cross-modal interactions.
Improves temporal reasoning in audio-visual question answering.
Abstract
In this paper, we propose a novel Multi-Modal Scene Graph with Kolmogorov-Arnold Expert Network for Audio-Visual Question Answering (SHRIKE). The task aims to mimic human reasoning by extracting and fusing information from audio-visual scenes; the main challenge is identifying question-relevant cues in complex audio-visual content. Existing methods fail to capture the structural information within video and suffer from insufficient fine-grained modeling of multi-modal features. To address these issues, we are the first to introduce a multi-modal scene graph that explicitly models objects and their relationships as a visually grounded, structured representation of the audio-visual scene. Furthermore, we design a Kolmogorov-Arnold Network (KAN)-based Mixture of Experts (MoE) to enhance the expressive power of the temporal integration stage. This enables…
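To make the idea of a KAN-based Mixture of Experts more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the architecture, so the class names (KANLayer, KANMoE), the dense softmax gate, the Gaussian basis functions used in place of B-splines, and all dimensions are illustrative assumptions. It only shows the general pattern: each expert applies learnable one-dimensional functions on every input-output edge, and a gate mixes the expert outputs per time step.

```python
import numpy as np


class KANLayer:
    """Toy KAN-style layer: each input-output edge has its own learnable 1-D function.

    Assumption: the 1-D functions are sums of fixed Gaussian bumps with learnable
    coefficients, a simplification of the B-splines used in standard KANs.
    """

    def __init__(self, in_dim, out_dim, num_basis=8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.centers = np.linspace(-2.0, 2.0, num_basis)  # fixed basis centres
        self.width = self.centers[1] - self.centers[0]
        # coef[i, o, k]: weight of basis k on the edge from input i to output o
        self.coef = rng.normal(scale=0.1, size=(in_dim, out_dim, num_basis))

    def __call__(self, x):
        # x: (steps, in_dim) -> basis activations: (steps, in_dim, num_basis)
        basis = np.exp(-(((x[..., None] - self.centers) / self.width) ** 2))
        # Sum the per-edge functions over inputs i and basis functions k.
        return np.einsum("bik,iok->bo", basis, self.coef)


class KANMoE:
    """Hypothetical dense mixture of KAN experts with a linear softmax gate."""

    def __init__(self, dim, num_experts=4, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = [KANLayer(dim, dim, rng=rng) for _ in range(num_experts)]
        self.gate = rng.normal(scale=0.1, size=(dim, num_experts))

    def __call__(self, x):
        # Numerically stable softmax over experts, computed per time step.
        logits = x @ self.gate
        logits = logits - logits.max(axis=-1, keepdims=True)
        weights = np.exp(logits)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        # Stack expert outputs: (steps, num_experts, dim)
        outputs = np.stack([expert(x) for expert in self.experts], axis=1)
        # Gate-weighted sum over experts.
        return np.einsum("be,bed->bd", weights, outputs), weights


# Usage: 5 time steps of 16-dimensional fused audio-visual features (random stand-ins).
features = np.random.default_rng(1).normal(size=(5, 16))
mixed, gate_weights = KANMoE(dim=16)(features)
```

Here mixed has the same shape as features, and each row of gate_weights sums to 1. A sparse top-k gate or spline-based experts could replace these choices; the abstract does not say which variant the paper uses.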
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
