Multi-Modal Scene Graph with Kolmogorov-Arnold Experts for Audio-Visual Question Answering

Zijian Fu; Changsheng Lv; Mengshi Qi; Huadong Ma

arXiv:2511.23304 · cs.AI · December 1, 2025

Open Access

TL;DR

This paper introduces a novel multi-modal scene graph and Kolmogorov-Arnold Expert Network to improve audio-visual question answering by better modeling complex interactions and relationships in video content.

Contribution

It presents the first multi-modal scene graph for audio-visual scenes and integrates a Kolmogorov-Arnold Network-based Mixture of Experts for enhanced temporal reasoning.

Findings

01. Achieves state-of-the-art results on MUSIC-AVQA benchmarks.
02. Effectively models fine-grained cross-modal interactions.
03. Improves temporal reasoning in audio-visual question answering.

Abstract

In this paper, we propose a novel Multi-Modal Scene Graph with Kolmogorov-Arnold Expert Network for Audio-Visual Question Answering (SHRIKE). The task aims to mimic human reasoning by extracting and fusing information from audio-visual scenes, with the main challenge being the identification of question-relevant cues from the complex audio-visual content. Existing methods fail to capture the structural information within video and suffer from insufficient fine-grained modeling of multi-modal features. To address these issues, we are the first to introduce a multi-modal scene graph that explicitly models the objects and their relationships as a visually grounded, structured representation of the audio-visual scene. Furthermore, we design a Kolmogorov-Arnold Network (KAN)-based Mixture of Experts (MoE) to enhance the expressive power of the temporal integration stage. This enables…
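To make the idea of a KAN-based Mixture of Experts more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a dense softmax gate over a small number of experts, and it parameterizes each learnable edge function with fixed Gaussian radial basis functions rather than the B-splines used in the original KAN formulation. The names `KANExpert` and `KANMoE`, the basis count, and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


class KANExpert:
    """One KAN-style layer: each input-output edge has its own learnable 1-D function.

    Assumption: edge functions are linear combinations of fixed Gaussian RBFs
    (a simplification; not taken from the paper).
    """

    def __init__(self, d_in, d_out, n_basis=8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.centers = np.linspace(-2.0, 2.0, n_basis)  # fixed basis centres
        self.width = self.centers[1] - self.centers[0]
        # coef[i, o, k]: weight of basis k on the edge from input i to output o
        self.coef = rng.normal(0.0, 0.1, size=(d_in, d_out, n_basis))

    def __call__(self, x):
        # x: (batch, d_in) -> basis activations: (batch, d_in, n_basis)
        basis = np.exp(-(((x[..., None] - self.centers) / self.width) ** 2))
        # Each output is a sum of univariate functions of the inputs.
        return np.einsum("bik,iok->bo", basis, self.coef)


class KANMoE:
    """Dense mixture of KAN experts with a softmax gate (hypothetical design)."""

    def __init__(self, d_in, d_out, n_experts=4, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = [KANExpert(d_in, d_out, rng=rng) for _ in range(n_experts)]
        self.gate_w = rng.normal(0.0, 0.1, size=(d_in, n_experts))

    def gate(self, x):
        logits = x @ self.gate_w
        logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
        w = np.exp(logits)
        return w / w.sum(axis=-1, keepdims=True)

    def __call__(self, x):
        weights = self.gate(x)                                 # (batch, n_experts)
        outs = np.stack([e(x) for e in self.experts], axis=1)  # (batch, n_experts, d_out)
        return np.einsum("be,beo->bo", weights, outs)          # gate-weighted sum


# Usage: 5 feature vectors of size 16 mapped to size 8.
x = np.random.default_rng(1).normal(size=(5, 16))
moe = KANMoE(d_in=16, d_out=8)
y = moe(x)
```

In a temporal setting, per-timestep features of shape `(batch, time, d_in)` could be reshaped to `(batch * time, d_in)` before the call and reshaped back afterwards. Whether the paper routes per timestep or per clip is not stated in the abstract.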

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection