SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering

Zhe Yang; Wenrui Li; Guanghui Cheng

arXiv:2406.09833 · cs.AI · July 17, 2024

TL;DR

SHMamba is a novel model that combines hyperbolic geometry and state space modeling to improve audio-visual question answering by better capturing hierarchical and dynamic relationships with fewer parameters.

Contribution

The paper introduces SHMamba, integrating hyperbolic space with state space models, and proposes modules for hierarchical and cross-modal understanding, reducing parameters and enhancing performance.
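For readers unfamiliar with state space models, the core idea is a linear recurrence that carries a hidden state along the sequence, so the cost grows linearly with sequence length. The following is a minimal sketch under simplifying assumptions: a scalar, time-invariant recurrence with hypothetical parameters `a`, `b`, and `c`. It is not the SHMamba architecture, and it omits the input-dependent (selective) parameters that Mamba-style models use.

```python
def ssm_scan(xs, a, b, c):
    """Run a scalar linear state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    a, b, c are illustrative fixed scalars, not values from the paper.
    """
    h = 0.0
    ys = []
    for x in xs:  # one pass over the sequence: O(len(xs)) work
        h = a * h + b * x
        ys.append(c * h)
    return ys


# An impulse at the first step decays geometrically through the state.
outputs = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
print(outputs)
```

Each step touches only the previous state, whereas self-attention compares every pair of positions in the sequence.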

Findings

1. Outperforms previous methods in AVQA tasks.
2. Reduces learnable parameters by 78.12%.
3. Improves average accuracy by 2.53%.

Abstract

The Audio-Visual Question Answering (AVQA) task holds significant potential for applications. Compared to traditional unimodal approaches, the multi-modal input of AVQA makes feature extraction and fusion more challenging. Euclidean space struggles to represent the multi-dimensional relationships in data effectively, and it is particularly ill-suited as an embedding space for data with a tree or hierarchical structure. Additionally, the self-attention mechanism in Transformers is effective at capturing the dynamic relationships between elements in a sequence. However, its limitations in window modeling and its quadratic computational complexity reduce its effectiveness in modeling long sequences. To address these limitations, we propose SHMamba: Structured Hyperbolic State Space Model to integrate the…
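The point about Euclidean space and hierarchical data can be illustrated with the Poincaré ball, one standard model of hyperbolic space. The sketch below is an assumption for illustration only: the abstract does not state which hyperbolic model or distance SHMamba uses, and the function names are hypothetical. It compares two pairs of points with the same Euclidean separation, one pair near the origin and one near the boundary of the unit ball.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 |u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


def euclidean_distance(u, v):
    """Ordinary straight-line distance, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Both pairs are about 0.1 apart in Euclidean terms (illustrative coordinates).
near_origin = ([0.0, 0.0], [0.1, 0.0])
near_boundary = ([0.85, 0.0], [0.95, 0.0])

for name, (p, q) in [("near origin", near_origin), ("near boundary", near_boundary)]:
    print(name, euclidean_distance(p, q), poincare_distance(p, q))
```

The hyperbolic distance for the pair near the boundary is several times larger than for the pair near the origin. This growth of distances toward the boundary is what leaves room for the many leaves of a tree, which is the usual argument for hyperbolic embeddings of hierarchical data.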

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques