SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering
Zhe Yang, Wenrui Li, Guanghui Cheng

TL;DR
SHMamba is a novel model that combines hyperbolic geometry and state space modeling to improve audio-visual question answering by better capturing hierarchical and dynamic relationships with fewer parameters.
Contribution
The paper introduces SHMamba, which integrates hyperbolic space with state space models, and proposes modules for hierarchical and cross-modal understanding that reduce the number of learnable parameters while improving accuracy.
Findings
Outperforms previous methods on AVQA tasks.
Reduces learnable parameters by 78.12%.
Improves average accuracy by 2.53%.
Abstract
The Audio-Visual Question Answering (AVQA) task holds significant potential for applications. Compared to traditional unimodal approaches, the multi-modal input of AVQA makes feature extraction and fusion more challenging. Euclidean space struggles to represent multi-dimensional relationships in data effectively, and it is poorly suited as an embedding space for data with a tree or hierarchical structure. Additionally, the self-attention mechanism in Transformers is effective at capturing the dynamic relationships between elements in a sequence. However, its limitations in window modeling and its quadratic computational complexity reduce its effectiveness in modeling long sequences. To address these limitations, we propose SHMamba: Structured Hyperbolic State Space Model to integrate the…
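The two ideas the abstract contrasts can be illustrated with a minimal sketch. The truncated abstract does not say which hyperbolic model or which state space parameterization SHMamba uses, so the Poincaré ball distance and the fixed-parameter scalar recurrence below are assumptions chosen for illustration, and the function names `poincare_distance` and `linear_ssm_scan` are hypothetical. This is not the authors' implementation.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball.

    Assumption: the Poincaré ball is used here only as a common example of
    hyperbolic space; the source does not name the model SHMamba uses.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


def linear_ssm_scan(inputs, a=0.9, b=1.0, c=1.0):
    """Scalar linear state space recurrence, one step per input element.

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    Assumption: a, b, c are fixed illustrative scalars, not learned or
    input-dependent parameters.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x      # update the hidden state
        outputs.append(c * h)  # read out the current state
    return outputs


# The same Euclidean gap (0.1) spans a larger hyperbolic distance near the boundary.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.8, 0.0), (0.9, 0.0))

# One pass over the sequence: cost grows linearly with sequence length.
impulse_response = linear_ssm_scan([1.0, 0.0, 0.0], a=0.5)
```

Distances in the ball grow rapidly toward the boundary, which is why hyperbolic embeddings are often used for tree-like data. The recurrence visits each element once, in contrast to the pairwise comparisons of self-attention.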
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
