Hybrid Hypergraph Networks for Multimodal Sequence Data Classification
Feng Xu, Hui Wang, Yuting Huang, Danwei Zhang, Zizhu Fan

TL;DR
This paper introduces a hybrid hypergraph network that models temporal multimodal data by segmenting sequences into nodes and capturing complex intra- and inter-modal relationships, achieving state-of-the-art classification results.
Contribution
The paper presents a novel hybrid hypergraph framework that effectively models temporal dependencies and cross-modal interactions in multimodal sequence data.
Findings
Achieves state-of-the-art results on four multimodal datasets.
Effectively captures high-order intra-modal dependencies.
Enhances multimodal classification accuracy.
Abstract
Modeling temporal multimodal data poses significant challenges in classification tasks, particularly in capturing long-range temporal dependencies and intricate cross-modal interactions. Audiovisual data, a representative example, is inherently characterized by strict temporal order and diverse modalities. Effectively leveraging the temporal structure is essential for understanding both intra-modal dynamics and inter-modal correlations. However, most existing approaches treat each modality independently and rely on shallow fusion strategies, which overlook temporal dependencies and hinder the model's ability to represent complex structural relationships. To address this limitation, we propose the hybrid hypergraph network (HHN), a novel framework that models temporal multimodal data via a segmentation-first, graph-later strategy. HHN splits sequences into timestamped segments as nodes…
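The abstract is truncated here, so the exact hypergraph construction used by HHN is not stated. As a rough illustration of the segmentation-first, graph-later idea only, the sketch below mean-pools fixed-length segments into node features and builds a hypergraph incidence matrix. Everything in it is an assumption made for illustration and not the authors' method: the function names `segment` and `build_incidence`, the mean pooling, the sliding temporal window for intra-modal hyperedges, and the same-timestamp grouping for inter-modal hyperedges.

```python
import numpy as np

def segment(seq, seg_len):
    """Split a (T, D) sequence into non-overlapping segments of seg_len steps
    and mean-pool each segment into one node feature (assumed pooling)."""
    n = seq.shape[0] // seg_len
    trimmed = seq[: n * seg_len]  # drop any trailing partial segment
    return trimmed.reshape(n, seg_len, seq.shape[1]).mean(axis=1)

def build_incidence(n_segments, n_modalities, window):
    """Return an incidence matrix H of shape (nodes, hyperedges).

    Node index is m * n_segments + t for modality m and segment t.
    Intra-modal hyperedge (assumed): `window` consecutive segments of one modality.
    Inter-modal hyperedge (assumed): all modalities at the same segment index.
    """
    edges = []
    for m in range(n_modalities):
        for start in range(n_segments - window + 1):
            edges.append([m * n_segments + t for t in range(start, start + window)])
    for t in range(n_segments):
        edges.append([m * n_segments + t for m in range(n_modalities)])

    H = np.zeros((n_segments * n_modalities, len(edges)), dtype=np.float32)
    for e, members in enumerate(edges):
        H[members, e] = 1.0  # node belongs to hyperedge e
    return H

# Hypothetical usage: two modalities, 12 time steps each, 3 steps per segment.
audio_nodes = segment(np.random.rand(12, 4), seg_len=3)  # shape (4, 4)
video_nodes = segment(np.random.rand(12, 6), seg_len=3)  # shape (4, 6)
H = build_incidence(n_segments=4, n_modalities=2, window=2)  # shape (8, 10)
```

Because a hyperedge can join any number of nodes, one column of `H` can tie together several consecutive segments or every modality at a given timestamp, which is the kind of high-order relationship a pairwise graph edge cannot express directly.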
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Emotion and Mood Recognition
