Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning

Jiangfeng Sun; Sihao He; Zhonghong Ou; Meina Song

arXiv:2508.18322 · cs.CV · August 27, 2025

TL;DR

This paper introduces the Structural-Semantic Unifier (SSU), a novel multimodal fusion framework that leverages modality-specific graphs and contrastive learning to improve sentiment analysis by capturing structural dependencies and semantic alignment.

Contribution

The paper proposes SSU, a new framework that integrates structural and semantic information across modalities using graph construction and contrastive learning, advancing multimodal sentiment analysis.

Findings

1. SSU achieves state-of-the-art results on CMU-MOSI and CMU-MOSEI datasets.

2. SSU reduces computational overhead compared to previous methods.

3. Qualitative analysis shows improved interpretability and nuanced emotional pattern capture.

Abstract

Multimodal sentiment analysis (MSA) aims to infer emotional states by effectively integrating textual, acoustic, and visual modalities. Despite notable progress, existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, limiting their quality, interpretability, and robustness. To address these challenges, we propose a novel framework called the Structural-Semantic Unifier (SSU), which systematically integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. Specifically, SSU dynamically constructs modality-specific graphs by leveraging linguistic syntax for text and a lightweight, text-guided attention mechanism for acoustic and visual modalities, thus capturing detailed intra-modal relationships and semantic interactions. We further introduce a semantic…
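
To make the idea of contrastive cross-modal alignment concrete, the sketch below shows a generic symmetric InfoNCE loss between paired text embeddings and embeddings from another modality. It is an illustration under stated assumptions, not the objective defined in the paper: the summary above does not specify the loss, and the function name `info_nce`, the `temperature` parameter, and the symmetric form are all hypothetical choices.

```python
import numpy as np


def info_nce(text_emb, other_emb, temperature=0.1):
    """Generic symmetric InfoNCE loss for paired embeddings of shape (batch, dim).

    Assumption: row i of text_emb and row i of other_emb come from the same
    utterance (a positive pair); every other row in the batch is a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    o = other_emb / np.linalg.norm(other_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix; the diagonal holds the matched pairs.
    logits = (t @ o.T) / temperature

    def diagonal_cross_entropy(z):
        # Row-wise log-softmax, with the row maximum subtracted for stability.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the text-to-other and other-to-text directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(8, 16))
    aligned = text + 0.01 * rng.normal(size=(8, 16))  # near-copies of the text rows
    print("aligned pairs:   ", info_nce(text, aligned))
    print("mismatched pairs:", info_nce(text, aligned[::-1]))
```

Correctly paired embeddings should give a much lower loss than deliberately mismatched ones, which is the property a contrastive alignment term relies on.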
