GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer-based Fusion Network for Multimodal Sentiment Analysis
Yijie Jin

TL;DR
GSIFN introduces a graph-structured, interlaced-masked multimodal Transformer with a self-supervised framework, achieving superior multimodal sentiment analysis performance while reducing computational overhead.
Contribution
The paper proposes GSIFN, a novel multimodal fusion network that effectively balances representation capability and efficiency using graph-structured and interlaced-masked Transformer components.
Findings
Outperforms previous state-of-the-art models on CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets.
Achieves higher accuracy with significantly lower computational overhead.
Demonstrates robustness and efficiency in multimodal sentiment analysis.
Abstract
Multimodal Sentiment Analysis (MSA) leverages multiple data modalities to analyze human sentiment. Existing MSA models generally employ cutting-edge multimodal fusion and representation learning-based methods to improve MSA capability. However, two key challenges remain: (i) in existing multimodal fusion methods, the decoupling of modal combinations and tremendous parameter redundancy lead to insufficient fusion performance and efficiency; (ii) a challenging trade-off exists between representation capability and computational overhead in unimodal feature extractors and encoders. Our proposed GSIFN incorporates two main components to solve these problems: (i) a graph-structured and interlaced-masked multimodal Transformer, which adopts the Interlaced Mask mechanism to construct robust multimodal graph embedding, achieve all-modal-in-one Transformer-based fusion, and greatly reduce the…
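The abstract is cut off before it details the Interlaced Mask, so the following is only a rough sketch of the general idea of fusing all modalities in a single masked attention pass. It assumes the text, vision, and audio tokens are concatenated into one sequence and that a block-structured boolean mask decides which modality may attend to which. The function names (`block_mask`, `masked_attention`), the token counts, and the cyclic text→vision→audio→text pattern are illustrative assumptions, not the masks defined in the paper.

```python
import numpy as np

def block_mask(lengths, allowed):
    """Boolean attention mask over concatenated modality tokens.

    lengths: dict of modality name -> token count (dict order = concatenation order).
    allowed: set of (query_modality, key_modality) pairs permitted to attend.
    """
    names = list(lengths)
    offsets = np.cumsum([0] + [lengths[n] for n in names])
    total = int(offsets[-1])
    mask = np.zeros((total, total), dtype=bool)
    for i, q in enumerate(names):
        for j, k in enumerate(names):
            if (q, k) in allowed:
                mask[offsets[i]:offsets[i + 1], offsets[j]:offsets[j + 1]] = True
    return mask

def masked_attention(x, mask):
    """Single-head self-attention (no learned projections) restricted by mask."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)       # blocked pairs get zero weight
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

# Hypothetical setup: 3 text, 2 vision, 2 audio tokens with 4-dim features.
lengths = {"text": 3, "vision": 2, "audio": 2}

# Assumed pattern: every modality attends to itself plus one neighbour, in a cycle.
allowed = {
    ("text", "text"), ("vision", "vision"), ("audio", "audio"),
    ("text", "vision"), ("vision", "audio"), ("audio", "text"),
}

mask = block_mask(lengths, allowed)
rng = np.random.default_rng(0)
x = rng.normal(size=(sum(lengths.values()), 4))
fused, weights = masked_attention(x, mask)
```

Each modality keeps its self-attention block here, so every row of the mask has at least one permitted key and the softmax stays well defined. Swapping in a different `allowed` set changes the edges of the implied modality graph without touching the attention code.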
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Advanced Text Analysis Techniques · Emotion and Mood Recognition
Methods: Attention Is All You Need · Linear Layer · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Tanh Activation · Residual Connection · Multi-Head Attention · Byte Pair Encoding
