Multimodal Transformers are Hierarchical Modal-wise Heterogeneous Graphs

Yijie Jin; Junjie Peng; Xuanchao Lin; Haochen Yuan; Lan Wang; Cangzhi Zheng

arXiv:2505.01068·cs.CL·August 25, 2025

TL;DR

This paper introduces GsiT, a graph-structured multimodal transformer that improves efficiency and performance in multimodal sentiment analysis by modeling Multimodal Transformers (MulTs) as hierarchical modal-wise heterogeneous graphs and applying an Interlaced Mask mechanism.

Contribution

The paper formalizes MulTs as hierarchical modal-wise heterogeneous graphs (HMHGs) and proposes GsiT, which uses an Interlaced Mask for efficient, high-performance multimodal fusion.

Findings

01. GsiT achieves 1/3 parameter reduction compared to traditional MulTs.

02. GsiT significantly outperforms traditional MulTs on MSA datasets.

03. The HMHG concept improves model efficiency and effectiveness.

Abstract

Multimodal Sentiment Analysis (MSA) is a rapidly developing field that integrates multimodal information to recognize sentiments, and existing models have made significant progress in this area. The central challenge in MSA is multimodal fusion, which is predominantly addressed by Multimodal Transformers (MulTs). Although they serve as the dominant paradigm, MulTs suffer from efficiency concerns. In this work, from the perspective of efficiency optimization, we propose and prove that MulTs are hierarchical modal-wise heterogeneous graphs (HMHGs), and we introduce the graph-structured representation pattern of MulTs. Based on this pattern, we propose an Interlaced Mask (IM) mechanism to design the Graph-Structured and Interlaced-Masked Multimodal Transformer (GsiT). GsiT is formally equivalent to MulTs and, through IM, achieves an efficient weight-sharing mechanism without information disorder, enabling…
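
The abstract does not spell out how the Interlaced Mask is built, so the following is only a rough sketch of the general idea of masking attention over concatenated modality tokens so that one set of shared weights can serve several cross-modal directions. The block layout (each modality attends only to the next or previous modality in a cycle, with intra-modal blocks masked out), the function names `interlaced_mask` and `masked_attention`, and the use of identity projections are all assumptions made for illustration and may differ from the construction used in GsiT.

```python
import numpy as np

def interlaced_mask(lengths, direction="forward"):
    """Boolean mask over concatenated modality tokens (hypothetical layout).

    lengths: number of tokens per modality, e.g. (text, vision, audio).
    mask[q, k] is True where query token q may attend to key token k.
    Assumption: each modality attends only to the next ("forward") or the
    previous ("backward") modality in a cycle; intra-modal blocks stay False.
    """
    n_mod = len(lengths)
    offsets = np.concatenate(([0], np.cumsum(lengths)))
    total = int(offsets[-1])
    mask = np.zeros((total, total), dtype=bool)
    step = 1 if direction == "forward" else -1
    for q in range(n_mod):
        k = (q + step) % n_mod  # index of the modality that q attends to
        mask[offsets[q]:offsets[q + 1], offsets[k]:offsets[k + 1]] = True
    return mask

def masked_attention(x, mask):
    """Single-head scaled dot-product attention with identity projections.

    A real model would apply learned query/key/value projections; they are
    omitted here so the example isolates the effect of the mask.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # block disallowed pairs
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

# Toy input: 2 text, 3 vision, and 4 audio tokens of width 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(9, 8))
fwd = interlaced_mask((2, 3, 4), "forward")
bwd = interlaced_mask((2, 3, 4), "backward")
fused = masked_attention(tokens, fwd)
```

Under these assumptions the forward and backward masks are disjoint and together cover every cross-modal block, so two passes through the same attention weights would touch each ordered pair of distinct modalities exactly once without mixing tokens within a modality.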

Taxonomy

Methods: Linear Layer · Multi-Head Attention · Dense Connections · Adam · Attention Is All You Need · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax