MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding

Zhicheng Zhang; Wuyou Xia; Chenxi Zhao; Zhou Yan; Xiaoqiang Liu; Yongjie Zhu; Wenyu Qin; Pengfei Wan; Di Zhang; Jufeng Yang

arXiv:2507.04635 · cs.CV · July 8, 2025

TL;DR

This paper introduces MODA, a novel attention mechanism for multimodal models. It targets high-level perception, cognition, and emotion understanding by addressing two attention deficits: inconsistent cross-modal attention and layer-by-layer decay of attention activation.

Contribution

The paper proposes MODA, an attention mechanism that performs inner-modal refinement and inter-modal interaction simultaneously, and that uses a correct-after-align strategy to decouple modality alignment from cross-layer token mixing.

Findings

01  MODA outperforms existing methods on 21 benchmark datasets.

02  MODA effectively improves fine-grained cognition and emotion understanding.

03  The approach enhances cross-modal interaction and attention consistency.

Abstract

Multimodal large language models (MLLMs) have recently shown a strong capacity for integrating data across multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning and pay less attention to how multimodal tokens are mixed through attention, which poses challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), which simultaneously conducts inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment…
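The abstract is cut off before it describes the mechanism, so the following is only a rough illustration of the distinction it draws between inner-modal and inter-modal attention. It is not the authors' implementation. The function name `split_attention`, the boolean `is_visual` flag, and the use of plain masking are all assumptions made for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) get weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_attention(q, k, v, is_visual):
    """Hypothetical helper: compute attention twice over the same tokens.

    inner: each token attends only to tokens of its own modality.
    inter: each token attends only to tokens of the other modality.
    Assumes both modalities are present, so no row is fully masked.
    """
    assert is_visual.any() and (~is_visual).any(), "need both modalities"
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (n_tokens, n_tokens)
    same = is_visual[:, None] == is_visual[None, :]  # same-modality pairs
    inner = softmax(np.where(same, scores, -np.inf)) @ v
    inter = softmax(np.where(~same, scores, -np.inf)) @ v
    return inner, inter

# Toy usage: two visual tokens followed by three text tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))
is_visual = np.array([True, True, False, False, False])
inner, inter = split_attention(tokens, tokens, tokens, is_visual)
```

Keeping the two outputs separate makes it possible to inspect or rescale each pathway on its own. How MODA actually combines them, and what its alignment step does, is not recoverable from the truncated abstract.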

Code & Models

Models

KlingTeam/MODA (Hugging Face model)

Videos

MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding · SlidesLive