TL;DR
This paper introduces MODA (MOdular Duplex Attention), an attention mechanism for multimodal large language models. It targets the attention deficits the authors identify in multimodal learning, with the aim of improving high-level perception, cognition, and emotion understanding.
Contribution
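The paper identifies an "attention deficit disorder" problem in multimodal learning (inconsistent cross-modal attention and attention activation that decays layer by layer) and proposes MODA, an attention mechanism that combines inner-modal refinement with inter-modal interaction under a correct-after-align strategy.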
Findings
MODA outperforms existing methods on 21 benchmark datasets.
MODA effectively improves fine-grained cognition and emotion understanding.
The approach enhances cross-modal interaction and attention consistency.
Abstract
Multimodal large language models (MLLMs) have recently shown strong capacity for integrating data across multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning and pay less attention to how multimodal tokens are mixed through attention, which poses challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), which simultaneously conducts inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment…
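The abstract excerpt does not give MODA's equations, so the following is only a rough illustration of what "inner-modal refinement" and "inter-modal interaction" could mean in attention terms. It is a minimal sketch, not the authors' method: it assumes plain scaled dot-product attention with no learned projections (queries, keys, and values are all the raw tokens), and the names `masked_attention` and `duplex_attention_sketch` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(tokens, labels, same_modality):
    """Scaled dot-product attention restricted by modality.

    Assumption: Q = K = V = tokens (no learned projections).
    Query i attends only to keys j whose modality label matches
    labels[i] (same_modality=True) or differs from it (False).
    """
    dim = len(tokens[0])
    outputs = []
    for i, query in enumerate(tokens):
        allowed = [
            j for j in range(len(tokens))
            if (labels[j] == labels[i]) == same_modality
        ]
        if not allowed:
            # No key of the requested kind exists; emit a zero vector.
            outputs.append([0.0] * dim)
            continue
        scores = [
            sum(q * k for q, k in zip(query, tokens[j])) / math.sqrt(dim)
            for j in allowed
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * tokens[j][d] for w, j in zip(weights, allowed))
            for d in range(dim)
        ])
    return outputs


def duplex_attention_sketch(tokens, labels):
    """Return (inner-modal, inter-modal) attention outputs as two paths."""
    inner = masked_attention(tokens, labels, same_modality=True)
    inter = masked_attention(tokens, labels, same_modality=False)
    return inner, inter


# Toy input: two "image" tokens followed by two "text" tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
labels = ["image", "image", "text", "text"]
inner, inter = duplex_attention_sketch(tokens, labels)
```

Keeping the two paths separate makes it easy to inspect how much each token draws from its own modality versus the other one. How MODA actually combines, aligns, or corrects these paths is not recoverable from the truncated abstract.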
