RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer
Haotian Ni, Yake Wei, Hang Liu, Gong Chen, Chong Peng, Hao Lin, Di Hu

TL;DR
This paper identifies a bias in multimodal Transformers where the attention mechanism favors one modality, and introduces RollingQ, a simple method to restore dynamic cooperation among modalities, improving multimodal fusion performance.
Contribution
The paper reveals the diminishing dynamic adaptability in existing self-attention models and proposes RollingQ to effectively balance attention across modalities, enhancing multimodal Transformer capabilities.
Findings
RollingQ successfully restores the cooperation dynamics in multimodal Transformers.
Experiments show improved multimodal fusion performance with RollingQ.
The method is validated across various multimodal scenarios.
Abstract
Multimodal learning faces challenges in effectively fusing information from diverse modalities, especially when modality quality varies across samples. Dynamic fusion strategies, such as the attention mechanism in Transformers, aim to address this challenge by adaptively emphasizing modalities based on the characteristics of the input data. However, through extensive, carefully designed experiments, we surprisingly observe that the dynamic adaptability of widely used self-attention models diminishes: the model tends to prefer one modality regardless of data characteristics. This bias triggers a self-reinforcing cycle that progressively overemphasizes the favored modality, widening the distribution gap in attention keys across modalities and deactivating the attention mechanism's dynamic properties. To revive adaptability, we propose a simple yet effective method, Rolling Query (RollingQ), which…
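The modality preference described in the abstract can be illustrated with a small diagnostic. The sketch below is not the RollingQ method (the abstract is cut off before describing it) and is not taken from the paper; it only shows, under illustrative assumptions, how one might measure how much attention a single query assigns to each modality. The function name `attention_mass_per_modality`, the two-dimensional toy vectors, and the "audio"/"video" labels are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mass_per_modality(query, keys, modality_ids):
    """Sum scaled dot-product attention weights per modality for one query.

    query: list of floats, keys: list of key vectors of the same length,
    modality_ids: one label per key (hypothetical labels, for illustration).
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    mass = {}
    for w, mod in zip(weights, modality_ids):
        mass[mod] = mass.get(mod, 0.0) + w
    return mass


# Toy setup (assumed): the keys of the two modalities sit in clearly
# separated regions, and the query is aligned with the "audio" keys.
query = [1.0, 0.0]
keys = [[3.0, 0.1], [3.2, -0.2], [-3.0, 0.3], [-2.8, 0.0]]
modality_ids = ["audio", "audio", "video", "video"]

mass = attention_mass_per_modality(query, keys, modality_ids)
print(mass)
```

In this toy case almost all of the attention mass falls on the "audio" keys, whatever the individual tokens contain. This mirrors the situation the abstract describes, where a gap between the key distributions of the modalities leaves little room for the attention weights to adapt to the input.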
Taxonomy
Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation
