Cross-Modal Binary Attention: An Energy-Efficient Fusion Framework for Audio-Visual Learning
Mohamed Saleh, Zahra Ahmadi

TL;DR
This paper introduces CMQKA, a scalable, energy-efficient cross-modal fusion mechanism for audio-visual learning that enables hierarchical feature integration and achieves state-of-the-art performance.
Contribution
The paper presents CMQKA, a novel linear-complexity attention mechanism, and SNNergy, a hierarchical, energy-efficient fusion framework utilizing event-driven binary operations.
Findings
Achieves linear O(N) complexity in cross-modal attention.
Outperforms existing methods on CREMA-D, AVE, and UrbanSound8K-AV benchmarks.
Maintains high fusion effectiveness with significant energy savings.
Abstract
Effective multimodal fusion requires mechanisms that can capture complex cross-modal dependencies while remaining computationally scalable for real-world deployment. Existing audio-visual fusion approaches face a fundamental trade-off: attention-based methods effectively model cross-modal relationships but incur quadratic computational complexity that prevents hierarchical, multi-scale architectures, while efficient fusion strategies rely on simplistic concatenation that fails to extract complementary cross-modal information. We introduce CMQKA, a novel cross-modal fusion mechanism that achieves linear O(N) complexity through efficient binary operations, enabling scalable hierarchical fusion previously infeasible with conventional attention. CMQKA employs bidirectional cross-modal Query-Key attention to extract complementary spatiotemporal features and uses learnable residual fusion to…
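The abstract does not spell out how CMQKA reaches linear complexity, so the following is only a minimal sketch of one way a binary Query-Key attention can avoid the N × N score matrix. It assumes binary (spike) feature maps of shape (N, D) with the same token count N in both modalities. The function names, the firing threshold, and the scalar `alpha` (standing in for a learnable residual weight) are all hypothetical and are not taken from the paper.

```python
import numpy as np


def heaviside(x, threshold=1.0):
    """Binary firing function: 1 where x >= threshold, else 0."""
    return (x >= threshold).astype(np.float32)


def qk_token_mask_attention(q_spikes, k_spikes, threshold=1.0):
    """Gate the key tokens of one modality with a binary mask derived
    from the query tokens of the other modality.

    q_spikes, k_spikes: binary arrays of shape (N, D).
    No (N, N) matrix is formed, so the cost is O(N * D), linear in N.
    """
    # Sum each query token's spikes over channels: shape (N, 1).
    token_score = q_spikes.sum(axis=1, keepdims=True)
    # Threshold to a binary per-token mask: shape (N, 1).
    token_mask = heaviside(token_score, threshold)
    # Broadcast the mask over channels; keys stay binary.
    return token_mask * k_spikes


def bidirectional_fusion(audio, visual, alpha=0.5, threshold=1.0):
    """Apply the gate in both directions, then add a weighted residual.

    alpha is a fixed scalar here; in a trained model it would be a
    learnable parameter (an assumption of this sketch).
    """
    a2v = qk_token_mask_attention(audio, visual, threshold)  # audio gates visual
    v2a = qk_token_mask_attention(visual, audio, threshold)  # visual gates audio
    fused_audio = audio + alpha * v2a
    fused_visual = visual + alpha * a2v
    return fused_audio, fused_visual
```

Because the mask is computed per token and applied elementwise, doubling N doubles the work, whereas dot-product attention would quadruple it. Whether CMQKA uses this exact masking scheme is not stated in the text above.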
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Multisensory perception and integration
