Cross-Modal Binary Attention: An Energy-Efficient Fusion Framework for Audio-Visual Learning

Mohamed Saleh; Zahra Ahmadi

arXiv:2602.00701 · cs.MM · February 3, 2026

Open Access

TL;DR

This paper introduces CMQKA, a cross-modal attention mechanism for audio-visual learning that achieves linear O(N) complexity through binary operations, making hierarchical, multi-scale fusion practical and outperforming existing methods on CREMA-D, AVE, and UrbanSound8K-AV.

Contribution

The paper presents CMQKA, a novel linear-complexity attention mechanism, and SNNergy, a hierarchical, energy-efficient fusion framework utilizing event-driven binary operations.

Findings

01. Achieves linear O(N) complexity in cross-modal attention.

02. Outperforms existing methods on CREMA-D, AVE, and UrbanSound8K-AV benchmarks.

03. Maintains high fusion effectiveness with significant energy savings.
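
To put the O(N) figure in context, the comparison below is a rough sketch that ignores constants and projection costs. The symbols N_a and N_v (audio and visual token counts) and d (feature width) are introduced here for illustration and do not come from the source, which states only that CMQKA is linear, not its exact operation count. Conventional dot-product cross-attention builds a full score matrix between the two token sequences, so its cost grows with the product of their lengths.

```latex
\begin{align*}
\text{conventional cross-attention:}\quad & \mathcal{O}(N_a N_v d) \;=\; \mathcal{O}(N^2 d) \quad \text{when } N_a = N_v = N,\\
\text{linear-complexity attention:}\quad & \mathcal{O}(N d).
\end{align*}
```

Under these assumptions, doubling N roughly quadruples the first cost but only doubles the second, which is why a linear mechanism can be applied at several scales of a hierarchy.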

Abstract

Effective multimodal fusion requires mechanisms that can capture complex cross-modal dependencies while remaining computationally scalable for real-world deployment. Existing audio-visual fusion approaches face a fundamental trade-off: attention-based methods effectively model cross-modal relationships but incur quadratic computational complexity that prevents hierarchical, multi-scale architectures, while efficient fusion strategies rely on simplistic concatenation that fails to extract complementary cross-modal information. We introduce CMQKA, a novel cross-modal fusion mechanism that achieves linear O(N) complexity through efficient binary operations, enabling scalable hierarchical fusion previously infeasible with conventional attention. CMQKA employs bidirectional cross-modal Query-Key attention to extract complementary spatiotemporal features and uses learnable residual fusion to…
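
The abstract names three ingredients: binary operations, bidirectional cross-modal Query-Key attention, and learnable residual fusion. The sketch below shows one way such pieces could fit together with cost linear in the token count. It is not the authors' implementation: the token-gating rule, the shared token count N for both modalities, the function names (`spike`, `qk_token_mask`, `bidirectional_fusion`), the weight names, and the fixed scalar `alpha` standing in for a learnable residual weight are all assumptions made for illustration.

```python
import numpy as np

def spike(x, threshold=1.0):
    """Heaviside step: 1.0 where the input reaches the threshold, else 0.0."""
    return (x >= threshold).astype(np.float32)

def qk_token_mask(query_feats, key_feats, w_q, w_k, threshold=1.0):
    """One direction of a hypothetical binary Query-Key attention.

    query_feats, key_feats: (N, D) arrays with the same token count N
    (an assumption of this sketch). Returns an (N, D) binary array.
    No N x N score matrix is formed, so the cost is linear in N.
    """
    q = spike(query_feats @ w_q, threshold)   # (N, D) binary queries
    k = spike(key_feats @ w_k, threshold)     # (N, D) binary keys
    # A token's gate is 1 if at least `threshold` query channels fired.
    token_gate = spike(q.sum(axis=1, keepdims=True), threshold)  # (N, 1)
    return token_gate * k                     # keep or drop each key token

def bidirectional_fusion(audio, visual, params, alpha=0.5):
    """Apply the gate in both directions, then add a weighted residual.

    `alpha` is a fixed stand-in for a learnable residual weight.
    """
    a_from_v = qk_token_mask(audio, visual, params["wq_a"], params["wk_v"])
    v_from_a = qk_token_mask(visual, audio, params["wq_v"], params["wk_a"])
    fused_audio = audio + alpha * a_from_v
    fused_visual = visual + alpha * v_from_a
    return fused_audio, fused_visual
```

In this sketch the queries of one modality decide, token by token, which binary keys of the other modality pass through, so the work per token does not depend on how many tokens there are. A real spiking network would also carry a time dimension and membrane dynamics, which are omitted here.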

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Multisensory perception and integration