GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection
Yu Wang, Juhyung Ha, Frangil M. Ramirez, Yuchen Wang, David J. Crandall

TL;DR
GateFusion introduces a hierarchical gated cross-modal fusion architecture that improves active speaker detection by adaptively integrating visual and audio features, achieving state-of-the-art results on the Ego4D-ASD, UniTalk, and WASD benchmarks.
Contribution
The paper proposes a novel Hierarchical Gated Fusion Decoder with auxiliary objectives, improving cross-modal interaction and robustness in active speaker detection.
Findings
Achieves new state-of-the-art performance on Ego4D-ASD, UniTalk, and WASD benchmarks.
Demonstrates strong generalization in out-of-domain experiments.
Confirms through ablation studies that each component contributes to performance.
Abstract
Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and…
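The gating idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the gate formula, so the sigmoid gate over concatenated features, the residual update, the linear projection, and all names (`gated_inject`, `w_gate`, `b_gate`, `w_proj`) are assumptions made for illustration. The Transformer layers that would sit between fusion points are omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_inject(target, context, w_gate, b_gate, w_proj):
    """Inject `context` features into `target` through a learnable gate.

    target, context: (T, D) frame-aligned features from two modalities.
    w_gate (2D, D), b_gate (D,), w_proj (D, D): hypothetical parameters.
    """
    # The gate is conditioned on BOTH modalities (assumed form: sigmoid of a
    # linear map over the concatenated features), giving values in (0, 1).
    both = np.concatenate([target, context], axis=-1)
    gate = sigmoid(both @ w_gate + b_gate)
    # Residual update: the gate decides, per frame and per channel, how much
    # of the projected context is added to the target stream.
    return target + gate * (context @ w_proj)


def init_params(rng, dim):
    # Small random weights standing in for trained parameters.
    return dict(
        w_gate=rng.normal(scale=0.1, size=(2 * dim, dim)),
        b_gate=np.zeros(dim),
        w_proj=rng.normal(scale=0.1, size=(dim, dim)),
    )


rng = np.random.default_rng(0)
T, D = 4, 8  # frames, feature width (arbitrary toy sizes)
visual = rng.normal(size=(T, D))
audio = rng.normal(size=(T, D))

# "Multi-depth" fusion: repeat the injection at several depths, each with its
# own parameters and one parameter set per direction.
num_depths = 3
for _ in range(num_depths):
    a2v = init_params(rng, D)  # audio -> visual
    v2a = init_params(rng, D)  # visual -> audio
    visual, audio = (
        gated_inject(visual, audio, **a2v),
        gated_inject(audio, visual, **v2a),
    )

print(visual.shape, audio.shape)
```

With a strongly negative gate bias the update leaves the target stream unchanged, and with a strongly positive one it adds the full projected context, which is the sense in which a gate of this kind can make fusion adaptive.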
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
