GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection

Yu Wang; Juhyung Ha; Frangil M. Ramirez; Yuchen Wang; David J. Crandall

arXiv:2512.15707·cs.CV·December 18, 2025

TL;DR

GateFusion introduces a hierarchical gated cross-modal fusion architecture that enhances active speaker detection by adaptively integrating visual and audio features, achieving state-of-the-art results across multiple benchmarks.

Contribution

The paper proposes a Hierarchical Gated Fusion Decoder (HiGate) together with auxiliary training objectives, aiming to improve cross-modal interaction and robustness in active speaker detection.

Findings

01 Achieves new state-of-the-art performance on the Ego4D-ASD, UniTalk, and WASD benchmarks.

02 Demonstrates strong generalization in out-of-domain experiments.

03 Includes ablation studies that confirm the effectiveness of each component.

Abstract

Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and…
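
The gated injection idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give HiGate's equations, so the single-head cross-attention, the sigmoid gate computed from the concatenated target and context features, the residual update, and all names (`cross_attend`, `gated_inject`, `hierarchical_fuse`, `w_gate`, `b_gate`) are assumptions chosen for illustration. The Transformer layers that would sit between fusion points are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, context):
    """Single-head scaled dot-product attention (assumed form):
    each query frame gathers a weighted mix of context frames."""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)        # (T_q, T_c)
    return softmax(scores, axis=-1) @ context      # (T_q, D)

def gated_inject(target, source, w_gate, b_gate):
    """Inject context from `source` into `target` through a gate that is
    conditioned on both modalities (hypothetical formulation)."""
    ctx = cross_attend(target, source)                 # context from the other modality
    gate_in = np.concatenate([target, ctx], axis=-1)   # bimodal conditioning, (T, 2D)
    gate = sigmoid(gate_in @ w_gate + b_gate)          # per-frame, per-channel gate in (0, 1)
    return target + gate * ctx, gate                   # residual, gated update

def hierarchical_fuse(visual, audio, gate_params):
    """Apply bidirectional gated injection at several depths.
    `gate_params` holds one (w_v, b_v, w_a, b_a) tuple per fusion depth."""
    for w_v, b_v, w_a, b_a in gate_params:
        new_visual, _ = gated_inject(visual, audio, w_v, b_v)
        new_audio, _ = gated_inject(audio, visual, w_a, b_a)
        visual, audio = new_visual, new_audio
    return visual, audio

# Toy inputs: T frames of D-dimensional features per modality (random stand-ins
# for the outputs of pretrained unimodal encoders).
rng = np.random.default_rng(0)
T, D = 8, 16
visual = rng.standard_normal((T, D))
audio = rng.standard_normal((T, D))

w_gate = rng.standard_normal((2 * D, D)) * 0.1
b_gate = np.zeros(D)
fused, gate = gated_inject(visual, audio, w_gate, b_gate)

# Three fusion depths, each with its own gate parameters for both directions.
gate_params = [
    (rng.standard_normal((2 * D, D)) * 0.1, np.zeros(D),
     rng.standard_normal((2 * D, D)) * 0.1, np.zeros(D))
    for _ in range(3)
]
visual_out, audio_out = hierarchical_fuse(visual, audio, gate_params)
```

In this sketch, a strongly negative gate bias drives the gate toward zero, so the target features pass through almost unchanged. That is one way a gate can suppress an unreliable modality, under the assumptions stated above.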

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing