LoCoNet: Long-Short Context Network for Active Speaker Detection
Xizi Wang, Feng Cheng, Gedas Bertasius, David Crandall

TL;DR
LoCoNet is a neural network architecture for active speaker detection that models both long-term intra-speaker context and short-term inter-speaker context, and it reports state-of-the-art results across multiple datasets.
Contribution
The paper proposes LoCoNet, a simple yet effective model that combines self-attention and convolutional blocks to jointly model long-term and short-term speaker context for active speaker detection (ASD).
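To make the long/short split concrete, here is a minimal pure-Python sketch of the idea, not the authors' implementation. It assumes features are stored as feats[speaker][frame] = vector. The function names, the identity query/key/value projections, the uniform averaging kernel (a stand-in for learned convolution weights), and the residual connections are all illustrative assumptions.

```python
import math

def intra_speaker_attention(x):
    """Long-term context: self-attention over ALL frames of one speaker.
    x is a list of T frame vectors. Assumption: identity Q/K/V projections."""
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product score of this frame against every frame.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[j] for w, v in zip(weights, x))
                    for j in range(d)])
    return out

def inter_speaker_conv(feats, k=3):
    """Short-term context: mix all speakers within a k-frame window.
    Assumption: a uniform averaging kernel replaces learned conv weights;
    the window is clipped at the sequence edges instead of zero-padded."""
    S, T, D = len(feats), len(feats[0]), len(feats[0][0])
    half = k // 2
    out = []
    for _ in range(S):
        row = []
        for t in range(T):
            lo, hi = max(0, t - half), min(T, t + half + 1)
            n = S * (hi - lo)
            row.append([sum(feats[u][v][j]
                            for u in range(S) for v in range(lo, hi)) / n
                        for j in range(D)])
        out.append(row)
    return out

def _add(a, b):
    """Element-wise sum of two T x D nested lists."""
    return [[p + q for p, q in zip(u, v)] for u, v in zip(a, b)]

def long_short_block(feats, k=3):
    """One hypothetical block: per-speaker attention over time, then a
    local cross-speaker convolution, each with a residual connection."""
    after_long = [_add(x, intra_speaker_attention(x)) for x in feats]
    local = inter_speaker_conv(after_long, k)
    return [_add(a, b) for a, b in zip(after_long, local)]
```

The attention step sees every frame of a single speaker, while the convolution step sees every speaker but only a few neighbouring frames, which mirrors the long-term intra-speaker and short-term inter-speaker roles described above. A trained model would use learned projections and kernels in place of the fixed ones assumed here.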
Findings
Achieves state-of-the-art mAP scores on multiple datasets.
Outperforms previous methods in challenging multi-speaker scenarios.
Demonstrates significant improvements in detecting small or multiple active speakers.
Abstract
Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video. ASD reasons from audio and visual information from two contexts: long-term intra-speaker context and short-term inter-speaker context. Long-term intra-speaker context models the temporal dependencies of the same speaker, while short-term inter-speaker context models the interactions of speakers in the same scene. These two contexts are complementary to each other and can help infer the active speaker. Motivated by these observations, we propose LoCoNet, a simple yet effective Long-Short Context Network that models the long-term intra-speaker context and short-term inter-speaker context. We use self-attention to model long-term intra-speaker context due to its effectiveness in modeling long-range dependencies, and convolutional blocks that capture local patterns to model short-term inter-speaker…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
