LoCoNet: Long-Short Context Network for Active Speaker Detection

Xizi Wang; Feng Cheng; Gedas Bertasius; David Crandall

arXiv:2301.08237 · cs.CV · April 2, 2024 · 1 citation

TL;DR

LoCoNet introduces a novel neural network architecture that effectively models both long-term intra-speaker and short-term inter-speaker contexts for active speaker detection, achieving state-of-the-art results across multiple datasets.

Contribution

The paper proposes LoCoNet, a simple yet effective model combining self-attention and convolutional blocks to jointly model long- and short-term speaker contexts for improved ASD performance.

Findings

01 Achieves state-of-the-art mAP scores on multiple datasets.

02 Outperforms previous methods in challenging multi-speaker scenarios.

03 Demonstrates significant improvements in detecting small or multiple active speakers.

Abstract

Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video. ASD reasons from audio and visual information from two contexts: long-term intra-speaker context and short-term inter-speaker context. Long-term intra-speaker context models the temporal dependencies of the same speaker, while short-term inter-speaker context models the interactions of speakers in the same scene. These two contexts are complementary to each other and can help infer the active speaker. Motivated by these observations, we propose LoCoNet, a simple yet effective Long-Short Context Network that models the long-term intra-speaker context and short-term inter-speaker context. We use self-attention to model long-term intra-speaker context due to its effectiveness in modeling long-range dependencies, and convolutional blocks that capture local patterns to model short-term inter-speaker…
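
The two contexts described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names, the identity query/key/value projections, the uniform averaging kernel standing in for learned convolution weights, the residual connection, and the order of the two steps are all assumptions made for illustration. Features are assumed to be laid out as [speakers][frames][dims].

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def intra_speaker_attention(track):
    """Long-term intra-speaker context (hypothetical helper).

    track: list of T feature vectors for ONE speaker.
    Every frame attends to all frames of the same speaker.
    Identity Q/K/V projections are an assumption made for brevity.
    """
    d = len(track[0])
    out = []
    for q in track:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in track]
        w = softmax(scores)
        out.append([sum(w[t] * track[t][i] for t in range(len(track)))
                    for i in range(d)])
    return out


def inter_speaker_conv(feats, k_t=3):
    """Short-term inter-speaker context (hypothetical helper).

    feats: [S][T][D]. Each frame is mixed with ALL speakers inside a
    small temporal window of k_t frames. A uniform average stands in
    for learned convolution weights (assumption), added residually.
    """
    S, T, D = len(feats), len(feats[0]), len(feats[0][0])
    half = k_t // 2
    mixed = []
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        window = [feats[s][u] for s in range(S) for u in range(lo, hi)]
        mixed.append([sum(v[i] for v in window) / len(window) for i in range(D)])
    return [[[feats[s][t][i] + mixed[t][i] for i in range(D)]
             for t in range(T)] for s in range(S)]


def long_short_block(feats):
    """Apply the long-term step per speaker, then the short-term step.

    The ordering is an assumption, not taken from the paper.
    """
    attended = [intra_speaker_attention(track) for track in feats]
    return inter_speaker_conv(attended)
```

The split mirrors the abstract's reasoning: attention spans a speaker's whole track because long-range dependencies matter there, while the local window restricts cross-speaker mixing to frames that are close in time.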

Code & Models

Repositories

Models

Superxixixi/LoCoNet_ASD (Hugging Face model · 4 downloads · 2 likes)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing