BIAS: A Body-based Interpretable Active Speaker Approach
Tiago Roxo, Joana C. Costa, Pedro R. M. Inácio, Hugo Proença

TL;DR
BIAS introduces a novel, interpretable model that combines audio, face, and body cues to improve active speaker detection in challenging real-world scenarios, outperforming existing methods.
Contribution
The paper presents BIAS, the first model to integrate body-based features with audio and face data for ASD, enhancing performance and interpretability in wild conditions.
Findings
BIAS achieves state-of-the-art results on challenging datasets such as WASD.
The model provides interpretability through attention heatmaps and feature importance.
BIAS performs competitively on standard datasets, emphasizing the importance of body cues.
Abstract
State-of-the-art Active Speaker Detection (ASD) approaches heavily rely on audio and facial features to perform, which is not a sustainable approach in wild scenarios. Although these methods achieve good results in the standard AVA-ActiveSpeaker set, a recent wilder ASD dataset (WASD) showed the limitations of such models and raised the need for new approaches. As such, we propose BIAS, a model that, for the first time, combines audio, face, and body information, to accurately predict active speakers in varying/challenging conditions. Additionally, we design BIAS to provide interpretability by proposing a novel use for Squeeze-and-Excitation blocks, namely in attention heatmaps creation and feature importance assessment. For a full interpretability setup, we annotate an ASD-related actions dataset (ASD-Text) to finetune a ViT-GPT2 for text scene description to complement BIAS…
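The abstract mentions using Squeeze-and-Excitation (SE) blocks for feature importance assessment. As a rough illustration of why SE blocks lend themselves to this, the following is a minimal sketch of a generic SE block in plain Python. It is not the BIAS implementation: the weights are random, and the names `se_block`, `make_demo`, and the `reduction` parameter are hypothetical choices for this example. The point is only that the per-channel gates an SE block computes lie in (0, 1) and can be read as relative channel importance.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vector):
    # Plain matrix-vector product: one output per row of `weights`.
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def se_block(feature_maps, w1, w2):
    """Generic Squeeze-and-Excitation over C channels of H x W values.

    Returns the rescaled feature maps and the per-channel gates.
    """
    # Squeeze: global average pooling, one scalar per channel.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_maps
    ]
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, h) for h in matvec(w1, squeezed)]
    gates = [sigmoid(g) for g in matvec(w2, hidden)]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[g * v for v in row] for row in ch]
        for ch, g in zip(feature_maps, gates)
    ]
    return scaled, gates


def make_demo(channels=4, reduction=2, size=3, seed=0):
    # Hypothetical random inputs and weights, for illustration only.
    rng = random.Random(seed)
    maps = [
        [[rng.random() for _ in range(size)] for _ in range(size)]
        for _ in range(channels)
    ]
    hidden = channels // reduction
    w1 = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(channels)]
    return maps, w1, w2


maps, w1, w2 = make_demo()
scaled, gates = se_block(maps, w1, w2)
# Channels ordered from highest to lowest gate value.
ranking = sorted(range(len(gates)), key=lambda c: gates[c], reverse=True)
print("gates:", [round(g, 3) for g in gates])
print("channel ranking:", ranking)
```

In a trained model the gates would depend on the input, so the ranking could be inspected per sample. How BIAS turns these values into heatmaps or importance scores is described in the paper itself, not here.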
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
Methods: Softmax · Attention Is All You Need
