BIAS: A Body-based Interpretable Active Speaker Approach

Tiago Roxo; Joana C. Costa; Pedro R. M. Inácio; Hugo Proença

arXiv:2412.05150 · cs.CV · December 9, 2024

TL;DR

BIAS is an interpretable active speaker detection (ASD) model that combines audio, face, and body cues, reaching state-of-the-art results on challenging real-world data such as WASD.

Contribution

The paper presents BIAS, the first model to combine body-based features with audio and face data for active speaker detection (ASD), improving both performance and interpretability in in-the-wild conditions.

Findings

01

BIAS achieves state-of-the-art results on challenging datasets such as WASD.

02

The model provides interpretability through attention heatmaps and feature importance.

03

BIAS performs competitively on standard datasets, emphasizing the importance of body cues.

Abstract

State-of-the-art Active Speaker Detection (ASD) approaches heavily rely on audio and facial features to perform, which is not a sustainable approach in wild scenarios. Although these methods achieve good results in the standard AVA-ActiveSpeaker set, a recent wilder ASD dataset (WASD) showed the limitations of such models and raised the need for new approaches. As such, we propose BIAS, a model that, for the first time, combines audio, face, and body information, to accurately predict active speakers in varying/challenging conditions. Additionally, we design BIAS to provide interpretability by proposing a novel use for Squeeze-and-Excitation blocks, namely in attention heatmaps creation and feature importance assessment. For a full interpretability setup, we annotate an ASD-related actions dataset (ASD-Text) to finetune a ViT-GPT2 for text scene description to complement BIAS…
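The abstract names Squeeze-and-Excitation (SE) blocks as the basis for feature importance assessment. The following is a minimal sketch of a generic SE gate in plain Python, not the BIAS implementation: the function name `se_gate`, the weight matrices `w1` and `w2`, the channel count, and the reduction ratio are all illustrative assumptions. It shows the standard squeeze (per-channel global average), excitation (bottleneck layer with ReLU, then a sigmoid layer), and rescaling steps, and returns the per-channel gate values, which is the kind of quantity one could read as channel importance.

```python
import math
import random


def se_gate(feature_maps, w1, w2):
    """Generic Squeeze-and-Excitation gate (illustrative, not the BIAS code).

    feature_maps: list of C channels, each a flat list of activations.
    w1: (C // r) x C bottleneck weights; w2: C x (C // r) expansion weights.
    Returns (rescaled feature maps, per-channel gate values in (0, 1)).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation, step 1: bottleneck linear layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: expansion linear layer followed by a sigmoid.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every activation in a channel by that channel's gate.
    scaled = [[gates[c] * v for v in ch] for c, ch in enumerate(feature_maps)]
    return scaled, gates


# Hypothetical toy input: 4 channels of 6 activations, reduction ratio 2.
random.seed(0)
channels = [[random.random() for _ in range(6)] for _ in range(4)]
w1 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
w2 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]

scaled, gates = se_gate(channels, w1, w2)
# Rank channels by gate value, highest first, as a rough importance ordering.
ranking = sorted(range(len(gates)), key=lambda c: gates[c], reverse=True)
print(gates, ranking)
```

In a trained network the gates depend on learned weights and on the input, so any importance ranking read from them is specific to each sample; the random weights here only demonstrate the data flow.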

Code & Models

Repositories

tiago-roxo/bias
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing

Methods: Softmax · Attention Is All You Need