Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

Ruijie Tao; Zexu Pan; Rohan Kumar Das; Xinyuan Qian; Mike Zheng Shou; Haizhou Li

arXiv:2107.06592·eess.AS·July 27, 2021·24 cites

Open Access 4 Repos

TL;DR

This paper introduces TalkNet, an active speaker detection framework that combines short-term and long-term audio-visual features and reports improvements of 3.5% on AVA-ActiveSpeaker and 2.2% on Columbia ASD over prior state-of-the-art systems.

Contribution

The paper proposes a new model, TalkNet, which incorporates long-term temporal features and attention mechanisms for enhanced active speaker detection.

Findings

01  Achieves a 3.5% improvement on the AVA-ActiveSpeaker dataset

02  Achieves a 2.2% improvement on the Columbia ASD dataset

03  Demonstrates the effectiveness of long-term feature integration

Abstract

Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike prior work, where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. Experiments demonstrate that TalkNet achieves 3.5% and 2.2% improvement over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and Columbia ASD dataset, respectively. Code has…
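
The attention stages named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the helper names (`softmax`, `attention`, `audio_visual_backend`), the choice that each modality queries the other, the per-frame concatenation, and the absence of learned projections, temporal encoders, and a classifier are all assumptions made here for illustration. Inputs stand in for per-frame audio and visual embeddings that the temporal encoders would produce.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of per-frame vectors."""
    scale = math.sqrt(len(keys[0]))
    dim_out = len(values[0])
    attended = []
    for q in queries:
        # Similarity of this query frame to every key frame.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_out)]
        )
    return attended


def audio_visual_backend(audio, video):
    """Hypothetical helper: cross-attention between modalities, then self-attention."""
    # Cross-attention (assumed form): each modality supplies queries for the other.
    audio_ctx = attention(video, audio, audio)
    video_ctx = attention(audio, video, video)
    # Assumed fusion: concatenate the two attended features frame by frame.
    fused = [a + v for a, v in zip(audio_ctx, video_ctx)]
    # Self-attention lets every frame draw on evidence from the whole sequence.
    return attention(fused, fused, fused)


random.seed(0)
frames, dim = 5, 4  # illustrative sizes, not values from the paper
audio = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(frames)]
video = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(frames)]
out = audio_visual_backend(audio, video)
print(len(out), len(out[0]))  # prints "5 8"
```

Each output row is a per-frame feature of width `2 * dim`; in a full system a per-frame classifier would map such features to speaking / not-speaking scores, which this sketch leaves out.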

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis