Naver at ActivityNet Challenge 2019 -- Task B Active Speaker Detection (AVA)

Joon Son Chung

arXiv:1906.10555 · cs.SD · June 26, 2019 · 31 cites

TL;DR

This report uses a 3D CNN front-end with an ensemble of temporal convolution and LSTM classifiers to predict whether a visible person is speaking, and reports significant improvements over the baseline on the AVA-ActiveSpeaker dataset.

Contribution

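The report describes the Naver submission to Task B (Active Speaker Detection) of the ActivityNet Challenge at CVPR 2019: a 3D CNN front-end followed by an ensemble of temporal convolution and LSTM classifiers, which outperforms the baseline on AVA-ActiveSpeaker.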

Findings

01. Significant accuracy improvements over baseline methods

02. Effective use of 3D CNNs for temporal feature extraction

03. Ensemble of temporal convolution and LSTM classifiers enhances performance

Abstract

This report describes our submission to the ActivityNet Challenge at CVPR 2019. We use a 3D convolutional neural network (CNN) based front-end and an ensemble of temporal convolution and LSTM classifiers to predict whether a visible person is speaking or not. Our results show significant improvements over the baseline on the AVA-ActiveSpeaker dataset.
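
The abstract does not say how the temporal convolution and LSTM classifiers are combined. As a rough illustration only, the sketch below assumes each classifier emits a per-frame speaking probability for one face track and that the ensemble is a weighted average followed by a threshold. The function names, the `weight` and `threshold` parameters, and the example scores are all hypothetical and are not taken from the report.

```python
from typing import List, Sequence


def ensemble_speaking_scores(
    tconv_probs: Sequence[float],
    lstm_probs: Sequence[float],
    weight: float = 0.5,
) -> List[float]:
    """Blend per-frame speaking probabilities from two classifiers.

    Assumption: a simple weighted average; the report may combine
    its classifiers differently.
    """
    if len(tconv_probs) != len(lstm_probs):
        raise ValueError("both classifiers must score the same frames")
    return [
        weight * a + (1.0 - weight) * b
        for a, b in zip(tconv_probs, lstm_probs)
    ]


def is_speaking(probs: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Turn blended probabilities into per-frame speaking decisions."""
    return [p >= threshold for p in probs]


# Hypothetical scores for a 4-frame face track (not real model output).
tconv = [0.9, 0.8, 0.3, 0.1]
lstm = [0.7, 0.6, 0.5, 0.1]

blended = ensemble_speaking_scores(tconv, lstm)
labels = is_speaking(blended)
```

In this toy track the two classifiers disagree on the third frame; averaging pulls the blended score below the threshold, so the frame is labelled as not speaking.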

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Sigmoid Activation · Tanh Activation · Convolution · Long Short-Term Memory