Naver at ActivityNet Challenge 2019 -- Task B Active Speaker Detection (AVA)
Joon Son Chung

TL;DR
This report describes a submission to the ActivityNet Challenge 2019 active speaker detection task. It uses a 3D CNN front-end followed by an ensemble of temporal convolution and LSTM classifiers, and reports significant improvements over the baseline on the AVA-ActiveSpeaker dataset.
Contribution
The report combines a 3D CNN front-end with an ensemble of temporal convolution and LSTM classifiers to predict whether a visible person is speaking, and improves on the challenge baseline.
Findings
Significant improvements over the baseline on the AVA-ActiveSpeaker dataset
Effective use of 3D CNNs for temporal feature extraction
Ensemble of temporal convolution and LSTM classifiers enhances performance
Abstract
This report describes our submission to the ActivityNet Challenge at CVPR 2019. We use a 3D convolutional neural network (CNN) based front-end and an ensemble of temporal convolution and LSTM classifiers to predict whether a visible person is speaking or not. Our results show significant improvements over the baseline on the AVA-ActiveSpeaker dataset.
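The ensemble step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the temporal convolution and LSTM outputs are combined, so the plain averaging and the 0.5 threshold below are assumptions, and the function names and score values are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def ensemble_scores(per_model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame speaking probabilities across classifiers.

    Assumption: simple unweighted averaging; the report's actual fusion
    rule is not stated in the abstract.
    """
    if not per_model_scores:
        raise ValueError("need scores from at least one classifier")
    n_frames = len(per_model_scores[0])
    if any(len(scores) != n_frames for scores in per_model_scores):
        raise ValueError("all classifiers must score the same number of frames")
    n_models = len(per_model_scores)
    # zip(*...) groups the scores that different classifiers gave to the same frame.
    return [sum(frame) / n_models for frame in zip(*per_model_scores)]


def label_frames(scores: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Mark a frame as 'speaking' when its fused score reaches the threshold."""
    return [score >= threshold for score in scores]


# Hypothetical per-frame probabilities for one face track (not real model output).
tconv_scores = [0.75, 0.5, 0.25, 0.0]   # temporal convolution classifier
lstm_scores = [0.25, 1.0, 0.25, 0.5]    # LSTM classifier

fused = ensemble_scores([tconv_scores, lstm_scores])
labels = label_frames(fused)
```

Averaging is only one possible choice; a weighted sum or a learned fusion layer would fit the same interface.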
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Sigmoid Activation · Tanh Activation · Convolution · Long Short-Term Memory
