EACELEB: An East Asian Language Speaking Celebrity Dataset for Speaker Recognition
Desmond Caulley, Yufeng Yang, David Anderson

TL;DR
This paper introduces EACELEB, a new East Asian celebrity speaker dataset created using an efficient audio-visual data collection pipeline from YouTube, achieving competitive speaker recognition accuracy.
Contribution
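It presents a fast data acquisition method for collecting speech from East Asian celebrities: once a face has been detected, it is tracked across subsequent frames instead of being re-detected in every frame. The paper also demonstrates the resulting data's effectiveness for speaker recognition tasks.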
Findings
Achieved approximately 4% equal error rate after diarization and fine-tuning.
Developed a scalable pipeline for collecting celebrity audio data from YouTube.
Showed equal error rates comparable to VoxCeleb on East Asian celebrity data.
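The equal error rate (EER) cited in the findings is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER can be estimated from verification trial scores. The function name, the threshold sweep, and the toy scores are illustrative assumptions; the summary does not describe the scoring code the paper used.

```python
from typing import Sequence


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> float:
    """Estimate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    Hypothetical helper for illustration, not the paper's code.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap = None
    best_eer = 0.0
    for t in thresholds:
        # False acceptance: impostor (non-target) trials that are accepted.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: genuine (target) trials that are rejected.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2
    return best_eer


# Toy scores (made up): one genuine trial scores below one impostor trial.
toy_eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

With these toy scores, one of four genuine trials and one of four impostor trials fall on the wrong side of the best threshold, so both error rates meet at 25%. Production toolkits usually interpolate between thresholds rather than picking the closest observed one.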
Abstract
Large datasets are very useful for training speaker recognition systems, and various research groups have constructed several over the years. VoxCeleb is a large dataset for speaker recognition that is extracted from YouTube videos. This paper presents an audio-visual method for acquiring audio data from YouTube given the speaker's name as input. The system follows a pipeline similar to that of the VoxCeleb data acquisition method. However, our work focuses on fast data acquisition by using face-tracking in subsequent frames once a face has been detected; this is preferable to running face detection on every frame, given the computational cost of detection. We show that applying audio diarization to our data after acquiring it can yield equal error rates comparable to VoxCeleb. A secondary set of experiments showed that we could further decrease the error rate by fine-tuning a pre-trained x-vector…
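The detect-once-then-track idea in the abstract can be sketched as a small control loop. This is an illustration under stated assumptions, not the authors' pipeline: `detect_then_track`, `toy_detect`, and `toy_track` are hypothetical names, the detector and tracker are passed in as plain callables, and falling back to detection when a track is lost is an assumed policy that the abstract does not describe.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def detect_then_track(
    frames: Iterable[object],
    detect: Callable[[object], Optional[Box]],
    track: Callable[[object, Box], Optional[Box]],
) -> List[Optional[Box]]:
    """Return one face box (or None) per frame.

    The costly detector runs only while no face is being followed;
    otherwise the cheaper tracker updates the previous box.
    """
    boxes: List[Optional[Box]] = []
    current: Optional[Box] = None
    for frame in frames:
        if current is not None:
            current = track(frame, current)  # cheap update
        if current is None:
            current = detect(frame)  # costly search (assumed fallback)
        boxes.append(current)
    return boxes


# Toy stand-ins (assumptions): frames are integers, the face appears at
# frame 1, and the tracker loses it at frame 4.
detector_frames: List[int] = []


def toy_detect(frame: int) -> Optional[Box]:
    detector_frames.append(frame)
    return (0, 0, 10, 10) if frame >= 1 else None


def toy_track(frame: int, box: Box) -> Optional[Box]:
    if frame == 4:
        return None
    x, y, w, h = box
    return (x + 1, y, w, h)


boxes = detect_then_track(range(6), toy_detect, toy_track)
```

In this toy run the detector is invoked on three of the six frames (0, 1, and 4), while the tracker handles the rest. A real system would plug an actual face detector and tracker into the two callables.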
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Face recognition and analysis
