VoxVietnam: a Large-Scale Multi-Genre Dataset for Vietnamese Speaker Recognition

Hoang Long Vu; Phuong Tuan Dat; Pham Thao Nhi; Nguyen Song Hao; Nguyen Thi Thu Trang

arXiv:2501.00328 · cs.SD · January 3, 2025

TL;DR

VoxVietnam is a large-scale, multi-genre dataset for Vietnamese speaker recognition that highlights the challenges of genre variability and improves model performance when used for training.

Contribution

This paper introduces VoxVietnam, the first large-scale multi-genre dataset for Vietnamese speaker recognition, built from public sources with an automated pipeline.

Findings

01. Multi-genre variability challenges existing models

02. Incorporating VoxVietnam improves recognition performance

03. Large-scale dataset enables more robust speaker recognition

Abstract

Recent research in speaker recognition aims to address vulnerabilities due to variations between enrolment and test utterances, particularly the multi-genre phenomenon, in which the utterances belong to different speech genres. Previous resources for Vietnamese speaker recognition are either limited in size or do not focus on genre diversity, leaving multi-genre effects unexplored. This paper introduces VoxVietnam, the first multi-genre dataset for Vietnamese speaker recognition, with over 187,000 utterances from 1,406 speakers, together with an automated pipeline for constructing such a dataset at large scale from public sources. Our experiments show the challenges that the multi-genre phenomenon poses to models trained on a single-genre dataset, and demonstrate a significant increase in performance when VoxVietnam is incorporated into the training process. Our experiments are conducted to study…
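To make the enrolment/test genre mismatch concrete, the following is a minimal sketch of how verification trials might be split into same-genre and cross-genre conditions and scored separately. It is not the authors' evaluation code: the cosine scoring, the equal error rate (EER) metric, the genre names, the toy embeddings, and all function names are assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical trial layout (an assumption, not the dataset's format):
# (enrol_embedding, test_embedding, same_speaker, enrol_genre, test_genre)
Trial = Tuple[Sequence[float], Sequence[float], bool, str, str]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def equal_error_rate(scores: List[float], labels: List[int]) -> float:
    """Approximate EER by sweeping a threshold over the observed scores.

    labels: 1 for a same-speaker (target) trial, 0 otherwise.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    best_gap, best_eer = 2.0, 1.0
    for thr in sorted(set(scores)):
        # False accepts: non-target trials scoring at or above the threshold.
        far = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= thr) / n_nontarget
        # False rejects: target trials scoring below the threshold.
        frr = sum(1 for s, l in zip(scores, labels) if l == 1 and s < thr) / n_target
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer


def eer_by_condition(trials: List[Trial]) -> Dict[str, float]:
    """Score trials and report EER separately for same- and cross-genre pairs."""
    buckets: Dict[str, Tuple[List[float], List[int]]] = {
        "same_genre": ([], []),
        "cross_genre": ([], []),
    }
    for enrol, test, same_speaker, enrol_genre, test_genre in trials:
        key = "same_genre" if enrol_genre == test_genre else "cross_genre"
        buckets[key][0].append(cosine(enrol, test))
        buckets[key][1].append(int(same_speaker))
    # EER is only defined when a bucket has both target and non-target trials.
    return {
        key: equal_error_rate(scores, labels)
        for key, (scores, labels) in buckets.items()
        if 0 < sum(labels) < len(labels)
    }


# Toy 2-D embeddings and made-up genre labels, purely for illustration.
toy_trials: List[Trial] = [
    ([1.0, 0.0], [1.0, 0.0], True, "interview", "interview"),
    ([1.0, 0.0], [0.0, 1.0], False, "interview", "interview"),
    ([1.0, 0.0], [0.6, 0.8], True, "interview", "vlog"),
    ([1.0, 0.0], [0.8, 0.6], False, "interview", "vlog"),
]
print(eer_by_condition(toy_trials))
```

Reporting the two conditions separately, rather than one pooled number, is what lets a genre-induced degradation show up at all; a real evaluation would use embeddings from a trained speaker model and a much larger trial list.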

Code & Models

Datasets

hustep-lab/VoxVietnam-Dataset
dataset · 222 dl

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing