Robust Open-Set Spoken Language Identification and the CU MultiLang Dataset

Mustafa Eyceoz; Justin Lee; Siddharth Pittie; Homayoon Beigi

arXiv:2308.14951·cs.CL·August 30, 2023

TL;DR

This paper introduces an open-set spoken language identification system that detects unknown languages using MFCC and pitch features, TDNN feature embeddings, confidence thresholding on softmax outputs, and LDA and pLDA, supported by the new CU MultiLang dataset.

Contribution

The paper presents a new open-set language identification approach and introduces the CU MultiLang dataset for training and evaluation.

Findings

01 Achieved 91.76% accuracy on trained languages

02 Capable of detecting unknown languages on the fly

03 Developed a large, diverse multilingual speech corpus

Abstract

Most state-of-the-art spoken language identification models are closed-set; in other words, they can only output a language label from the set of classes they were trained on. Open-set spoken language identification systems, however, gain the ability to detect when an input exhibits none of the original languages. In this paper, we implement a novel approach to open-set spoken language identification that uses MFCC and pitch features, a TDNN model to extract meaningful feature embeddings, confidence thresholding on softmax outputs, and LDA and pLDA for learning to classify new unknown languages. We present a spoken language identification system that achieves 91.76% accuracy on trained languages and has the capability to adapt to unknown languages on the fly. To that end, we also built the CU MultiLang Dataset, a large and diverse multilingual speech corpus which was used to train and…
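
The confidence-thresholding step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the label list, and the threshold value of 0.5 are all assumptions chosen for illustration, and the excerpt above does not state the threshold the paper actually uses. The idea is that a closed-set classifier produces a softmax distribution over its trained languages, and an input is flagged as unknown when the top probability falls below a chosen cutoff.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_open_set(logits, labels, threshold=0.5):
    """Return (label, confidence) for the most probable class.

    If the top softmax probability is below `threshold` (a hypothetical
    value, not taken from the paper), the label is "unknown".
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]
    return labels[best], probs[best]


# Hypothetical label set and scores, for illustration only.
labels = ["en", "es", "fr"]
confident = classify_open_set([5.0, 0.0, 0.0], labels)
uncertain = classify_open_set([0.1, 0.0, 0.05], labels)
```

Here `confident` carries the label "en" because one score dominates, while `uncertain` is labelled "unknown" because the three probabilities are nearly uniform. In a full system of the kind the abstract describes, inputs flagged this way would presumably be passed on to the LDA and pLDA stage rather than simply rejected.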

Taxonomy

Methods: Linear Discriminant Analysis · Softmax