Multimodal Modeling For Spoken Language Identification
Shikhar Bharadwaj, Min Ma, Shikhar Vashishth, Ankur Bapna, Sriram Ganapathy, Vera Axelrod, Siddharth Dalmia, Wei Han, Yu Zhang, Daan van Esch, Sandy Ritchie, Partha Talukdar, Jason Riesa

TL;DR
This paper introduces MuSeLI, a multimodal approach for spoken language identification that leverages video metadata like titles, descriptions, and location to improve accuracy, achieving state-of-the-art results on YouTube datasets.
Contribution
The paper presents MuSeLI, a novel multimodal framework that incorporates diverse metadata sources to enhance spoken language identification performance.
Findings
Metadata significantly improves language identification accuracy.
MuSeLI outperforms previous single-modality methods.
Each modality contributes uniquely to the overall performance.
Abstract
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance. Conventionally, it is modeled as a speech-based language identification task. Prior techniques have been constrained to a single modality; however, in the case of video data, there is a wealth of other metadata that may be beneficial for this task. In this work, we propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification. Our study reveals that metadata such as video title, description and geographic location provide substantial information to identify the spoken language of the multimedia recording. We conduct experiments using two diverse public datasets of YouTube videos, and obtain state-of-the-art results on the language identification task. We additionally conduct an…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
