Multiclass Language Identification using Deep Learning on Spectral Images of Audio Signals
Shauna Revay, Matthew Teschke

TL;DR
This paper introduces LIFAS, a deep learning approach using spectrograms and CNNs for accurate multiclass language identification from short audio clips, with minimal pre-processing.
Contribution
It presents a novel application of CNNs to spectrograms for language identification, achieving high accuracy with minimal pre-processing.
Findings
Binary classification accuracy of 97%
Multi-class classification accuracy of 89% for six languages
Effective use of deep learning on spectrograms for language detection
Abstract
The first step in any voice recognition software is to determine what language a speaker is using, and ideally this process would be automated. The technique described in this paper, language identification for audio spectrograms (LIFAS), uses spectrograms generated from audio signals as inputs to a convolutional neural network (CNN) that identifies the language. LIFAS requires minimal pre-processing of the audio signals, as the spectrograms are generated for each batch as it is input to the network during training. LIFAS takes deep learning tools that have been shown to be successful on image processing tasks and applies them to audio signal classification. LIFAS performs binary language classification with an accuracy of 97%, and multi-class classification with six languages at an accuracy of 89%, on 3.75-second audio clips.
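To make the input representation concrete, the sketch below turns a 3.75-second clip into a log-magnitude spectrogram using a short-time Fourier transform in NumPy. The abstract does not state the sample rate, window length, hop size, or scaling, so the 16 kHz rate, 400-sample Hann window, 160-sample hop, and `log1p` scaling are all illustrative assumptions, as is the function name `log_spectrogram`. It is not the authors' implementation.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160):
    """Log-magnitude spectrogram of shape (n_freq_bins, n_frames).

    frame_len and hop are assumed values, not taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (signal.size - frame_len) // hop
    # Index matrix: row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    # Taper each frame with a Hann window to reduce spectral leakage.
    frames = signal[idx] * np.hanning(frame_len)
    # One-sided FFT per frame, then compress the dynamic range.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitude).T

# Hypothetical 3.75-second clip: a 440 Hz tone standing in for speech.
sample_rate = 16000  # assumed sample rate
t = np.arange(int(3.75 * sample_rate)) / sample_rate
clip = np.sin(2 * np.pi * 440.0 * t)

spec = log_spectrogram(clip)
print(spec.shape)  # (frequency bins, time frames)
```

The resulting 2-D array can be treated as a single-channel image and passed to a CNN. Computing it inside the data loader, once per batch, would match the abstract's description of generating spectrograms as batches are fed to the network.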
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
