Multiclass Language Identification using Deep Learning on Spectral Images of Audio Signals

Shauna Revay; Matthew Teschke

arXiv:1905.04348 · cs.SD · May 14, 2019 · 38 cites

TL;DR

This paper introduces LIFAS, a deep learning approach using spectrograms and CNNs for accurate multiclass language identification from short audio clips, with minimal pre-processing.

Contribution

It presents a novel application of CNNs to spectrograms for language identification, achieving high accuracy with minimal pre-processing.

Findings

01. Binary classification accuracy of 97%

02. Multi-class classification accuracy of 89% for six languages

03. Effective use of deep learning on spectrograms for language detection

Abstract

The first step in any voice recognition software is to determine what language a speaker is using, and ideally this process would be automated. The technique described in this paper, language identification for audio spectrograms (LIFAS), uses spectrograms generated from audio signals as inputs to a convolutional neural network (CNN) that performs language identification. LIFAS requires minimal pre-processing of the audio signals because the spectrograms are generated for each batch as it is fed to the network during training. LIFAS takes deep learning tools that have been shown to be successful on image processing tasks and applies them to audio signal classification. LIFAS performs binary language classification with an accuracy of 97%, and multi-class classification with six languages at an accuracy of 89%, on 3.75-second audio clips.
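The per-batch spectrogram generation described in the abstract can be sketched roughly as follows. This is a minimal illustration in NumPy, not the authors' implementation: the 16 kHz sample rate, the FFT size, the hop length, the log scaling, and the helper names `spectrogram` and `batch_generator` are all assumptions made for the example, and random noise stands in for real speech. Only the 3.75-second clip length and the six-class label range come from the abstract.

```python
import numpy as np


def spectrogram(signal, n_fft=256, hop=128):
    """Log-magnitude spectrogram from a Hann-windowed short-time Fourier transform.

    n_fft and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)
    return np.log1p(magnitude).T  # (freq_bins, time_frames)


def batch_generator(clips, labels, batch_size, rng):
    """Yield (spectrogram_batch, label_batch) pairs.

    Spectrograms are computed only when a batch is drawn, so the raw
    audio needs no separate image-generation pass before training.
    """
    order = rng.permutation(len(clips))
    for start in range(0, len(order), batch_size):
        idx = order[start : start + batch_size]
        specs = np.stack([spectrogram(clips[i]) for i in idx])
        # Add a trailing channel axis so each spectrogram looks like a
        # single-channel image to a CNN.
        yield specs[..., np.newaxis], labels[idx]


# Synthetic stand-in data: 8 clips of 3.75 s at an assumed 16 kHz sample rate.
sample_rate = 16000  # assumption, not stated in the abstract
clip_len = int(3.75 * sample_rate)
rng = np.random.default_rng(0)
clips = rng.standard_normal((8, clip_len))
labels = rng.integers(0, 6, size=8)  # six language classes

specs, y = next(batch_generator(clips, labels, batch_size=4, rng=rng))
print(specs.shape, y.shape)
```

Each yielded batch has shape (batch, frequency bins, time frames, 1), which is the layout an image-style CNN would typically consume. In practice the generator would read audio files from disk rather than hold every clip in memory.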

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing