DeepSpectrumLite: A Power-Efficient Transfer Learning Framework for Embedded Speech and Audio Processing from Decentralised Data

Shahin Amiriparian (1), Tobias Hübner (1), Maurice Gerczuk (1), Sandra Ottl (1), Björn W. Schuller (1,2)
((1) EIHW -- Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany; (2) GLAM -- Group on Language, Audio, and Music, Imperial College London, UK)

arXiv:2104.11629 · cs.SD · April 26, 2021 · 1 citation

TL;DR

DeepSpectrumLite is a lightweight transfer learning framework that enables real-time, on-device speech and audio recognition on embedded devices by fine-tuning pre-trained CNNs on spectrograms, achieving state-of-the-art results with low latency.

Contribution

It introduces a novel, resource-efficient transfer learning pipeline for embedded speech and audio processing using pre-trained CNNs and on-the-fly spectrogram augmentation.
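
To make "on-the-fly spectrogram augmentation" concrete, here is a minimal sketch. The text above does not say which augmentations the framework applies, so random time and frequency masking is used purely as a stand-in. The function names, mask widths, and batch generator are hypothetical and are not taken from the DeepSpectrumLite code base.

```python
import numpy as np


def mask_spectrogram(spec, rng, max_freq_width=8, max_time_width=10):
    """Return a copy of spec (n_mels, n_frames) with one random frequency
    band and one random time span overwritten by the spectrogram minimum.

    Assumes n_mels >= max_freq_width and n_frames >= max_time_width.
    """
    out = spec.copy()
    fill = spec.min()
    n_mels, n_frames = spec.shape

    # Frequency mask: a horizontal stripe of random height and position.
    f_width = int(rng.integers(0, max_freq_width + 1))
    f_start = int(rng.integers(0, n_mels - f_width + 1))
    out[f_start:f_start + f_width, :] = fill

    # Time mask: a vertical stripe of random width and position.
    t_width = int(rng.integers(0, max_time_width + 1))
    t_start = int(rng.integers(0, n_frames - t_width + 1))
    out[:, t_start:t_start + t_width] = fill
    return out


def augmented_batches(specs, batch_size, seed=0):
    """Yield shuffled batches, drawing a fresh mask for every example.

    Masks are generated while iterating, so no augmented copy of the
    data set is ever stored (the "on-the-fly" part of this sketch).
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(specs))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield np.stack([mask_spectrogram(specs[i], rng) for i in idx])
```

Because the masks are drawn inside the generator, every pass over the data sees differently masked inputs, which is the usual motivation for augmenting at training time rather than ahead of it.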

Findings

1. Runs the full pipeline in real time, with a mean inference lag of 242.0 ms for a DenseNet121 model on a smartphone.

2. Runs decentralised on the device, so no data has to be uploaded.

3. Obtains state-of-the-art results on paralinguistics tasks.

Abstract

Deep neural speech and audio processing systems have a large number of trainable parameters, a relatively complex architecture, and require a vast amount of training data and computational power. These constraints make it more challenging to integrate such systems into embedded devices and utilise them for real-time, real-world applications. We tackle these limitations by introducing DeepSpectrumLite, an open-source, lightweight transfer learning framework for on-device speech and audio recognition using pre-trained image convolutional neural networks (CNNs). The framework creates and augments Mel-spectrogram plots on the fly from raw audio signals, which are then used to fine-tune specific pre-trained CNNs for the target classification task. Subsequently, the whole pipeline can be run in real time with a mean inference lag of 242.0 ms when a DenseNet121 model is used on a consumer-grade…
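
The spectrogram step described in the abstract can be sketched roughly as follows. This is a NumPy-only illustration under stated assumptions, not the DeepSpectrumLite implementation. The function names, the FFT and hop sizes, the 64 Mel bands, the 224×224 image size, and the simple three-channel colour ramp (standing in for a plotting colour map) are all choices made for this example.

```python
import numpy as np


def hz_to_mel(hz):
    """Convert frequency in Hz to the Mel scale (HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, centre, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_spectrogram(signal, sample_rate, n_fft=512, hop=256, n_mels=64):
    """Log-scaled Mel spectrogram in dB, shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
    mel = mel_filterbank(n_mels, n_fft, sample_rate) @ power.T
    return 10.0 * np.log10(np.maximum(mel, 1e-10))


def to_rgb_image(spec_db, size=(224, 224)):
    """Scale to [0, 1], resize by nearest neighbour, and map to 3 channels."""
    lo, hi = spec_db.min(), spec_db.max()
    norm = (spec_db - lo) / (hi - lo + 1e-12)
    rows = np.linspace(0, norm.shape[0] - 1, size[0]).round().astype(int)
    cols = np.linspace(0, norm.shape[1] - 1, size[1]).round().astype(int)
    resized = norm[rows][:, cols][::-1]  # flip so low frequencies sit at the bottom
    # Hypothetical colour ramp; a real plot would use a proper colour map.
    rgb = np.stack([resized, resized ** 2, 1.0 - resized], axis=-1)
    return rgb.astype(np.float32)
```

Calling `to_rgb_image(log_mel_spectrogram(audio, 16000))` on a one-second mono signal yields a `(224, 224, 3)` float array in `[0, 1]`, the kind of input an ImageNet-style CNN such as DenseNet121 expects before its own preprocessing. The fine-tuning step itself is left out here because it depends on the deep learning library in use.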

Code & Models

Repositories

DeepSpectrum/DeepSpectrumLite (TensorFlow, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing