Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment

Aditya Chakravarty

arXiv:2405.01004·cs.SD·May 3, 2024

TL;DR

This paper investigates energy efficiency, noise resilience, and quantization effects in transformer-based speech recognition models deployed on edge devices, providing insights for optimizing on-device ASR systems.

Contribution

It offers a comprehensive analysis of quantization, energy consumption, and noise resilience for ASR models on NVIDIA Jetson devices, highlighting key trade-offs for edge deployment.

Findings

01

FP16 quantization halves energy consumption with minimal performance loss

02

Model size and parameters do not predict noise resilience or energy use

03

Quantization and model size impact energy efficiency and accuracy trade-offs
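The energy comparison in the findings rests on a simple relation: energy is power integrated over the duration of an inference run. The sketch below estimates energy in joules from sampled power readings using the trapezoidal rule. This summary does not say how the paper sampled GPU power, so the function name `energy_joules`, the input format, and the sample values are all hypothetical.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Estimate energy (J) from power samples (W) taken at the given times (s).

    Hypothetical helper: integrates power over time with the trapezoidal rule.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power readings must have equal length")
    if len(timestamps_s) < 2:
        raise ValueError("need at least two samples to integrate")

    total = 0.0
    for k in range(1, len(timestamps_s)):
        dt = timestamps_s[k] - timestamps_s[k - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        # Area of the trapezoid between two consecutive power samples.
        total += 0.5 * (power_w[k] + power_w[k - 1]) * dt
    return total


if __name__ == "__main__":
    # Made-up readings, for illustration only (not measurements from the paper).
    times = [0.0, 0.5, 1.0, 1.5]
    watts = [6.0, 7.5, 7.0, 6.5]
    print(f"estimated energy: {energy_joules(times, watts):.2f} J")
```

Comparing two precisions this way means integrating over the same workload for each one, since a faster run at similar power draws less energy.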

Abstract

Recent transformer-based ASR models have achieved word error rates (WER) below 4%, surpassing human annotator accuracy, yet they demand extensive server resources and so contribute to significant carbon footprints. The traditional server-based ASR architecture also raises privacy concerns, along with reliability and latency issues caused by network dependencies. In contrast, on-device (edge) ASR enhances privacy, boosts performance, and promotes sustainability by balancing energy use and accuracy for specific applications. This study examines the effects of quantization, memory demands, and energy consumption on the inference performance of various ASR models on the NVIDIA Jetson Orin Nano. By analyzing WER and transcription speed across models using FP32, FP16, and INT8 quantization on clean and noisy datasets, we highlight the crucial trade-offs between accuracy, speeds,…
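The abstract's central accuracy metric, WER, is conventionally defined as the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. A minimal sketch of that standard definition follows. It assumes whitespace tokenization and no text normalization; the paper's own scoring pipeline is not described in this summary and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.

    Assumes whitespace tokenization; casing and punctuation are not normalized.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.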

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings