Spectrogram features for audio and speech analysis

Ian McLoughlin; Lam Pham; Yan Song; Xiaoxiao Miao; Huy Phan; Pengfei Cai; Qing Gu; Jiang Nan; Haoyu Song; Donny Soh

arXiv:2603.14917 · eess.AS · March 17, 2026


Open Access

TL;DR

This paper reviews how spectrogram features are used in audio and speech analysis, examining their properties, variations, and how they interact with classifier architectures across different tasks.

Contribution

It provides a comprehensive survey of spectrogram-based representations, analyzing their characteristics and their compatibility with various machine learning models for audio analysis.

Findings

01  Spectrogram parameters significantly affect analysis performance.

02  Different spectrogram configurations suit different audio tasks.

03  The choice of spectrogram features influences classifier effectiveness.

Abstract

Spectrogram-based representations have grown to dominate the feature space for deep learning audio analysis systems, and are often adopted for speech analysis as well. Initially, the primary motivator for spectrogram-based representations was their ability to present sound as a two-dimensional signal in the time-frequency plane, which not only provides an interpretable physical basis for analysing sound, but also unlocks the use of a wide range of machine learning techniques, such as convolutional neural networks, that had been developed for image processing. A spectrogram is a matrix characterised by the resolution and span of its two dimensions, as well as by the representation and scaling of each element. Many possibilities for these three characteristics have been explored by researchers across numerous application areas, with different settings showing affinity for various tasks. This…
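
The three characteristics named in the abstract (resolution and span of the time axis, resolution and span of the frequency axis, and the representation and scaling of each element) can be made concrete with a minimal NumPy sketch. This is an illustration only, not the authors' method: the function name `log_spectrogram` and the parameters `frame_len`, `hop`, `n_fft`, and `eps` are hypothetical, and the defaults (400-sample frames with a 160-sample hop, i.e. 25 ms and 10 ms at 16 kHz) are assumed as commonly used speech settings rather than taken from the paper.

```python
import numpy as np

def log_spectrogram(signal, sample_rate, frame_len=400, hop=160, n_fft=512, eps=1e-10):
    """Illustrative log-power spectrogram (hypothetical helper, not from the paper).

    Returns (spec_db, times, freqs) where spec_db has shape (n_freq_bins, n_frames).
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one analysis frame")

    # Time axis: `hop` sets the resolution, the signal length sets the span.
    n_frames = 1 + (signal.size - frame_len) // hop
    starts = hop * np.arange(n_frames)
    index = starts[:, None] + np.arange(frame_len)[None, :]
    frames = signal[index] * np.hanning(frame_len)  # windowed frames

    # Frequency axis: `n_fft` sets the bin spacing, sample_rate / 2 the span.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

    # Element representation and scaling: power, then decibels.
    power = np.abs(spectrum) ** 2
    spec_db = 10.0 * np.log10(power + eps)

    # Centre time of each frame, in seconds.
    times = (starts + frame_len / 2) / sample_rate
    return spec_db.T, times, freqs

# Usage: one second of a 1 kHz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
spec, times, freqs = log_spectrogram(tone, sr)
print(spec.shape, freqs[np.argmax(spec.mean(axis=1))])
```

Changing `hop` or `frame_len` alters the time resolution, changing `n_fft` alters the frequency bin spacing, and swapping the decibel step for linear magnitude or another scaling changes the element representation, which are the kinds of settings the survey compares across tasks.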

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing