Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution

Dennis Fucci; Marco Gaido; Matteo Negri; Mauro Cettolo; Luisa Bentivogli

arXiv:2506.02181 · cs.CL · June 4, 2025

TL;DR

This paper uses feature attribution to identify key acoustic cues in a modern ASR system, revealing how it processes different phonemes and highlighting differences between male and female speech.

Contribution

It applies a feature attribution technique to a modern Conformer-based ASR model and details the acoustic cues the model relies on, whereas prior studies examined such cues only for a limited set of phonemes and outdated models.

Findings

01

The ASR model relies on vowels' full time spans and their first two formants, with greater saliency in male speech.

02

The model captures the spectral characteristics of sibilant fricatives better than those of non-sibilants.

03

The model emphasizes the release phase in plosives, focusing on burst characteristics.

Abstract

Despite significant advances in ASR, the specific acoustic cues models rely on remain unclear. Prior studies have examined such cues on a limited set of phonemes and outdated models. In this work, we apply a feature attribution technique to identify the relevant acoustic cues for a modern Conformer-based ASR system. By analyzing plosives, fricatives, and vowels, we assess how feature attributions align with their acoustic properties in the time and frequency domains, also essential for human speech perception. Our findings show that the ASR model relies on vowels' full time spans, particularly their first two formants, with greater saliency in male speech. It also better captures the spectral characteristics of sibilant fricatives than non-sibilants and prioritizes the release phase in plosives, especially burst characteristics. These insights enhance the interpretability of ASR models…
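
To make the idea of attributing saliency to regions of a spectrogram concrete, here is a minimal sketch using plain occlusion-based attribution. This is a generic technique chosen for illustration and is not claimed to be the method used in the paper. The scoring function `toy_score` is a hypothetical stand-in for an ASR model's output probability, and the spectrogram is synthetic, with extra energy placed in a known region.

```python
import numpy as np

def occlusion_saliency(spec, score_fn, patch=(4, 4)):
    """Occlusion-based saliency for a (frames, bins) spectrogram.

    Each patch is replaced by the spectrogram mean; the resulting drop
    in score_fn is assigned to every cell of that patch.
    """
    base = score_fn(spec)
    sal = np.zeros_like(spec, dtype=float)
    fill = spec.mean()
    t_step, f_step = patch
    for t in range(0, spec.shape[0], t_step):
        for f in range(0, spec.shape[1], f_step):
            masked = spec.copy()
            masked[t:t + t_step, f:f + f_step] = fill
            sal[t:t + t_step, f:f + f_step] = base - score_fn(masked)
    return sal

def toy_score(spec):
    # Hypothetical stand-in for a model score: it only responds to
    # energy in frames 8-15 and frequency bins 4-7.
    return float(spec[8:16, 4:8].sum())

# Synthetic spectrogram (assumption): low-level noise plus one energetic region.
rng = np.random.default_rng(0)
spec = 0.1 * rng.random((32, 16))
spec[8:16, 4:8] += 1.0

sal = occlusion_saliency(spec, toy_score)

# Aggregate saliency along each axis to see which frames and which
# frequency bins matter most to the score.
time_profile = sal.sum(axis=1)   # one value per frame
freq_profile = sal.sum(axis=0)   # one value per frequency bin
```

Summing the saliency map over frequency gives a time profile, and summing over time gives a frequency profile. With a real model, profiles like these could be compared against phoneme boundaries or formant regions, which is the kind of time-domain and frequency-domain comparison the abstract describes.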

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing