# On the Relevance of Auditory-Based Gabor Features for Deep Learning in   Automatic Speech Recognition

**Authors:** Angel Mario Castro Martinez, Sri Harish Mallidi, Bernd T. Meyer

arXiv: 1702.04333 · 2017-02-15

## TL;DR

This study investigates auditory Gabor features' role in enhancing deep learning-based speech recognition, demonstrating their discriminability and robustness across various noisy and challenging conditions.

## Contribution

The paper validates that Gabor filterbank features improve speech recognition by providing more discriminable cues, especially in challenging environments, and introduces a similarity measure linking phoneme class distinctions to acoustic properties.

## Key findings

- High temporal modulation frequencies (16-25 Hz) yield the best recognition performance.
- Gabor features outperform Mel-filterbank features in noisy and distorted conditions.
- A new similarity measure correlates phoneme class distinctions with recognition accuracy.

## Abstract

Previous studies support the idea of merging auditory-based Gabor features with deep learning architectures to achieve robust automatic speech recognition, however, the cause behind the gain of such combination is still unknown. We believe these representations provide the deep learning decoder with more discriminable cues. Our aim with this paper is to validate this hypothesis by performing experiments with three different recognition tasks (Aurora 4, CHiME 2 and CHiME 3) and assess the discriminability of the information encoded by Gabor filterbank features. Additionally, to identify the contribution of low, medium and high temporal modulation frequencies subsets of the Gabor filterbank were used as features (dubbed LTM, MTM and HTM respectively). With temporal modulation frequencies between 16 and 25 Hz, HTM consistently outperformed the remaining ones in every condition, highlighting the robustness of these representations against channel distortions, low signal-to-noise ratios and acoustically challenging real-life scenarios with relative improvements from 11 to 56% against a Mel-filterbank-DNN baseline. To explain the results, a measure of similarity between phoneme classes from DNN activations is proposed and linked to their acoustic properties. We find this measure to be consistent with the observed error rates and highlight specific differences on phoneme level to pinpoint the benefit of the proposed features.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1702.04333/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1702.04333/full.md

## References

57 references — full list in the complete paper: https://tomesphere.com/paper/1702.04333/full.md

---
Source: https://tomesphere.com/paper/1702.04333