Exploring spectro-temporal features in end-to-end convolutional neural networks
Sean Robertson, Gerald Penn, Yingxue Wang

TL;DR
This paper investigates alternative filter bank designs, such as Gabor and Gammatone filters, and different feature computation methods in end-to-end CNN speech recognition, but finds no significant performance improvements.
Contribution
The paper explores alternative filter bank types (Gabor and Gammatone) and feature computation strategies for speech recognition, provides open-source implementations, and analyzes their impact on recognition performance.
Findings
No significant reduction in phone error rate with proposed modifications
Alternative filters and computation methods do not outperform standard Mel-filter banks
Discusses the implications for learned filter banks in CNNs
Abstract
Triangular, overlapping Mel-scaled filters ("f-banks") are the current standard input for acoustic models that exploit their input's time-frequency geometry, because they provide a psycho-acoustically motivated time-frequency geometry for a speech signal. F-bank coefficients are provably robust to small deformations in the scale. In this paper, we explore two ways in which filter banks can be adjusted for the purposes of speech recognition. First, triangular filters can be replaced with Gabor filters, a compactly supported filter that better localizes events in time, or Gammatone filters, a psychoacoustically-motivated filter. Second, by rearranging the order of operations in computing filter bank features, features can be integrated over smaller time scales while simultaneously providing better frequency resolution. We make all feature implementations available online through…
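To make the two filter shapes named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' released implementation. The function names (`triangular_mel_bank`, `gabor_impulse_response`), the parameter choices, and the Hz-to-mel formula are assumptions picked for illustration; the paper may use different conventions. The Gabor filter is written as a Gaussian envelope times a cosine carrier, truncated to a fixed number of taps.

```python
import math


def hz_to_mel(hz):
    """Convert Hz to mels (one common convention; assumed, not from the paper)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def triangular_mel_bank(num_filters, num_bins, sample_rate, low_hz=0.0, high_hz=None):
    """Return num_filters rows of triangular weights over num_bins frequency bins.

    Bin k is assumed to sit at k * (sample_rate / 2) / (num_bins - 1) Hz.
    """
    if high_hz is None:
        high_hz = sample_rate / 2.0
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    # num_filters + 2 edge points, equally spaced on the mel scale, so that
    # each filter spans from the previous centre to the next centre.
    step = (high_mel - low_mel) / (num_filters + 1)
    edges_hz = [mel_to_hz(low_mel + i * step) for i in range(num_filters + 2)]
    bin_hz = [k * (sample_rate / 2.0) / (num_bins - 1) for k in range(num_bins)]

    bank = []
    for m in range(1, num_filters + 1):
        left, centre, right = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if left < f <= centre:
                row.append((f - left) / (centre - left))    # rising slope
            elif centre < f < right:
                row.append((right - f) / (right - centre))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def gabor_impulse_response(centre_hz, sigma_s, sample_rate, num_taps):
    """Real part of a Gabor filter: Gaussian envelope times a cosine carrier.

    sigma_s is the envelope's standard deviation in seconds; the response is
    truncated to num_taps samples centred on t = 0.
    """
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        t = (n - mid) / sample_rate
        envelope = math.exp(-0.5 * (t / sigma_s) ** 2)
        taps.append(envelope * math.cos(2.0 * math.pi * centre_hz * t))
    return taps


if __name__ == "__main__":
    bank = triangular_mel_bank(num_filters=10, num_bins=257, sample_rate=16000)
    taps = gabor_impulse_response(centre_hz=1000.0, sigma_s=0.001,
                                  sample_rate=16000, num_taps=65)
    print(len(bank), len(bank[0]), len(taps))
```

Applying a row of `bank` to a frame's power spectrum (a weighted sum over bins) gives one f-bank coefficient, while the Gabor taps would instead be convolved with the waveform directly. That difference between filtering in the frequency domain and in the time domain is one way to read the abstract's remark about rearranging the order of operations, though the paper's exact procedure is not given in this summary.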
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
