Exploring spectro-temporal features in end-to-end convolutional neural networks

Sean Robertson; Gerald Penn; Yingxue Wang

arXiv:1901.00072·cs.LG·January 3, 2019·6 cites

TL;DR

This paper investigates alternative filter bank designs, such as Gabor and Gammatone filters, and different feature computation methods in end-to-end CNN speech recognition, but finds no significant performance improvements.

Contribution

It introduces alternative filter bank types and feature computation strategies for speech recognition, providing open-source implementations and analyzing their impact.

Findings

01 Neither proposed modification significantly reduces phone error rate.

02 Alternative filters and computation methods do not outperform standard Mel-scaled filter banks.

03 The result's implications for learned filter banks in CNNs are discussed.

Abstract

Triangular, overlapping Mel-scaled filters ("f-banks") are the current standard input for acoustic models that exploit their input's time-frequency geometry, because they provide a psychoacoustically motivated time-frequency geometry for a speech signal. F-bank coefficients are provably robust to small deformations in the scale. In this paper, we explore two ways in which filter banks can be adjusted for the purposes of speech recognition. First, triangular filters can be replaced with Gabor filters, compactly supported filters that better localize events in time, or Gammatone filters, which are psychoacoustically motivated. Second, by rearranging the order of operations in computing filter bank features, features can be integrated over smaller time scales while simultaneously providing better frequency resolution. We make all feature implementations available online through…
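For readers unfamiliar with the baseline the abstract refers to, here is a minimal NumPy sketch of conventional triangular Mel-scaled f-bank features. It is not taken from the authors' repository. The function names, the HTK-style Mel formula, and all parameter values (16 kHz audio, 25 ms frames, 10 ms shift, 40 filters, 512-point FFT) are assumptions chosen for illustration.

```python
import numpy as np


def hz_to_mel(hz):
    # HTK-style Mel scale (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def triangular_mel_filters(num_filters=40, fft_size=512, sample_rate=16000,
                           low_hz=20.0, high_hz=8000.0):
    """Return a (num_filters, fft_size // 2 + 1) matrix of triangular filters."""
    # Edges are equally spaced on the Mel scale; filter i spans
    # edges[i]..edges[i + 2], so neighbouring triangles overlap by half.
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(low_hz), hz_to_mel(high_hz),
                                     num_filters + 2))
    bin_hz = np.arange(fft_size // 2 + 1) * sample_rate / fft_size
    left = edges_hz[:-2, None]
    centre = edges_hz[1:-1, None]
    right = edges_hz[2:, None]
    rising = (bin_hz - left) / (centre - left)
    falling = (right - bin_hz) / (right - centre)
    # The triangle is the lower of the two slopes, clipped at zero.
    return np.maximum(0.0, np.minimum(rising, falling))


def fbank_features(signal, sample_rate=16000, frame_len=400, frame_shift=160,
                   fft_size=512, num_filters=40):
    """Return log Mel filter bank energies with shape (num_frames, num_filters)."""
    # Conventional order of operations: frame, window, power spectrum,
    # then integrate the spectrum under each triangular filter.
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    index = (np.arange(frame_len)[None, :]
             + frame_shift * np.arange(num_frames)[:, None])
    frames = signal[index] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=fft_size, axis=1)) ** 2
    filters = triangular_mel_filters(num_filters, fft_size, sample_rate)
    return np.log(power @ filters.T + 1e-10)
```

Calling `fbank_features` on one second of 16 kHz audio yields one 40-dimensional vector per 10 ms frame. The modifications described in the abstract would change the filter shape (Gabor or Gammatone in place of triangles) or the order of these steps. Neither is implemented in this sketch.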

Code & Models

Repositories

sdrobert/more-or-let (tf, Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing