TL;DR
The paper introduces a deep scattering spectrum that extends MFCC representations through cascades of wavelet convolutions and modulus operators, achieving state-of-the-art results in musical genre classification (GTZAN) and phone classification (TIMIT).
Contribution
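It proposes a scattering transform for audio that is locally translation invariant and stable to time-warping deformations. Applying a further scattering transform along log-frequency adds invariance to frequency transposition. Together these properties improve audio classification performance.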
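For orientation, the cascade of wavelet convolutions and modulus operators can be written out for the first three orders. This summary defines no symbols, so the notation here is an assumption borrowed from common usage in the scattering literature: \phi is a low-pass averaging filter, \psi_{\lambda} is a band-pass wavelet with center frequency \lambda, and \star denotes convolution.

```latex
\begin{aligned}
S_0 x(t) &= x \star \phi(t), \\
S_1 x(t, \lambda_1) &= \lvert x \star \psi_{\lambda_1} \rvert \star \phi(t), \\
S_2 x(t, \lambda_1, \lambda_2) &= \bigl\lvert \lvert x \star \psi_{\lambda_1} \rvert \star \psi_{\lambda_2} \bigr\rvert \star \phi(t).
\end{aligned}
```

The first-order coefficients play the role of an averaged time-frequency energy, comparable to what MFCC-style features capture. The second-order coefficients recover amplitude modulation information that the averaging by \phi removes from the first order.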
Findings
Achieved state-of-the-art accuracy on GTZAN genre classification.
Attained state-of-the-art results on TIMIT phone classification.
Showed that the representation is stable to time-warping deformations and can be made invariant to frequency transposition.
Abstract
A scattering transform defines a locally translation invariant representation which is stable to time-warping deformations. It extends MFCC representations by computing modulation spectrum coefficients of multiple orders, through cascades of wavelet convolutions and modulus operators. Second-order scattering coefficients characterize transient phenomena such as attacks and amplitude modulation. A frequency transposition invariant representation is obtained by applying a scattering transform along log-frequency. State-of-the-art classification results are obtained for musical genre and phone classification on GTZAN and TIMIT databases, respectively.
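The cascade described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes Gaussian band-pass filters as a stand-in for the paper's wavelets, circular (FFT-based) convolution, no subsampling, and all second-order paths rather than a pruned subset. The function names, filter center frequencies, and bandwidths are hypothetical choices made only for illustration.

```python
import numpy as np

def gaussian_bandpass(n, centers, rel_bw=0.25):
    """Frequency responses of Gaussian band-pass filters (assumed wavelet stand-ins).

    Bandwidth scales with center frequency, giving a constant-Q filter bank.
    Only positive frequencies are covered, so the filters are roughly analytic.
    """
    freqs = np.fft.fftfreq(n)                    # cycles per sample
    centers = np.asarray(centers, dtype=float)[:, None]
    return np.exp(-0.5 * ((freqs[None, :] - centers) / (rel_bw * centers)) ** 2)

def gaussian_lowpass(n, sigma):
    """Frequency response of the Gaussian averaging filter phi."""
    freqs = np.fft.fftfreq(n)
    return np.exp(-0.5 * (freqs / sigma) ** 2)

def scattering(x, n_first=8, n_second=4, lowpass_sigma=0.002):
    """Return zeroth-, first- and second-order scattering coefficients of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    phi = gaussian_lowpass(n, lowpass_sigma)
    # Hypothetical center frequencies: half-octave spacing for the first layer,
    # octave spacing at lower (modulation) frequencies for the second layer.
    psi1 = gaussian_bandpass(n, 0.4 * 2.0 ** (-np.arange(n_first) / 2.0))
    psi2 = gaussian_bandpass(n, 0.05 * 2.0 ** (-np.arange(n_second)))

    def average(u):
        # Low-pass filtering along time provides the local translation invariance.
        return np.real(np.fft.ifft(np.fft.fft(u, axis=-1) * phi, axis=-1))

    s0 = average(x)
    # First layer: wavelet convolution followed by modulus, shape (n_first, n).
    u1 = np.abs(np.fft.ifft(np.fft.fft(x) * psi1, axis=-1))
    s1 = average(u1)
    # Second layer: filter each first-layer envelope again, then take the modulus.
    u1_hat = np.fft.fft(u1, axis=-1)[:, None, :]
    u2 = np.abs(np.fft.ifft(u1_hat * psi2[None, :, :], axis=-1))
    s2 = average(u2)                             # shape (n_first, n_second, n)
    return s0, s1, s2

# Usage: an amplitude-modulated tone, the kind of signal whose modulation
# is expected to show up in the second-order coefficients.
t = np.arange(4096)
x = (1.0 + 0.5 * np.cos(2 * np.pi * 0.01 * t)) * np.cos(2 * np.pi * 0.2 * t)
s0, s1, s2 = scattering(x)
print(s0.shape, s1.shape, s2.shape)
```

Because every step is either a convolution or a pointwise modulus, shifting the input shifts the intermediate envelopes by the same amount, and the final averaging is what turns that covariance into local invariance. A practical implementation would also subsample each layer and restrict the second layer to paths that carry significant energy, both of which this sketch omits for brevity.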
