A Multi-Resolution Front-End for End-to-End Speech Anti-Spoofing

Wei Liu; Meng Sun; Xiongwei Zhang; Hugo Van hamme; Thomas Fang Zheng

arXiv:2110.05087·cs.SD·October 12, 2021

TL;DR

This paper introduces a multi-resolution front-end that automatically learns optimal combinations of time-frequency resolutions for speech anti-spoofing, improving classification performance while reducing model complexity.

Contribution

It proposes a learnable multi-resolution feature extraction method with automatic weighting and pruning, enhancing end-to-end speech anti-spoofing systems.

Findings

01 Outperforms baseline methods on the ASVspoof 2019 dataset

02 Automatically learns optimal time-frequency resolution combinations

03 Reduces model complexity through pruning
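
The pruning finding can be illustrated with a small sketch. This summary does not state the pruning criterion, so everything here is an assumption made for illustration: the hypothetical `prune_resolutions` helper, the softmax normalisation of per-resolution scores, and the `threshold` value. It shows only the general idea that resolution branches with negligible weight can be dropped, which removes their share of the computation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def prune_resolutions(logits, threshold=0.1):
    """Drop resolution branches whose normalised weight is below `threshold`.

    Assumption: weights come from a softmax over per-resolution scores and
    pruning is a simple threshold; the paper's actual rule may differ.
    Returns the indices of the kept branches and their renormalised weights.
    """
    weights = softmax(np.asarray(logits, dtype=float))
    keep = np.flatnonzero(weights >= threshold)
    if keep.size == 0:
        # Always keep the strongest branch so the front-end is never empty.
        keep = np.array([int(np.argmax(weights))])
    kept_weights = weights[keep] / weights[keep].sum()
    return keep, kept_weights

# Four hypothetical resolution branches with illustrative scores.
keep, kept_weights = prune_resolutions([2.0, 0.0, -3.0, 1.0], threshold=0.1)
print(keep, kept_weights)
```

After pruning, only the kept branches would need to be computed at inference time, which is where a complexity reduction of this kind would come from.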

Abstract

The choice of an optimal time-frequency resolution is usually a difficult but important step in tasks involving speech signal classification, e.g., speech anti-spoofing. The variations of the performance with different choices of time-frequency resolutions can be as large as those with different model architectures, which makes it difficult to judge what the improvement actually comes from when a new network architecture is invented and introduced as the classifier. In this paper, we propose a multi-resolution front-end for feature extraction in an end-to-end classification framework. Optimal weighted combinations of multiple time-frequency resolutions will be learned automatically given the objective of a classification task. Features extracted with different time-frequency resolutions are weighted and concatenated as inputs to the successive networks, where the weights are predicted by…
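
A minimal sketch of the weighted, concatenated multi-resolution features described in the abstract, using NumPy only. The helper names (`stft_logmag`, `multi_resolution_features`), the window lengths, the shared hop size and FFT size, and the concatenation axis are all assumptions chosen for illustration. The weights are fixed placeholders: the abstract's account of how they are predicted is cut off, so no predictor is modelled here.

```python
import numpy as np

def stft_logmag(signal, win_len, hop, n_fft):
    """Log-magnitude spectrogram with a Hann window.

    Returns an array of shape (n_frames, n_fft // 2 + 1).
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    # Zero-padding each frame to n_fft gives every resolution the same bin count.
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(spec + 1e-8)

def multi_resolution_features(signal, win_lens, weights, hop=160, n_fft=1024):
    """Weight each resolution's features and concatenate along the feature axis.

    Short windows favour time resolution, long windows favour frequency
    resolution. Assumption: `weights` are supplied by the caller; in the
    paper they are learned jointly with the classifier.
    """
    feats = [stft_logmag(signal, w, hop, n_fft) for w in win_lens]
    # Longer windows yield slightly fewer frames; truncate to the shortest.
    n_frames = min(f.shape[0] for f in feats)
    weighted = [wt * f[:n_frames] for wt, f in zip(weights, feats)]
    return np.concatenate(weighted, axis=1)

# One second of noise at an assumed 16 kHz sampling rate.
rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)
features = multi_resolution_features(
    audio, win_lens=(256, 512, 1024), weights=(0.5, 0.3, 0.2)
)
print(features.shape)  # (frames, 3 * 513)
```

In an end-to-end setup the placeholder weights would be replaced by trainable parameters, so the classification loss itself decides which resolutions matter.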

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Infant Health and Development