Reliable Visualization for Deep Speaker Recognition

Pengqi Li; Lantian Li; Askar Hamdulla; Dong Wang

arXiv:2204.03852 · cs.SD · April 13, 2022

TL;DR

This paper evaluates visualization methods for CNN-based speaker recognition, finding Layer-CAM to be the most reliable tool for interpreting model decisions in this domain.

Contribution

It provides an extensive analysis of CAM-based visualization methods, identifying Layer-CAM as a reliable tool for explaining CNNs in speaker recognition.

Findings

01 Layer-CAM produces reliable visualizations for speaker models.

02 Grad-CAM and Score-CAM are less reliable in this context.

03 The study enhances interpretability of CNNs in speaker recognition.

Abstract

In spite of the impressive success of convolutional neural networks (CNNs) in speaker recognition, our understanding of CNNs' internal functions is still limited. A major obstacle is that some popular visualization tools, for example those producing saliency maps, are difficult to apply. The reason is that speaker information does not show clear spatial patterns in the temporal-frequency space, which makes it hard to interpret the visualization results and hence hard to confirm the reliability of a visualization tool. In this paper, we conduct an extensive analysis of three popular CAM-based visualization methods (Grad-CAM, Score-CAM and Layer-CAM) to investigate their reliability for speaker recognition tasks. Experiments conducted on a state-of-the-art ResNet34SE model show that the Layer-CAM algorithm can produce reliable visualization, and thus can be used as a promising tool to…
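To illustrate two of the methods named in the abstract, here is a minimal NumPy sketch (not the paper's code) of how Grad-CAM and Layer-CAM each turn one layer's activations and gradients into a saliency map. The function names, the (channels, frequency, time) array layout, and the random inputs are illustrative assumptions; in practice the activations and gradients would be taken from a trained speaker model. Score-CAM is left out because it derives its weights from extra forward passes through the model rather than from gradients.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Grad-CAM map from arrays of shape (channels, freq, time)."""
    # One weight per channel: the gradient averaged over all positions.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the channel activations, followed by ReLU.
    cam = np.tensordot(weights, activations, axes=1)
    return np.maximum(cam, 0.0)


def layer_cam(activations, gradients):
    """Layer-CAM map from arrays of shape (channels, freq, time)."""
    # One weight per channel AND position: the positive part of the gradient.
    weights = np.maximum(gradients, 0.0)
    # Element-wise weighting, summed over channels, followed by ReLU.
    cam = (weights * activations).sum(axis=0)
    return np.maximum(cam, 0.0)


# Hypothetical stand-ins for a layer's activations and the gradients of the
# target speaker score with respect to them (4 channels, 8 bins, 10 frames).
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(4, 8, 10)), 0.0)
grads = rng.normal(size=(4, 8, 10))

gc_map = grad_cam(acts, grads)
lc_map = layer_cam(acts, grads)
```

Both maps have the frequency-by-time shape of the layer, so they can be upsampled and overlaid on the input spectrogram. The only difference in this sketch is the weighting: Grad-CAM applies a single averaged weight to each channel, while Layer-CAM keeps a separate weight at every time-frequency position.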

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques