What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure

Jui Shah; Yaman Kumar Singla; Changyou Chen; Rajiv Ratn Shah

arXiv:2101.00387·cs.CL·July 14, 2021·6 cites

TL;DR

This paper investigates what audio transformer models such as Mockingjay and wav2vec 2.0 learn about speech. It probes their internal representations for language delivery and structure features, and compares them with BERT across diverse speech datasets.

Contribution

It provides a comprehensive analysis of the linguistic and acoustic features captured by audio transformers and evaluates the optimal layers for downstream tasks.

Findings

01

Audio transformers encode various linguistic features.

02

The last layer is not always optimal for downstream tasks.

03

Different models capture different aspects of speech.
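
The layer-wise comparison behind the second finding can be illustrated with a small sketch. It is not the paper's setup: the "layers" are synthetic feature vectors with a hand-picked amount of label signal, and the probe is a nearest-centroid classifier swapped in for simplicity. The names `make_layer`, `probe_accuracy`, and the `strengths` values are hypothetical and chosen only for illustration.

```python
import random


def make_layer(labels, strength, dim, rng):
    """Synthetic stand-in for one layer's pooled embeddings (not real model output)."""
    return [[strength * y + rng.gauss(0.0, 1.0) for _ in range(dim)] for y in labels]


def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(train_x, train_y, test_x, test_y):
    """Nearest-centroid probe: fit class means on train, score accuracy on test."""
    classes = sorted(set(train_y))
    cents = {
        c: centroid([x for x, y in zip(train_x, train_y) if y == c]) for c in classes
    }
    correct = sum(
        min(classes, key=lambda c: sq_dist(x, cents[c])) == y
        for x, y in zip(test_x, test_y)
    )
    return correct / len(test_y)


rng = random.Random(0)
labels = [i % 2 for i in range(400)]  # balanced binary property to probe for

# Assumed signal strength per layer: the middle layer carries the most,
# the last layer carries less. These numbers are invented for the demo.
strengths = [0.0, 3.0, 0.5]

accuracies = []
for s in strengths:
    feats = make_layer(labels, s, 4, rng)
    # First half trains the probe, second half is held out for scoring.
    accuracies.append(
        probe_accuracy(feats[:200], labels[:200], feats[200:], labels[200:])
    )

best_layer = max(range(len(accuracies)), key=accuracies.__getitem__)
print("held-out accuracy per layer:", accuracies)
print("best layer index:", best_layer)
```

With real models, the synthetic features would be replaced by hidden states extracted from each transformer layer, and the same held-out comparison would show which layer best encodes a given feature.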

Abstract

In recent times, BERT-based transformer models have become an inseparable part of the 'tech stack' of text processing models. Similar progress is being observed in the speech domain, with a multitude of models achieving state-of-the-art results by using audio transformer models to encode speech. This raises the question of what these audio transformer models are learning. Moreover, although the standard methodology is to choose the last-layer embedding for any downstream task, is it the optimal choice? We try to answer these questions for two recent audio transformer models, Mockingjay and wav2vec 2.0. We compare them on a comprehensive set of language delivery and structure features, including audio, fluency, and pronunciation features. Additionally, we probe the audio models' understanding of textual surface, syntax, and semantic features and compare them to BERT. We do this over…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling

Methods: Linear Layer · Weight Decay · Linear Warmup With Linear Decay · Softmax · Dropout · Dense Connections · Multi-Head Attention · Attention Is All You Need · WordPiece · Attention Dropout