Audio-Linguistic Embeddings for Spoken Sentences

Albert Haque; Michelle Guo; Prateek Verma; Li Fei-Fei

arXiv:1902.07817 · cs.SD · February 22, 2019 · 5 cites

TL;DR

This paper introduces spoken sentence embeddings that capture both acoustic and linguistic content. By modeling speech at the sentence level, the embeddings learn long-term dependencies and outperform phoneme and word-level baselines on speech recognition and emotion recognition.

Contribution

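The paper formulates embedding learning as an audio-linguistic multitask problem: an encoder-decoder model simultaneously reconstructs acoustic and natural language features from audio, and the resulting sentence embeddings outperform phoneme and word-level baselines.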

Findings

01. Outperforms phoneme and word-level baselines on speech recognition and emotion recognition.

02. Embeddings better model high-level acoustic concepts while retaining linguistic content.

03. Demonstrates the viability of generic, multi-modal sentence embeddings for spoken language understanding.

Abstract

We propose spoken sentence embeddings which capture both acoustic and linguistic content. While existing works operate at the character, phoneme, or word level, our method learns long-term dependencies by modeling speech at the sentence level. Formulated as an audio-linguistic multitask learning problem, our encoder-decoder model simultaneously reconstructs acoustic and natural language features from audio. Our results show that spoken sentence embeddings outperform phoneme and word-level baselines on speech recognition and emotion recognition tasks. Ablation studies show that our embeddings can better model high-level acoustic concepts while retaining linguistic content. Overall, our work illustrates the viability of generic, multi-modal sentence embeddings for spoken language understanding.
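
The multitask formulation in the abstract can be illustrated with a toy sketch. This is not the authors' model: the tanh encoder with mean pooling, the two linear decoder heads, the mean-squared-error losses, the `text_weight` parameter, and all sizes are assumptions chosen to keep the example small. It only shows the general shape of the idea, where one sentence-level vector is trained to reconstruct both an acoustic target and a linguistic target.

```python
import math
import random

random.seed(0)

# All sizes are illustrative assumptions, not values from the paper.
N_FRAMES, N_FEATS, EMB_DIM, TEXT_DIM = 50, 12, 6, 4


def rand_matrix(rows, cols, scale=0.1):
    """Random (rows x cols) weight matrix as a list of lists."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, mat):
    """Row vector (len rows) times matrix (rows x cols) -> list of len cols."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def mean_over_time(seq):
    """Average a list of equal-length vectors elementwise."""
    return [sum(col) / len(seq) for col in zip(*seq)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def encode(params, frames):
    """Map per-frame audio features to a single sentence-level embedding."""
    hidden = [[math.tanh(h) for h in matvec(f, params["enc"])] for f in frames]
    return mean_over_time(hidden)  # pool over time -> EMB_DIM values


def multitask_loss(params, frames, text_target, text_weight=1.0):
    """Acoustic reconstruction loss plus a weighted linguistic loss."""
    z = encode(params, frames)
    # Hypothetical acoustic target: the time-averaged input frame.
    audio_loss = mse(matvec(z, params["dec_audio"]), mean_over_time(frames))
    # Hypothetical linguistic target: a fixed-size text feature vector.
    text_loss = mse(matvec(z, params["dec_text"]), text_target)
    return audio_loss + text_weight * text_loss


params = {
    "enc": rand_matrix(N_FEATS, EMB_DIM),
    "dec_audio": rand_matrix(EMB_DIM, N_FEATS),
    "dec_text": rand_matrix(EMB_DIM, TEXT_DIM),
}
# Random stand-ins for a spectrogram and for text features of the same sentence.
frames = [[random.gauss(0.0, 1.0) for _ in range(N_FEATS)] for _ in range(N_FRAMES)]
text_target = [random.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]

embedding = encode(params, frames)
loss = multitask_loss(params, frames, text_target)
print(len(embedding), loss)
```

Because both decoder heads read the same embedding, minimizing the combined loss pushes that vector to carry acoustic and linguistic information at once. Setting `text_weight` to zero reduces the sketch to a purely acoustic autoencoder, which is one way to picture the kind of ablation the abstract mentions.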

Code & Models

Repositories

awesomericky/Speech2Pickup (TensorFlow)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling