End-to-end speech-to-dialog-act recognition

Viet-Trung Dang; Tianyu Zhao; Sei Ueno; Hirofumi Inaguma; Tatsuya Kawahara

arXiv:2004.11419·cs.SD·July 30, 2020·1 cites

Open Access

TL;DR

This paper introduces an end-to-end speech-to-dialog-act recognition model that integrates acoustic features and dialog act detection, improving accuracy and robustness over traditional pipeline methods.

Contribution

The paper presents a novel end-to-end model combining acoustic-to-word ASR with dialog act recognition, enabling joint training and improved performance.

Findings

01. Significant accuracy improvement over traditional methods

02. Robustness against ASR errors demonstrated

03. Joint DA segmentation further enhances results

Abstract

Spoken language understanding, which extracts intents and/or semantic concepts from utterances, is conventionally formulated as a post-processing step after automatic speech recognition (ASR). It is usually trained with oracle transcripts, but needs to deal with ASR errors. Moreover, there are acoustic features which are related to intents but not represented in the transcripts. In this paper, we present an end-to-end model which directly converts speech into dialog acts without the deterministic transcription process. In the proposed model, the dialog act recognition network is joined to an acoustic-to-word ASR model at its latent layer before the softmax layer, which provides a distributed representation of word-level ASR decoding information. Then, the entire network is fine-tuned in an end-to-end manner. This allows for stable training as well as robustness against ASR errors. The model…
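
The connection described in the abstract, where the dialog act (DA) network reads the ASR model's latent layer rather than its decoded words, can be illustrated with a toy forward pass. This is only a minimal sketch under stated assumptions, not the authors' implementation: the layer sizes, the random weights, the mean pooling over word positions, and all names (`W_word`, `W_da`, `recognize`) are hypothetical, and the random `latent` vectors stand in for the states a real acoustic-to-word model would produce from speech.

```python
import math
import random

random.seed(0)

# Hypothetical toy sizes; a real model would be far larger.
HIDDEN, VOCAB, NUM_DA = 4, 6, 3


def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


W_word = rand_matrix(VOCAB, HIDDEN)  # ASR output projection feeding the word softmax
W_da = rand_matrix(NUM_DA, HIDDEN)   # DA classifier attached to the same latent layer


def recognize(latent_states):
    """latent_states: one HIDDEN-dim vector per word position, taken before the word softmax."""
    # ASR branch: word posteriors per position.
    word_posteriors = [softmax(matvec(W_word, h)) for h in latent_states]
    # DA branch: consumes the latent vectors directly, never the argmax words.
    # Mean pooling is an assumption made for this sketch.
    pooled = [sum(column) / len(latent_states) for column in zip(*latent_states)]
    da_posterior = softmax(matvec(W_da, pooled))
    return word_posteriors, da_posterior


latent = rand_matrix(5, HIDDEN)  # stand-in for five word-level latent states
words, da = recognize(latent)
print("DA posterior:", [round(p, 3) for p in da])
```

Because the DA branch never sees a hard word decision, a wrong top-1 word does not by itself discard the information carried in the latent vector, which is the intuition behind the robustness claim. In a trainable version, gradients from the DA loss would flow back through the shared latent layer, which is what end-to-end fine-tuning refers to.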

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Topic Modeling

Methods: Softmax