Towards End-to-End Integration of Dialog History for Improved Spoken Language Understanding
Vishal Sunder, Samuel Thomas, Hong-Kwang J. Kuo, Jatin Ganhotra, Brian Kingsbury, Eric Fosler-Lussier

TL;DR
This paper introduces a fully end-to-end spoken language understanding model that directly incorporates dialog history in speech form, improving performance and robustness without relying on cascaded ASR systems.
Contribution
It proposes a hierarchical speech-based conversation model with semantic knowledge distillation and a novel DropFrame technique for efficient training, advancing end-to-end dialog understanding.
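As a rough illustration of what "hierarchical" means here, the sketch below encodes each spoken turn into one embedding (lower level) and then folds those embeddings into a running conversation state (upper level), so the representation of the current turn depends on the speech of earlier turns. Everything in it is an assumption made for illustration: mean pooling stands in for a trained speech encoder, an exponential running average stands in for a trained conversation-level encoder, and the names `encode_utterance`, `encode_conversation`, and `decay` are hypothetical and do not come from the paper.

```python
from typing import List

Frame = List[float]        # one acoustic feature vector
Utterance = List[Frame]    # the frames of a single spoken turn


def encode_utterance(frames: Utterance) -> List[float]:
    """Lower level: pool the frames of one turn into a single embedding.

    Mean pooling is a placeholder for a learned speech encoder.
    """
    if not frames:
        raise ValueError("utterance has no frames")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def encode_conversation(utterances: List[Utterance],
                        decay: float = 0.5) -> List[List[float]]:
    """Upper level: causal summary over the utterance embeddings.

    The state for turn t mixes the embedding of turn t with the state
    carried over from turns 0..t-1, so no future turn is ever used.
    """
    states: List[List[float]] = []
    context = None
    for utt in utterances:
        emb = encode_utterance(utt)
        if context is None:
            context = emb  # first turn: there is no history yet
        else:
            context = [decay * c + (1.0 - decay) * e
                       for c, e in zip(context, emb)]
        states.append(context)
    return states


# Two turns with 2-dimensional frames; the second state reflects both turns.
dialog = [[[1.0, 1.0], [3.0, 3.0]], [[4.0, 0.0]]]
states = encode_conversation(dialog)
```

A per-turn classifier (for example, for dialog actions) would read each entry of `states` rather than the isolated utterance embedding, which is how the history reaches the prediction without any intermediate transcript.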
Findings
Outperforms history-independent baseline by 7.7% F1 score.
Achieves competitive results with fewer parameters than cascaded models.
When no gold transcripts are available to fine-tune the ASR model, outperforms the ASR-dependent baseline by 10% F1 score.
Abstract
Dialog history plays an important role in spoken language understanding (SLU) performance in a dialog system. For end-to-end (E2E) SLU, previous work has used dialog history in text form, which makes the model dependent on a cascaded automatic speech recognizer (ASR). This rescinds the benefits of an E2E system which is intended to be compact and robust to ASR errors. In this paper, we propose a hierarchical conversation model that is capable of directly using dialog history in speech form, making it fully E2E. We also distill semantic knowledge from the available gold conversation transcripts by jointly training a similar text-based conversation model with an explicit tying of acoustic and semantic embeddings. We also propose a novel technique that we call DropFrame to deal with the long training time incurred by adding dialog history in an E2E manner. On the HarperValleyBank dialog…
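The excerpt above names DropFrame and the tying of acoustic and semantic embeddings but does not say how either is implemented. The sketch below is therefore a guess at the general shape, not the authors' method. It assumes that DropFrame shortens training inputs by randomly discarding acoustic frames, and that the tying is a mean-squared-error penalty between the speech-based and text-based embeddings of the same utterance. The names `drop_frames`, `keep_prob`, and `tying_loss` are hypothetical.

```python
import random
from typing import List, Sequence


def drop_frames(frames: Sequence, keep_prob: float,
                rng: random.Random) -> List:
    """Assumed DropFrame: keep each frame independently with keep_prob.

    Order is preserved, and at least one frame survives for a non-empty
    input so the downstream encoder always has something to consume.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if not frames:
        return []
    kept = [f for f in frames if rng.random() < keep_prob]
    return kept if kept else [frames[rng.randrange(len(frames))]]


def tying_loss(speech_emb: Sequence[float],
               text_emb: Sequence[float]) -> float:
    """Assumed tying term: MSE between the two embeddings of one utterance."""
    if len(speech_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimension")
    if not speech_emb:
        raise ValueError("embeddings must be non-empty")
    return sum((s - t) ** 2
               for s, t in zip(speech_emb, text_emb)) / len(speech_emb)


rng = random.Random(0)
frames = list(range(100))                  # stand-in for 100 acoustic frames
shortened = drop_frames(frames, keep_prob=0.5, rng=rng)
penalty = tying_loss([1.0, 2.0], [1.0, 4.0])
```

Under these assumptions, dropping would be applied only during training, with full-length inputs at inference time, and the tying term would be added to the task loss so that the speech branch is pulled toward the text branch trained on gold transcripts.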
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Human Pose and Action Recognition
