Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training

Jianfeng He; Julian Salazar; Kaisheng Yao; Haoqi Li; Jinglun Cai

arXiv:2305.12793 · eess.AS · February 6, 2024 · 2 cites

TL;DR

This paper introduces a novel cross-modal selective self-training method for zero-shot end-to-end spoken language understanding, effectively addressing domain mismatch, imbalance, and noise issues to improve performance without speech-semantics pairs.

Contribution

The paper proposes CMSST, a new approach that uses clustering and a selection network to enhance zero-shot E2E SLU, along with new benchmarks for matched and mismatched speech scenarios.

Findings

01. CMSST outperforms previous methods in zero-shot SLU tasks.
02. Significant reduction in training data and time needed.
03. Effective handling of domain mismatch and noise issues.

Abstract

End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore zero-shot E2E SLU, which learns E2E SLU without speech-semantics pairs, instead using only speech-text and text-semantics pairs. Previous work achieved zero-shot E2E SLU by pseudolabeling all speech-text transcripts with a natural language understanding (NLU) model learned on text-semantics corpora. However, this method requires the domains of the speech-text and text-semantics corpora to match, and they often do not because the corpora are collected separately. Furthermore, using the entire collected speech-text corpus from any domain leads to imbalance and noise issues. To address these, we propose cross-modal selective self-training (CMSST). CMSST tackles imbalance by clustering in a joint space of the three…
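
The pseudolabeling baseline that the abstract attributes to previous work can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_nlu`, `pseudolabel`, the file names, and the intent labels are all hypothetical, and the keyword rule merely stands in for an NLU model trained on text-semantics pairs.

```python
from typing import Callable, List, Tuple

SpeechText = Tuple[str, str]       # (audio_path, transcript)
SpeechSemantics = Tuple[str, str]  # (audio_path, pseudo_label)


def toy_nlu(text: str) -> str:
    """Hypothetical stand-in for an NLU model learned on text-semantics pairs."""
    lowered = text.lower()
    if "play" in lowered:
        return "play_music"
    if "weather" in lowered:
        return "get_weather"
    return "unknown"


def pseudolabel(
    pairs: List[SpeechText], nlu: Callable[[str], str]
) -> List[SpeechSemantics]:
    """Label every transcript with the NLU model and keep the audio side.

    The result is a set of speech-semantics pairs obtained without any
    human-annotated speech-semantics data.
    """
    return [(audio, nlu(text)) for audio, text in pairs]


# Assumed toy speech-text corpus; the last utterance is out of domain.
speech_text = [
    ("utt1.wav", "Play some jazz"),
    ("utt2.wav", "What is the weather today"),
    ("utt3.wav", "Turn left"),
]

pseudo_pairs = pseudolabel(speech_text, toy_nlu)
```

In this toy corpus the out-of-domain utterance still receives a label, which hints at why labeling the entire speech-text corpus can introduce the noise the abstract describes.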

Code & Models

Repositories

amazon-science/zero-shot-e2e-slu
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling