Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training
Jianfeng He, Julian Salazar, Kaisheng Yao, Haoqi Li, Jinglun Cai

TL;DR
This paper introduces a novel cross-modal selective self-training method for zero-shot end-to-end spoken language understanding, effectively addressing domain mismatch, imbalance, and noise issues to improve performance without speech-semantics pairs.
Contribution
The paper proposes CMSST, a new approach that uses clustering and a selection network to enhance zero-shot E2E SLU, along with new benchmarks for matched and mismatched speech scenarios.
Findings
CMSST outperforms previous methods in zero-shot SLU tasks.
CMSST significantly reduces the training data and training time needed.
CMSST handles domain mismatch and noise effectively.
Abstract
End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore zero-shot E2E SLU, which learns E2E SLU without speech-semantics pairs, instead using only speech-text and text-semantics pairs. Previous work achieved zero-shot learning by pseudolabeling all speech-text transcripts with a natural language understanding (NLU) model learned on text-semantics corpora. However, this method requires the speech-text and text-semantics domains to match, and they often do not because the two corpora are collected separately. Furthermore, using the entire collected speech-text corpus, drawn from any domain, leads to imbalance and noise issues. To address these, we propose cross-modal selective self-training (CMSST). CMSST tackles imbalance by clustering in a joint space of the three…
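The pseudolabeling baseline that the abstract attributes to previous work can be illustrated with a small sketch. This is not the authors' code and not CMSST: the token-voting "NLU model", the function names, the intent labels, and the file names are all hypothetical stand-ins chosen to keep the example self-contained. It only shows how text-semantics pairs and speech-text pairs might be combined into speech-semantics pairs without any filtering.

```python
from collections import Counter, defaultdict


def train_toy_nlu(text_semantics_pairs):
    """Count token/label co-occurrences (a toy stand-in for a real NLU model)."""
    token_label_counts = defaultdict(Counter)
    for text, label in text_semantics_pairs:
        for token in text.lower().split():
            token_label_counts[token][label] += 1
    return token_label_counts


def predict(nlu, text):
    """Return the label with the most token votes, or None if no token is known."""
    votes = Counter()
    for token in text.lower().split():
        votes.update(nlu.get(token, {}))
    return votes.most_common(1)[0][0] if votes else None


def pseudolabel(nlu, speech_text_pairs):
    """Label every transcript, keeping all utterances (no selection step)."""
    return [(audio, predict(nlu, transcript)) for audio, transcript in speech_text_pairs]


# Hypothetical text-semantics corpus (text, intent label).
text_semantics = [("play some jazz", "PlayMusic"), ("set an alarm", "SetAlarm")]
# Hypothetical speech-text corpus (audio file name, transcript).
speech_text = [("utt_001.wav", "play jazz"), ("utt_002.wav", "what a nice day")]

nlu = train_toy_nlu(text_semantics)
speech_semantics = pseudolabel(nlu, speech_text)
```

In this toy setup the second utterance falls outside the text-semantics domain and receives no usable label. A real NLU model would typically still emit some label for it, which is one way the noise described in the abstract could enter the pseudolabeled training set.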
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
