Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects
Kalvin Chang, Yiwen Shao, Jiahong Li, Dong Yu

TL;DR
This paper develops a speech encoder that aligns Chinese dialects semantically with Mandarin using only ASR data, enabling speech-to-speech retrieval and advancing dialect speech technologies.
Contribution
It introduces a novel speech encoder trained with ASR data for cross-dialect semantic alignment and provides a new Chinese dialect benchmark for speech-to-speech retrieval.
Findings
Achieved state-of-the-art ASR performance on Chinese dialects
Demonstrated effective speech-to-speech retrieval across dialects
Provided a new benchmark for Chinese dialect speech evaluation
Abstract
Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, making dialect-to-Mandarin speech-LLMs (large language models) more practical than dialect LLMs. Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder further demonstrates state-of-the-art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the…
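The abstract names speech-to-speech retrieval as the test of cross-dialect alignment but does not spell out the scoring. A minimal sketch of one common way to score such retrieval follows, assuming each utterance is already encoded as a fixed-size vector and that the dialect utterance at index i and the Mandarin utterance at index i carry the same meaning. The function names, the choice of cosine similarity with recall@1, and the toy vectors are all illustrative assumptions, not the paper's encoder, metric, or data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(queries, candidates):
    """Fraction of queries whose nearest candidate is the paired one.

    Assumes queries[i] and candidates[i] are the same sentence
    spoken in two different varieties (hypothetical pairing).
    """
    hits = 0
    for i, q in enumerate(queries):
        scores = [cosine(q, c) for c in candidates]
        best = max(range(len(candidates)), key=lambda j: scores[j])
        hits += int(best == i)
    return hits / len(queries)


# Toy 3-dimensional "utterance embeddings": made-up numbers, not model outputs.
dialect_emb = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
mandarin_emb = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.1, 1.0]]

score = recall_at_1(dialect_emb, mandarin_emb)
print(score)
```

With a real encoder, the toy lists would be replaced by pooled utterance embeddings. A higher recall@1 would then indicate that utterances with the same meaning sit close together regardless of the variety they are spoken in.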
Taxonomy
Topics: Speech Recognition and Synthesis · Authorship Attribution and Profiling · Natural Language Processing Techniques
