Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects

Kalvin Chang; Yiwen Shao; Jiahong Li; Dong Yu

arXiv:2601.07274 · cs.CL · January 13, 2026

TL;DR

This paper develops a speech encoder that aligns Chinese dialects semantically with Mandarin using only ASR data, enabling speech-to-speech retrieval and advancing dialect speech technologies.

Contribution

The paper introduces a speech encoder, trained only on ASR data, that aligns Chinese dialects with Mandarin semantically, and it contributes a new benchmark of spoken Chinese varieties for evaluating speech-to-speech retrieval.

Findings

01 Achieved state-of-the-art ASR performance on Chinese dialects

02 Demonstrated effective speech-to-speech retrieval across dialects

03 Provided a new benchmark for Chinese dialect speech evaluation

Abstract

Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, making dialect-to-Mandarin speech-LLMs (large language models) more practical than dialect LLMs. Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder further demonstrates state-of-the-art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the…
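
The abstract does not spell out how speech-to-speech retrieval is scored, so the following is only a minimal sketch of one common way to measure cross-dialect alignment. It assumes each utterance has already been reduced to a single non-zero embedding vector, that dialect query i and Mandarin candidate i carry the same meaning, and that similarity is cosine similarity. The names `cosine` and `recall_at_k` and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(queries, candidates, k=1):
    """Fraction of queries whose paired candidate (same index) ranks in the top k.

    queries:    list of dialect utterance embeddings (hypothetical input)
    candidates: list of Mandarin utterance embeddings, aligned by index
    """
    hits = 0
    for i, query in enumerate(queries):
        sims = [cosine(query, cand) for cand in candidates]
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(candidates)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy data: three Mandarin embeddings and three slightly perturbed dialect embeddings.
mandarin = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
dialect = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]

score = recall_at_k(dialect, mandarin, k=1)
print(f"recall@1 = {score:.2f}")
```

Under these assumptions, a well-aligned encoder would place each dialect utterance closest to its Mandarin counterpart, giving a recall@1 near 1.0, while an unaligned encoder would score near chance (1 divided by the number of candidates).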

Taxonomy

Topics: Speech Recognition and Synthesis · Authorship Attribution and Profiling · Natural Language Processing Techniques