Ti-Audio: The First Multi-Dialectal End-to-End Speech LLM for Tibetan

Jialing Wang; Yue Zhao; Yuhao Zhang; Jing Yu; Shaosai Li; Zhanchen Dai; Benyou Wang; Haizhou Li

arXiv:2604.11110·cs.SD·April 29, 2026

TL;DR

Ti-Audio is a multi-dialectal end-to-end Speech-LLM for Tibetan. It combines a Dynamic Q-Former Adapter for speech-text alignment with mutual assistance among related dialects to handle low-resource, dialect-diverse conditions.

Contribution

The paper introduces Ti-Audio, the first multi-dialectal Tibetan Speech-LLM, with a Dynamic Q-Former Adapter and a mutual assistance strategy for low-resource dialectal speech tasks.

Findings

01. Achieves state-of-the-art results on Tibetan speech benchmarks.

02. Effectively utilizes dialectal mutual assistance to improve performance.

03. Validates cross-dialectal cooperation as a scalable approach.

Abstract

Speech Large Language Models (Speech-LLMs) have made significant progress in recent years, greatly enhancing multimodal interaction capabilities. However, their application in low-resource and dialect-diverse environments still faces challenges. The severe scarcity of Tibetan data, coupled with the phonetic differences among its major dialects (Ü-Tsang, Amdo, and Kham), is a prime example of this challenge. This paper proposes Ti-Audio, the first multi-dialectal end-to-end Speech-LLM for Tibetan. To efficiently align speech and text, we introduce a Dynamic Q-Former Adapter that extracts essential acoustic features from variable-length speech, ensuring stable cross-modal alignment even with limited data. At the data level, we leverage mutual assistance among related dialects to alleviate data scarcity and employ a temperature-based sampling strategy to maximize this synergy.…
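
The abstract names a temperature-based sampling strategy but does not give its formula. The sketch below illustrates one common form from multilingual training, where the probability of drawing from dialect i is proportional to n_i^(1/T), with n_i the amount of data for that dialect. This form is an assumption, not necessarily the paper's method. The function names and the hour counts are hypothetical and chosen only for illustration.

```python
import random


def temperature_sampling_probs(counts, temperature):
    """Return sampling probabilities proportional to count ** (1 / temperature).

    temperature == 1 reproduces the raw data proportions; larger values
    flatten the distribution toward uniform, upweighting small dialects.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = {name: n ** (1.0 / temperature) for name, n in counts.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_dialect(probs, rng):
    """Draw one dialect name according to the given probabilities."""
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


# Hypothetical hours of speech per dialect (made-up numbers, not from the paper).
hours = {"utsang": 800, "amdo": 150, "kham": 50}

p_prop = temperature_sampling_probs(hours, temperature=1.0)  # raw proportions
p_flat = temperature_sampling_probs(hours, temperature=5.0)  # flattened

rng = random.Random(0)
batch_dialects = [sample_dialect(p_flat, rng) for _ in range(8)]
```

Under this assumed form, raising the temperature trades fidelity to the natural data distribution for more frequent exposure to the scarcer dialects, which is one plausible way to balance dialects during joint training.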
