TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs

Jing Peng; Chenghao Wang; Yi Yang; Lirong Qian; Junjie Li; Yu Xi; Shuai Wang; Kai Yu

arXiv:2604.08384 · eess.AS · April 10, 2026

TL;DR

TASU2 is a controllable CTC simulation framework for low-resource adaptation of speech LLMs. It generates text-derived supervision within a specified WER range, so supervision difficulty can be varied without TTS, and it improves recognition accuracy over TASU.

Contribution

The paper presents TASU2, a method for simulating CTC posteriors within a specified WER range, which supports principled curriculum design and improves adaptation performance.

Findings

01. TASU2 improves in-domain and out-of-domain recognition over TASU across multiple source-to-target adaptation settings.

02. It surpasses text-only fine-tuning and TTS-based augmentation baselines.

03. It reduces source-domain performance degradation.

Abstract

Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose TASU2, a controllable CTC simulation framework that simulates CTC posterior distributions under a specified WER range, producing text-derived supervision that better matches the acoustic decoding interface. This enables principled post-training curricula that smoothly vary supervision difficulty without TTS. Across multiple source-to-target adaptation settings, TASU2 improves in-domain and out-of-domain recognition over TASU, and consistently outperforms strong baselines including…
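
To make the idea of simulating CTC posteriors from a transcript more concrete, the following is a minimal illustrative sketch. It is not the TASU2 algorithm, which this page does not describe in detail. Everything in it is an assumption made for illustration: the function names `simulate_ctc_posteriors` and `greedy_ctc_decode`, the blank index 0, the `peak` confidence, and the substitution-only error model, which controls a token-level error rate rather than a true WER.

```python
import random

BLANK = 0  # assumed index of the CTC blank symbol


def simulate_ctc_posteriors(tokens, vocab_size, error_rate,
                            peak=0.9, max_repeat=3, seed=None):
    """Build a frame-level, CTC-like posterior matrix from token ids.

    Hypothetical sketch: each token is replaced by a random non-blank
    token with probability `error_rate`, repeated for 1..max_repeat
    frames, and followed by one blank frame. Requires vocab_size >= 3.
    """
    rng = random.Random(seed)
    rest = (1.0 - peak) / (vocab_size - 1)  # mass shared by other labels
    frames = []
    for tok in tokens:
        if rng.random() < error_rate:
            # Substitution error: pick any other non-blank token.
            tok = rng.choice([t for t in range(1, vocab_size) if t != tok])
        for label in [tok] * rng.randint(1, max_repeat) + [BLANK]:
            row = [rest] * vocab_size
            row[label] = peak  # peaked distribution around the label
            frames.append(row)
    return frames


def greedy_ctc_decode(frames):
    """Standard greedy CTC decoding: argmax, merge repeats, drop blanks."""
    out, prev = [], BLANK
    for row in frames:
        label = max(range(len(row)), key=row.__getitem__)
        if label != BLANK and label != prev:
            out.append(label)
        prev = label
    return out
```

With `error_rate=0.0` the greedy decoding of the simulated frames recovers the input tokens exactly. Raising `error_rate` makes the simulated supervision harder, which is the kind of knob a difficulty curriculum could sweep over.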
