TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
Jing Peng, Chenghao Wang, Yi Yang, Lirong Qian, Junjie Li, Yu Xi, Shuai Wang, Kai Yu

TL;DR
TASU2 is a controllable CTC simulation framework for low-resource Speech LLM adaptation. It simulates CTC posteriors from text under a specified WER range, so supervision difficulty can be controlled precisely without TTS, and it improves recognition accuracy over TASU.
Contribution
The paper presents TASU2, a method for simulating CTC posteriors under a specified WER range, which supports principled curriculum design and improves adaptation performance.
Findings
TASU2 improves in-domain and out-of-domain recognition over TASU across multiple source-to-target adaptation settings.
It surpasses text-only fine-tuning and TTS-based augmentation baselines.
It reduces source-domain performance degradation.
Abstract
Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose TASU2, a controllable CTC simulation framework that simulates CTC posterior distributions under a specified WER range, producing text-derived supervision that better matches the acoustic decoding interface. This enables principled post-training curricula that smoothly vary supervision difficulty without TTS. Across multiple source-to-target adaptation settings, TASU2 improves in-domain and out-of-domain recognition over TASU, and consistently outperforms strong baselines including…
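To make the idea of text-derived CTC supervision with a controllable error rate concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the uniform mix of substitution, deletion, and insertion edits, the fixed `peak` probability, and the blank index are all assumptions chosen for illustration. The error rate is applied per token, so it only approximates a target WER in expectation.

```python
import random

BLANK = 0  # assumed index of the CTC blank symbol


def corrupt(tokens, vocab_size, error_rate, rng):
    """Apply random token-level edits (illustrative; requires vocab_size >= 3)."""
    out = []
    for tok in tokens:
        if rng.random() >= error_rate:
            out.append(tok)  # keep the token unchanged
            continue
        op = rng.choice(("sub", "del", "ins"))
        if op == "sub":
            # replace with a different non-blank token
            out.append(rng.choice([t for t in range(1, vocab_size) if t != tok]))
        elif op == "ins":
            # keep the token and add a random non-blank token after it
            out.append(tok)
            out.append(rng.randrange(1, vocab_size))
        # "del": emit nothing
    return out


def to_ctc_path(tokens, rng, max_repeat=3):
    """Expand tokens into a frame-level path with repeats and blanks."""
    path, prev = [BLANK], None
    for tok in tokens:
        if tok == prev and path[-1] != BLANK:
            path.append(BLANK)  # a blank must separate identical neighbours
        path.extend([tok] * rng.randint(1, max_repeat))
        if rng.random() < 0.5:
            path.append(BLANK)  # optional blank frame
        prev = tok
    return path


def to_posteriors(path, vocab_size, peak=0.9):
    """Put `peak` mass on the path symbol and spread the rest uniformly."""
    rest = (1.0 - peak) / (vocab_size - 1)
    rows = []
    for sym in path:
        row = [rest] * vocab_size
        row[sym] = peak
        rows.append(row)
    return rows


def greedy_decode(posteriors):
    """Standard greedy CTC decoding: argmax, collapse repeats, drop blanks."""
    out, prev = [], None
    for row in posteriors:
        sym = max(range(len(row)), key=row.__getitem__)
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out


def simulate(tokens, vocab_size, wer_range=(0.0, 0.3), seed=0):
    """Sample an error rate from wer_range, then simulate posteriors."""
    rng = random.Random(seed)
    error_rate = rng.uniform(*wer_range)
    noisy = corrupt(tokens, vocab_size, error_rate, rng)
    return to_posteriors(to_ctc_path(noisy, rng), vocab_size), noisy
```

Under these assumptions, a curriculum could be sketched by calling `simulate` with a `wer_range` that changes across training stages, for example starting near `(0.0, 0.05)` and widening later. Greedy decoding of the simulated posteriors returns the corrupted token sequence, which is what ties the simulated supervision to the sampled error rate.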
