Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers
Runyuan Cai, Yu Lin, Yiming Wang, Chunlin Fu, Xiaodong Zeng

TL;DR
This paper introduces GPA, a unified autoregressive model that integrates speech recognition, synthesis, and conversion tasks within a single framework, enhancing efficiency and cross-task generalization.
Contribution
The paper presents GPA, a novel unified audio foundation model that supports multiple speech tasks with a single architecture and shared discrete token space, enabling flexible and efficient multi-task performance.
Findings
Achieves competitive performance across speech tasks
Supports scalable inference with high throughput
Operates efficiently on resource-constrained devices
Abstract
Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization. In this paper, we present General-Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications. This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The…
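To make the idea of a shared discrete token space with instruction-driven task selection more concrete, the following is a minimal sketch of how prompts for TTS, ASR, and VC could be laid out as a single id sequence for one autoregressive model. It is not taken from the paper: the vocabulary sizes, the task markers (`<tts>`, `<asr>`, `<vc>`, `<sep>`), the id layout, and the `build_prompt` helper are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical sizes; the paper's actual vocabularies are not given in the abstract.
TEXT_VOCAB_SIZE = 1000
AUDIO_VOCAB_SIZE = 512

# Hypothetical task/control markers used to tell the model which task to perform.
TASK_TOKENS = ["<tts>", "<asr>", "<vc>", "<sep>"]

# Assumed layout of one shared id space: [text ids | audio codec ids | special ids].
AUDIO_OFFSET = TEXT_VOCAB_SIZE
SPECIAL_OFFSET = TEXT_VOCAB_SIZE + AUDIO_VOCAB_SIZE
SPECIAL_IDS: Dict[str, int] = {
    name: SPECIAL_OFFSET + i for i, name in enumerate(TASK_TOKENS)
}


def encode_text(ids: Sequence[int]) -> List[int]:
    """Text ids already occupy the start of the shared space; only validate."""
    for i in ids:
        if not 0 <= i < TEXT_VOCAB_SIZE:
            raise ValueError(f"text id out of range: {i}")
    return list(ids)


def encode_audio(codes: Sequence[int]) -> List[int]:
    """Shift discrete audio codes into their slice of the shared space."""
    for c in codes:
        if not 0 <= c < AUDIO_VOCAB_SIZE:
            raise ValueError(f"audio code out of range: {c}")
    return [AUDIO_OFFSET + c for c in codes]


def build_prompt(
    task: str,
    text: Sequence[int] = (),
    audio: Sequence[int] = (),
    reference_audio: Sequence[int] = (),
) -> List[int]:
    """Build the conditioning prefix; the model would continue it token by token."""
    sep = SPECIAL_IDS["<sep>"]
    if task == "tts":   # text in, audio tokens expected as continuation
        return [SPECIAL_IDS["<tts>"]] + encode_text(text) + [sep]
    if task == "asr":   # audio in, text tokens expected as continuation
        return [SPECIAL_IDS["<asr>"]] + encode_audio(audio) + [sep]
    if task == "vc":    # source audio plus reference voice in, audio out
        return (
            [SPECIAL_IDS["<vc>"]]
            + encode_audio(audio)
            + [sep]
            + encode_audio(reference_audio)
            + [sep]
        )
    raise ValueError(f"unknown task: {task}")
```

Under these assumptions, switching tasks changes only the leading marker and the modality of the prefix, which is one way a single model could serve all three tasks without architectural modifications.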
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
