Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers

Runyuan Cai; Yu Lin; Yiming Wang; Chunlin Fu; Xiaodong Zeng

arXiv:2601.10770 · cs.SD · January 19, 2026

Open Access

TL;DR

This paper introduces GPA, a unified autoregressive model that integrates speech recognition, synthesis, and conversion tasks within a single framework, enhancing efficiency and cross-task generalization.

Contribution

The paper presents GPA, a novel unified audio foundation model that supports multiple speech tasks with a single architecture and shared discrete token space, enabling flexible and efficient multi-task performance.

Findings

01 Achieves competitive performance across speech tasks
02 Supports scalable inference with high throughput
03 Operates efficiently on resource-constrained devices

Abstract

Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization. In this paper, we present General-Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications. This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The…
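The shared token space and instruction-driven task induction described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the vocabulary sizes, the special tokens (`<tts>`, `<asr>`, `<vc>`, `<sep>`, `<eos>`), the byte-level text encoding, and the helper names `text_ids`, `audio_ids`, `build_prompt`, and `modality` are hypothetical and are not taken from the paper or the released model. The sketch only shows how one flat id space could let a single autoregressive model receive all three tasks as ordinary token sequences.

```python
from typing import List, Optional

# Hypothetical vocabulary layout; the sizes are illustrative, not GPA's.
TEXT_VOCAB = 256        # byte-level text tokens occupy ids 0..255
AUDIO_CODEBOOK = 1024   # discrete audio codes occupy ids 256..1279
SPECIAL = ["<tts>", "<asr>", "<vc>", "<sep>", "<eos>"]  # assumed task markers

AUDIO_OFFSET = TEXT_VOCAB
SPECIAL_OFFSET = TEXT_VOCAB + AUDIO_CODEBOOK
SPECIAL_ID = {name: SPECIAL_OFFSET + i for i, name in enumerate(SPECIAL)}
VOCAB_SIZE = SPECIAL_OFFSET + len(SPECIAL)


def text_ids(text: str) -> List[int]:
    """Map text to byte-level token ids in the shared id space."""
    return list(text.encode("utf-8"))


def audio_ids(codes: List[int]) -> List[int]:
    """Shift raw codec indices into the audio region of the shared id space."""
    if any(not 0 <= c < AUDIO_CODEBOOK for c in codes):
        raise ValueError("audio code out of range")
    return [AUDIO_OFFSET + c for c in codes]


def build_prompt(task: str,
                 text: Optional[str] = None,
                 audio: Optional[List[int]] = None,
                 ref_audio: Optional[List[int]] = None) -> List[int]:
    """Build the prefix the model would continue autoregressively.

    tts: <tts> text <sep>                 (continuation: audio tokens)
    asr: <asr> audio <sep>                (continuation: text tokens)
    vc:  <vc> source <sep> reference <sep> (continuation: audio tokens)
    """
    sep = SPECIAL_ID["<sep>"]
    if task == "tts" and text is not None:
        return [SPECIAL_ID["<tts>"], *text_ids(text), sep]
    if task == "asr" and audio is not None:
        return [SPECIAL_ID["<asr>"], *audio_ids(audio), sep]
    if task == "vc" and audio is not None and ref_audio is not None:
        return [SPECIAL_ID["<vc>"], *audio_ids(audio), sep,
                *audio_ids(ref_audio), sep]
    raise ValueError(f"unsupported task or missing inputs: {task!r}")


def modality(token_id: int) -> str:
    """Report which region of the shared id space a token id falls in."""
    if 0 <= token_id < AUDIO_OFFSET:
        return "text"
    if token_id < SPECIAL_OFFSET:
        return "audio"
    if token_id < VOCAB_SIZE:
        return "special"
    raise ValueError("token id outside vocabulary")


if __name__ == "__main__":
    print(build_prompt("tts", text="hi"))
    print(build_prompt("asr", audio=[0, 5]))
    print(build_prompt("vc", audio=[1, 2], ref_audio=[3]))
```

Under this layout, switching tasks changes only the leading instruction token and the order of the segments; the model, its vocabulary, and its decoding loop stay the same, which is the property the abstract attributes to a unified design.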

Code & Models

Models

🤗 AutoArk-AI/GPA (model · 26 downloads · 11 likes)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research