DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations
Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Xiangang Li, Jieping Ye

TL;DR
DrVoice introduces a joint autoregressive model with dual-resolution speech representations for parallel speech-text generation, reducing computational costs and achieving state-of-the-art results on multiple benchmarks.
Contribution
The paper proposes a novel dual-resolution speech representation mechanism within a joint autoregressive framework, improving efficiency and performance in speech-text voice conversation models.
Findings
Achieves new SOTA results on OpenAudioBench, VoiceBench, UltraEval-Audio, and Big Bench Audio
Reduces computational cost by lowering the rate of the speech input to the LLM to 5Hz
Demonstrates effective exploitation of LLM capabilities in speech generation
Abstract
Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM's autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods utilize mainly 12.5Hz input audio representation, our…
Peer Reviews
Decision·ICLR 2026 Poster
1. The central idea of using dual-resolution speech representations is well-motivated and addresses a key, practical problem in joint speech-text modeling: the significant discrepancy in token rates between speech and text. The proposed grouping/ungrouping mechanism is an elegant solution that directly tackles this issue, leading to significant computational efficiency gains (processing at 5Hz within the LLM) without sacrificing output quality, thanks to the Speech Refined Head. 2. The paper pr
1. The paper notes that DrVoice's ASR-WER (11.2) is higher than that of Qwen2.5-Omni (3.48), suggesting weaker text-speech alignment. The authors hypothesize this is because Qwen2.5-Omni feeds text directly into its "Talker" module, while DrVoice only uses hidden states. This is a crucial architectural trade-off. While the proposed solution (adding text as input to SRH) is mentioned as future work, the current limitation is significant. High ASR-WER can indicate issues with intelligibility or wo
The paper extends existing parallel joint speech-text models with the following new methods: 1. The authors propose reducing the temporal resolution of the extracted audio representations from 25Hz to 5Hz, using speech token grouping, to match the 3Hz text representation. This helps reduce computational cost and alleviates the frequency discrepancy between the two representations. Leveraging fine-grained acoustic information in generative scenarios is intuitive. The evaluat
1. Although the grouping mechanism helps reduce the temporal resolution and align the two representations, the ungrouping step and the Speech Refined Head make the model architecture more complicated and may reduce the model's speed due to its autoregressive nature. 2. In the benchmark results, the model doesn't beat all the baselines on OpenAudioBench. The authors claim that the average score is the highest, but it is also true that the model does not show comparable performance on some tas
Originality: * The simultaneous use of both continuous and discrete audio representations is relatively under-investigated in multimodal LLMs. Chain of modality and model weight mixing are presented as novel contributions, and they are somewhat novel. In particular, model averaging by linear combination is not new, but its application to LLMs during training might be a novelty in experimental design. Quality: * The experimental results show competitive performance and the paper claims the SOTA results o
* The novelty is limited but sufficient (see the comments in the strengths section). Most of the individual components are well known (audio encoders, audio decoders, the LLM backbone, etc.). * The main text skips many details and defers them to the appendices, which makes the reviewer question whether the paper needs more pages and whether a conference setting is suitable for it. * Even though one of the major claims of the paper is the computational cost reducti
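The grouping and ungrouping steps that the reviews describe (25Hz speech representations reduced to 5Hz for the LLM, then expanded again for speech generation) can be illustrated with a small sketch. This is not the authors' implementation: the function names `group_frames` and `ungroup_frames` are hypothetical, the group size of 5 is inferred only from the 25Hz and 5Hz figures quoted above, and the sketch uses plain concatenation where a real model would most likely apply learned projections.

```python
from typing import List

Vector = List[float]


def group_frames(frames: List[Vector], k: int) -> List[Vector]:
    """Concatenate every k consecutive frames into one wider frame.

    Hypothetical helper: zero-pads the tail so the length is a multiple of k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    dim = len(frames[0]) if frames else 0
    pad = (-len(frames)) % k
    padded = frames + [[0.0] * dim for _ in range(pad)]
    return [
        [x for frame in padded[i:i + k] for x in frame]
        for i in range(0, len(padded), k)
    ]


def ungroup_frames(grouped: List[Vector], k: int) -> List[Vector]:
    """Split each grouped frame back into k frames (inverse of the concatenation)."""
    dim = len(grouped[0]) // k if grouped else 0
    return [g[j * dim:(j + 1) * dim] for g in grouped for j in range(k)]


# One second of toy 25Hz features: 25 frames, 2 dimensions each (made-up values).
frames = [[float(t), float(t) + 0.5] for t in range(25)]
grouped = group_frames(frames, k=5)      # 5 frames per second reach the LLM
restored = ungroup_frames(grouped, k=5)  # back to 25 frames for generation

print(len(frames), len(grouped), len(grouped[0]))  # 25 5 10
```

With a group size of 5, the sequence the LLM attends over is five times shorter, which is where the efficiency gain discussed in the reviews comes from; the price is a wider per-step input and the extra ungrouping stage that one reviewer flags as added complexity.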
Taxonomy
Topics: Speech Recognition and Synthesis
