Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR
Jaeyoung Lee, Masato Mimura

TL;DR
This paper introduces a decoder-only Conformer for ASR that processes speech and text in a single stack using a modality-aware sparse MoE, achieving lower WER than a larger attention-based encoder-decoder (AED) baseline without external speech encoders or pretrained large language models.
Contribution
It presents what the authors describe as the first randomly initialized decoder-only ASR model, built with modality-aware routing and sparse MoE, to surpass strong AED baselines in accuracy.
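The routing idea can be illustrated with a minimal sketch. It assumes that "modality-aware" means each token is first sent to the expert pool of its own modality (a hard decision based on a known speech/text tag) and that a learned router then picks one expert within that pool (top-1). The names `ModalityAwareMoE`, `make_expert`, and `make_router` are hypothetical, the experts are toy linear maps with fixed random weights, and the usual scaling of the expert output by the gate probability is omitted. It is not the authors' implementation.

```python
import random


def make_expert(dim, seed):
    """Hypothetical expert: one linear map dim -> dim with fixed random weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return expert


def make_router(dim, n_experts, seed):
    """Hypothetical router: linear scores over one pool, returns the arg-max index."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_experts)]

    def router(x):
        scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
        return max(range(n_experts), key=scores.__getitem__)  # top-1 selection

    return router


class ModalityAwareMoE:
    """Disjoint expert pools per modality, with top-1 routing inside each pool."""

    def __init__(self, dim, n_speech, n_text, seed=0):
        self.pools = {
            "speech": [make_expert(dim, seed + i) for i in range(n_speech)],
            "text": [make_expert(dim, seed + 1000 + i) for i in range(n_text)],
        }
        self.routers = {
            "speech": make_router(dim, n_speech, seed + 1),
            "text": make_router(dim, n_text, seed + 2),
        }

    def forward(self, tokens, modalities):
        outputs, choices = [], []
        for x, modality in zip(tokens, modalities):
            # Hard routing: the modality tag alone selects the pool.
            k = self.routers[modality](x)
            outputs.append(self.pools[modality][k](x))
            choices.append((modality, k))
        return outputs, choices


moe = ModalityAwareMoE(dim=4, n_speech=3, n_text=2)
tokens = [[0.1, 0.2, 0.3, 0.4], [1.0, -1.0, 0.5, 0.0], [0.0, 0.3, -0.2, 0.9]]
modalities = ["speech", "speech", "text"]
outputs, choices = moe.forward(tokens, modalities)
```

Because the pools are disjoint, a speech token can never reach a text expert, and only one expert runs per token, which is why the number of active parameters per token is smaller than the total.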
Findings
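Improves WER on LibriSpeech over a 139M-parameter AED baseline (2.8% vs. 3.2% on test-clean; 5.6% vs. 6.0% on test-other).
Reduces average WER on Common Voice 16.1 from 12.2% to 10.6% with a single multilingual model across five languages.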
Achieves better accuracy with fewer active parameters.
Abstract
We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLMs). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks (bidirectional for speech, causal for text). Training combines CTC on speech positions with label-smoothed cross-entropy for text generation. Our 113M-parameter model consistently improves WER over a 139M AED baseline on LibriSpeech (2.8% vs. 3.2% test-clean; 5.6% vs. 6.0% test-other). On Common Voice 16.1 with a single multilingual model across five languages, our approach reduces average WER from 12.2% to 10.6%. To our knowledge, this is the first randomly initialized decoder-only ASR that surpasses…
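The "hybrid-causality" attention pattern (bidirectional for speech, causal for text) can be sketched as a boolean mask. This is a minimal illustration under stated assumptions: the input is laid out as all speech frames followed by all text tokens, speech positions attend only to other speech positions, and each text position attends to every speech position plus text up to and including itself. The function name `hybrid_causality_mask` is hypothetical, and the sketch covers attention only, not the Conformer convolution module.

```python
def hybrid_causality_mask(n_speech, n_text):
    """Return mask[q][k] == True where query position q may attend to key position k.

    Assumed layout: positions 0..n_speech-1 are speech, the rest are text.
    """
    n = n_speech + n_text
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_speech:
                # Speech query: bidirectional over the speech span only.
                mask[q][k] = k < n_speech
            else:
                # Text query: all speech plus text at or before its own position.
                mask[q][k] = k <= q
    return mask


# Two speech frames followed by two text tokens.
for row in hybrid_causality_mask(2, 2):
    print("".join("x" if allowed else "." for allowed in row))
# prints:
# xx..
# xx..
# xxx.
# xxxx
```

With this layout, the speech span behaves like an encoder and the text span like an autoregressive decoder, even though both run through the same stack.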
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques
