Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR

Jaeyoung Lee; Masato Mimura

arXiv:2602.12546 · eess.AS · February 16, 2026

TL;DR

This paper introduces a decoder-only Conformer for ASR that processes speech and text in a single stack using a modality-aware sparse MoE. It achieves lower WER than an attention-based encoder-decoder (AED) baseline without external speech encoders or pretrained large language models.

Contribution

To the authors' knowledge, it presents the first randomly initialized decoder-only ASR model, built on modality-aware routing and a sparse MoE, that surpasses strong AED baselines in accuracy.
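To make the routing idea concrete, here is a minimal sketch of one possible reading of "modality-aware routing": the token's modality picks the expert pool (the hard, non-learned part), and a per-pool router then picks a single expert (top-1). The sizes, the `make_pool` and `moe_forward` helpers, and the elementwise-scaling "experts" are hypothetical stand-ins, not the authors' implementation. Gate weighting and load balancing are left out.

```python
import random

DIM, N_EXPERTS = 4, 2  # hypothetical sizes, not taken from the paper

def make_pool(seed):
    """Build a toy router and toy experts for one modality."""
    rng = random.Random(seed)
    # One router row per expert; a dot product with the token gives a logit.
    router = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]
    # Each toy "expert" is an elementwise scaling standing in for a feed-forward block.
    scales = [[rng.uniform(0.5, 1.5) for _ in range(DIM)] for _ in range(N_EXPERTS)]
    return {"router": router, "scales": scales}

# Disjoint pools: speech tokens never see text experts, and vice versa.
POOLS = {"speech": make_pool(0), "text": make_pool(1)}

def moe_forward(token, modality):
    """Route one token: modality selects the pool, argmax selects the expert."""
    pool = POOLS[modality]
    logits = [sum(w * x for w, x in zip(row, token)) for row in pool["router"]]
    best = max(range(N_EXPERTS), key=lambda i: logits[i])  # top-1 selection
    output = [s * x for s, x in zip(pool["scales"][best], token)]
    return modality, best, output

if __name__ == "__main__":
    token = [0.1, -0.2, 0.3, 0.4]
    for modality in ("speech", "text"):
        pool_name, expert_idx, out = moe_forward(token, modality)
        print(pool_name, expert_idx, out)
```

Only one expert runs per token in this sketch, so the compute per token stays at the cost of a single expert even though the layer holds several.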

Findings

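01. Improves WER on LibriSpeech over a 139M AED baseline: 2.8% vs. 3.2% on test-clean and 5.6% vs. 6.0% on test-other.

02. Reduces average WER on Common Voice 16.1 from 12.2% to 10.6% with a single multilingual model across five languages.

03. Achieves better accuracy with fewer active parameters than the AED baseline.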

Abstract

We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLMs). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks (bidirectional for speech, causal for text). Training combines CTC on speech positions with label-smoothed cross-entropy for text generation. Our 113M-parameter model consistently improves WER over a 139M AED baseline on LibriSpeech (2.8% vs. 3.2% test-clean; 5.6% vs. 6.0% test-other). On Common Voice 16.1 with a single multilingual model across five languages, our approach reduces average WER from 12.2% to 10.6%. To our knowledge, this is the first randomly initialized decoder-only ASR that surpasses…
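The hybrid-causality and label-smoothing ideas in the abstract can be sketched as follows, under stated assumptions. The sketch assumes the input is laid out as speech frames followed by text tokens, that speech positions attend only to speech, and that text positions attend to all speech plus earlier and current text. The abstract does not spell out these details, so `hybrid_mask` is one plausible reading rather than the authors' code. `label_smoothed_ce` uses the common formulation that spreads `eps` uniformly over the vocabulary. The CTC term and the weighting between the two losses are not shown because the abstract does not specify them.

```python
import math

def hybrid_mask(n_speech, n_text):
    """Return mask[q][k] == True where query position q may attend to key k.

    Assumed layout: positions [0, n_speech) are speech, the rest are text.
    """
    n = n_speech + n_text
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_speech:
                # Speech query: bidirectional over speech, blind to text.
                mask[q][k] = k < n_speech
            else:
                # Text query: all speech plus text up to and including q.
                mask[q][k] = k <= q
    return mask

def label_smoothed_ce(logits, target, eps=0.1):
    """Cross-entropy against a target smoothed uniformly over all classes."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]
    nll = -log_probs[target]                      # standard cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)    # cross-entropy vs. uniform
    return (1.0 - eps) * nll + eps * uniform

if __name__ == "__main__":
    for row in hybrid_mask(2, 2):
        print("".join("1" if allowed else "0" for allowed in row))
    # prints:
    # 1100
    # 1100
    # 1110
    # 1111
    print(label_smoothed_ce([2.0, 0.5, -1.0], target=0))
```

With `eps=0` the loss reduces to ordinary cross-entropy, which is a quick sanity check when wiring it into a training loop.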

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques