Residual Tokens Enhance Masked Autoencoders for Speech Modeling

Samir Sadok; Stéphane Lathuilière; Xavier Alameda-Pineda

arXiv:2601.19399 · cs.SD · January 28, 2026

Open Access

TL;DR

RT-MAE introduces residual trainable tokens to masked autoencoders, capturing nuanced speech features beyond explicit attributes, leading to improved reconstruction, expressivity, and effective speech enhancement.

Contribution

The paper proposes RT-MAE, a novel framework that combines supervised attributes with unsupervised residual tokens for richer speech modeling.

Findings

1. Improved speech reconstruction quality.
2. Enhanced expressivity and naturalness.
3. Effective noise removal in speech enhancement.

Abstract

Recent speech modeling relies on explicit attributes such as pitch, content, and speaker identity, but these alone cannot capture the full richness of natural speech. We introduce RT-MAE, a novel masked autoencoder framework that augments supervised attribute-based modeling with unsupervised residual trainable tokens, designed to encode the information not explained by explicit labeled factors (e.g., timbre variations, noise, and emotion). Experiments show that RT-MAE improves reconstruction quality, preserving content and speaker similarity while enhancing expressivity. We further demonstrate its applicability to speech enhancement, removing noise at inference while maintaining controllability and naturalness.
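The abstract describes the architecture only at a high level, so the following is a toy sketch rather than the authors' implementation. It illustrates one way masked attribute tokens might be combined with a small set of extra residual tokens before being passed to an encoder. The function name `build_encoder_input`, the symbolic token names, the mask ratio, and the choice to leave residual tokens unmasked are all assumptions made for illustration and are not taken from the paper.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a masked position


def build_encoder_input(attribute_tokens, residual_tokens, mask_ratio, rng):
    """Mask a fraction of the attribute tokens, then append the residual tokens.

    Assumption: residual tokens are never masked, so they remain available
    to carry information the attribute tokens do not explain.
    """
    n = len(attribute_tokens)
    n_masked = round(mask_ratio * n)
    # Pick which attribute positions to hide from the encoder.
    masked_idx = set(rng.sample(range(n), n_masked))
    masked = [
        MASK if i in masked_idx else tok
        for i, tok in enumerate(attribute_tokens)
    ]
    return masked + list(residual_tokens), sorted(masked_idx)


rng = random.Random(0)
attrs = ["pitch_0", "pitch_1", "content_0", "content_1", "speaker_0", "speaker_1"]
residual = ["res_0", "res_1"]
seq, idx = build_encoder_input(attrs, residual, 0.5, rng)
print(seq)
print(idx)
```

In a real model the symbolic strings would be embedding vectors and the residual tokens would be learned parameters; the sketch shows only how the input sequence is assembled.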

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis