Residual Tokens Enhance Masked Autoencoders for Speech Modeling
Samir Sadok, Stéphane Lathuilière, Xavier Alameda-Pineda

TL;DR
RT-MAE introduces residual trainable tokens to masked autoencoders, capturing nuanced speech features beyond explicit attributes, leading to improved reconstruction, expressivity, and effective speech enhancement.
Contribution
The paper proposes RT-MAE, a novel framework that combines supervised attributes with unsupervised residual tokens for richer speech modeling.
Findings
Improved speech reconstruction quality.
Enhanced expressivity and naturalness.
Effective noise removal in speech enhancement.
Abstract
Recent speech modeling relies on explicit attributes such as pitch, content, and speaker identity, but these alone cannot capture the full richness of natural speech. We introduce RT-MAE, a novel masked autoencoder framework that augments supervised attribute-based modeling with unsupervised residual trainable tokens, designed to encode the information not explained by the explicit labeled factors (e.g., timbre variations, noise, emotion). Experiments show that RT-MAE improves reconstruction quality, preserving content and speaker similarity while enhancing expressivity. We further demonstrate its applicability to speech enhancement, removing noise at inference while maintaining controllability and naturalness.
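The abstract does not specify the architecture, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that attribute embeddings are partially replaced by a mask token and that a small set of residual tokens is appended to the sequence before encoding. The function name `build_mae_input`, the shapes, and the mask ratio are all hypothetical, and the residual tokens are plain arrays here rather than trained parameters.

```python
import numpy as np

def build_mae_input(attr_tokens, residual_tokens, mask_token, mask_ratio, rng):
    """Mask a fraction of attribute tokens, then append residual tokens.

    attr_tokens:     (T, D) embeddings of explicit attributes (hypothetical)
    residual_tokens: (K, D) residual tokens (trainable in a real model)
    mask_token:      (D,)   embedding substituted at masked positions
    Returns the (T + K, D) input sequence and the boolean mask of length T.
    """
    T, _ = attr_tokens.shape
    n_masked = int(round(mask_ratio * T))

    # Pick distinct positions to mask.
    masked_idx = rng.choice(T, size=n_masked, replace=False)
    mask = np.zeros(T, dtype=bool)
    mask[masked_idx] = True

    # Replace masked rows by the mask token; keep the others unchanged.
    masked_attrs = np.where(mask[:, None], mask_token[None, :], attr_tokens)

    # Residual tokens are never masked in this sketch (an assumption).
    sequence = np.concatenate([masked_attrs, residual_tokens], axis=0)
    return sequence, mask

rng = np.random.default_rng(0)
attr = rng.normal(size=(10, 4))   # T=10 attribute tokens, D=4
res = rng.normal(size=(3, 4))     # K=3 residual tokens
mask_tok = np.zeros(4)
seq, mask = build_mae_input(attr, res, mask_tok, 0.5, rng)
```

In a full model, `seq` would be passed to a transformer encoder and the decoder would be trained to reconstruct the speech representation at the masked positions; removing or replacing the residual tokens at inference is one conceivable way to discard unlabeled factors such as noise, though the abstract does not confirm that mechanism.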
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis
