F5-TTS-RO: Extending F5-TTS to Romanian TTS via Lightweight Input Adaptation
Radu-Gabriel Chivereanu, Tiberiu Boros

TL;DR
This paper presents a lightweight adaptation method for extending the F5-TTS text-to-speech model to support Romanian, preserving original capabilities while enabling natural Romanian speech synthesis with minimal retraining.
Contribution
Introduces a novel input-level adapter for F5-TTS that supports Romanian by adding a sub-network trained on Romanian text, keeping original weights frozen.
Findings
Maintains voice cloning capabilities in Romanian.
Enables code-switching between Romanian and English.
Achieves natural-sounding Romanian speech, though with a residual English accent.
Abstract
This work introduces a lightweight input-level adapter for the F5-TTS model that enables Romanian language support. To preserve the existing capabilities of the model (voice cloning, English and Chinese support), we keep the original weights frozen, append a sub-network to the model, and train it as an extension of the textual embedding matrix of the text encoder. For simplicity, we rely on the ConvNeXt module implemented in F5-TTS to also model the co-dependencies between the new character-level embeddings. The module serves as a "soft" letter-to-sound layer, converting Romanian text into a continuous representation that the F5-TTS model uses to produce natural-sounding Romanian utterances. We evaluate the model with a pool of 20 human listeners across three tasks: (a) audio similarity between reference and generated speech, (b) pronunciation and naturalness and (c) Romanian-English…
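The idea of extending an embedding matrix while keeping the pretrained weights frozen can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `ExtendedEmbedding`, its methods, the toy dimensions, and the plain SGD update are all hypothetical, and the ConvNeXt stage that the paper applies on top of the new embeddings is omitted. The sketch only shows that lookups for original token ids return the unchanged pretrained rows, while gradient updates touch only the newly added rows.

```python
import random


class ExtendedEmbedding:
    """Frozen base embedding table plus a trainable extension (illustrative only)."""

    def __init__(self, base_table, num_new, seed=0):
        rng = random.Random(seed)
        self.dim = len(base_table[0])
        # Base rows are stored as tuples so they cannot be modified in place.
        self.base = [tuple(row) for row in base_table]
        # New rows (hypothetically, Romanian-specific characters) are trainable.
        self.ext = [
            [rng.uniform(-0.1, 0.1) for _ in range(self.dim)]
            for _ in range(num_new)
        ]

    def lookup(self, token_id):
        """Return a copy of the embedding for an original or a new token id."""
        n = len(self.base)
        if token_id < n:
            return list(self.base[token_id])
        return list(self.ext[token_id - n])

    def sgd_step(self, token_id, grad, lr=0.1):
        """Apply one SGD step; ids in the frozen base vocabulary are skipped."""
        n = len(self.base)
        if token_id < n:
            return False  # frozen row: nothing is updated
        row = self.ext[token_id - n]
        for i, g in enumerate(grad):
            row[i] -= lr * g
        return True


base = [[1.0, 0.0], [0.0, 1.0]]           # stand-in for pretrained embeddings
emb = ExtendedEmbedding(base, num_new=3)  # ids 2..4 are the new symbols
emb.sgd_step(0, [1.0, 1.0])               # no effect: base row is frozen
emb.sgd_step(3, [1.0, 1.0])               # updates only an extension row
```

Because the base rows never change in this sketch, any behavior that depends only on the original vocabulary is unaffected by training the extension, which mirrors the motivation the abstract gives for freezing the original weights.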
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Phonetics and Phonology Research
