An Attribute Interpolation Method in Speech Synthesis by Model Merging
Masato Murata, Koichi Miyazaki, Tomoki Koriyama

TL;DR
This paper introduces a simple and effective attribute interpolation method in speech synthesis by merging trained models, enabling smooth control over speaker and emotion attributes without additional training.
Contribution
The paper proposes a novel model merging technique for attribute interpolation in speech synthesis that does not require specialized modules or retraining.
Findings
Achieved smooth attribute interpolation in speaker generation.
Successfully controlled emotion intensity through model merging.
Maintained linguistic content during attribute interpolation.
Abstract
With the development of speech synthesis, recent research has focused on challenging tasks, such as speaker generation and emotion intensity control. Attribute interpolation is a common approach to these tasks. However, most previous methods for attribute interpolation require specific modules or training methods. We propose an attribute interpolation method in speech synthesis by model merging. Model merging is a method that creates new parameters by only averaging the parameters of base models. The merged model can generate an output with an intermediate feature of the base models. This method is easily applicable without specific modules or training methods, as it uses only existing trained base models. We merged two text-to-speech models to achieve attribute interpolation and evaluated its performance on speaker generation and emotion intensity control tasks. As a result, our…
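The abstract describes model merging as creating new parameters by averaging the parameters of the base models. A minimal sketch of that idea follows. It is not the authors' implementation: the function name `merge_params`, the interpolation weight `alpha`, and the toy parameter dictionaries are assumptions made for illustration, and real text-to-speech checkpoints would hold tensors rather than short lists of floats.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_params(params_a: Params, params_b: Params, alpha: float) -> Params:
    """Linearly interpolate two parameter sets: (1 - alpha) * a + alpha * b.

    alpha = 0.5 is a plain average; other values in [0, 1] are an assumed
    way to move the merged model closer to one base model or the other.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if params_a.keys() != params_b.keys():
        raise ValueError("base models must share the same parameter names")

    merged: Params = {}
    for name, values_a in params_a.items():
        values_b = params_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise weighted average of the two base models' weights.
        merged[name] = [
            (1.0 - alpha) * a + alpha * b for a, b in zip(values_a, values_b)
        ]
    return merged


# Toy base models with identical architecture (same names, same sizes).
model_a = {"encoder.weight": [0.0, 2.0], "decoder.bias": [1.0]}
model_b = {"encoder.weight": [4.0, 6.0], "decoder.bias": [3.0]}

midpoint = merge_params(model_a, model_b, alpha=0.5)
print(midpoint)
```

The sketch requires both base models to share the same architecture, since parameters are matched by name and size. Sweeping `alpha` from 0 to 1 would, under these assumptions, correspond to moving between the attributes of the two base models.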
Taxonomy
Topics: Speech Recognition and Synthesis
