CartoonSing: Unifying Human and Nonhuman Timbres in Singing Generation
Jionghao Han, Jiatong Shi, Zhuoyan Tao, Yuxun Tang, Yiwen Zhao, Gus Xia, Shinji Watanabe

TL;DR
CartoonSing is a unified machine learning framework that enables the generation of both human and non-human singing voices, addressing data scarcity and timbral gaps to expand creative possibilities in singing synthesis and conversion.
Contribution
It introduces Non-Human Singing Generation as a new task and proposes CartoonSing, a novel two-stage model that bridges human and non-human singing voice synthesis and conversion.
Findings
Successfully generates non-human singing voices
Generalizes to novel non-human timbres
Extends traditional SVS and SVC capabilities
Abstract
Singing voice synthesis (SVS) and singing voice conversion (SVC) have achieved remarkable progress in generating natural-sounding human singing. However, existing systems are restricted to human timbres and have limited ability to synthesize voices outside the human range, which are increasingly demanded in creative applications such as video games, movies, and virtual characters. We introduce Non-Human Singing Generation (NHSG), covering non-human singing voice synthesis (NHSVS) and non-human singing voice conversion (NHSVC), as a novel machine learning task for generating musically coherent singing with non-human timbral characteristics. NHSG is particularly challenging due to the scarcity of non-human singing data, the lack of symbolic alignment, and the wide timbral gap between human and non-human voices. To address these challenges, we propose CartoonSing, a unified framework that…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This paper formally defines NHSVS/NHSVC and unifies SVS/SVC for zero-shot generation beyond the human timbre manifold. 2. Comprehensive training and evaluation: broad datasets (22 singing, 10 non-human), mixed objective/subjective metrics, competitive baselines, detailed ablations on content/timbre representations, and reproducibility commitments with public audio demos.
1. This is a very new task, so baseline comparison is a bit meaningless. The baselines (VISinger 2 and SaMoye-SVC) are extremely weak under the authors' experimental setup, so I cannot tell how "good" the model is. The only way to perceive the model's performance is the demo page. However, even though the proposed model achieves the best SIM scores, the generated voices still sound very unpleasant. The authors may need to conceive a better metric to measure the quality of the g
1. A clear two-stage formulation that reduces reliance on non-human aligned data by training the score encoder on human singing only, then learning a unified vocoder over human + non-human audio. 2. The paper is generally well-organized and easy to follow.
1. The task defined in this paper is rather narrow and lacks clear real-world applications. The definition of “non-human” is also somewhat vague. Although the authors claim that the proposed task could apply to video games, movies, and virtual characters, the presented demos mainly target instrument-like timbres instead of genuinely non-human vocal styles. 2. The overall demo quality is poor—the generated sentences are short, the audio quality is low, and even the lyrics are often unintelligibl
This paper introduces an interesting new problem: Non-Human Singing Generation. To tackle this, the authors propose a unified framework for Singing Voice Synthesis (SVS) and Singing Voice Conversion (SVC), leveraging self-supervised learning (SSL) features to disentangle timbre, content, and F0, thereby enabling effective timbre transfer.
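The disentanglement idea summarized above can be illustrated with a minimal sketch. It is not the paper's architecture: the function name `build_conditioning`, the random codebook, the feature sizes, and the choice to concatenate the three streams frame by frame are all assumptions made for illustration. Random arrays stand in for the SSL content tokens and the timbre embedding.

```python
import numpy as np

def build_conditioning(content_tokens, f0_hz, timbre_embedding, codebook):
    """Hypothetical helper: stack content, pitch, and timbre into per-frame features.

    content_tokens   : (T,) integer token ids (stand-in for SSL content tokens)
    f0_hz            : (T,) fundamental frequency in Hz, 0 for unvoiced frames
    timbre_embedding : (D_t,) one utterance-level timbre vector
    codebook         : (V, D_c) lookup table mapping token ids to vectors
    """
    content_tokens = np.asarray(content_tokens)
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    if content_tokens.shape[0] != f0_hz.shape[0]:
        raise ValueError("content tokens and F0 must have the same number of frames")

    num_frames = content_tokens.shape[0]
    content = codebook[content_tokens]                    # (T, D_c)

    # Log-F0 on voiced frames, plus an explicit voiced/unvoiced flag.
    voiced = f0_hz > 0
    log_f0 = np.zeros_like(f0_hz)
    log_f0[voiced] = np.log(f0_hz[voiced])
    pitch = np.stack([log_f0, voiced.astype(np.float64)], axis=1)  # (T, 2)

    # The timbre vector is constant over time, so it is repeated for every frame.
    timbre = np.tile(np.asarray(timbre_embedding), (num_frames, 1))  # (T, D_t)

    return np.concatenate([content, pitch, timbre], axis=1)

# Toy data: 4 frames, 8-entry codebook of 4-dim vectors, 3-dim timbre vectors.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
tokens = [1, 1, 3, 7]
f0 = [220.0, 0.0, 246.9, 261.6]
source_timbre = rng.normal(size=3)
target_timbre = rng.normal(size=3)

feats_src = build_conditioning(tokens, f0, source_timbre, codebook)
feats_tgt = build_conditioning(tokens, f0, target_timbre, codebook)
```

In this toy setup, timbre transfer amounts to swapping `source_timbre` for `target_timbre` while the content and pitch columns stay unchanged. A real system would pass such features to a learned decoder or vocoder rather than using them directly.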
The disentanglement approach proposed in this paper has already been widely adopted in voice conversion (VC) and text-to-speech (TTS) systems, which weakens the novelty of the work. As a result, the contribution appears to be a relatively trivial extension of these established methods to the domain of non-human singing.
* The idea of extending singing synthesis beyond human vocal timbres is interesting and underexplored. * Combining content tokens, pitch estimation, and timbre embeddings is a straightforward and simple approach. * Some comparisons between timbre encoders (e.g., RawNet3 vs. CLAP) are included.
* The system is largely a recombination of existing components (ContentVec tokens, RawNet3 embeddings, BigVGAN-v2 vocoder). Claims of unification are weakened by the need for per-domain finetuning. * No confidence intervals or statistical tests are reported. * Baselines (VISinger2, SaMoye) are only trained on human vocals, while CartoonSing is finetuned on non-human data. Improvements in similarity are therefore confounded by data exposure rather than architecture. * RawNet3 is a speaker ver
Taxonomy
Topics: Music Technology and Sound Studies · Voice and Speech Disorders · Speech Recognition and Synthesis
