Self-supervised restoration of singing voice degraded by pitch shifting using shallow diffusion

Yunyi Liu; Taketo Akama

arXiv:2601.10345·cs.SD·January 16, 2026

TL;DR

This paper introduces a self-supervised shallow diffusion model that restores singing voice after pitch shifting, reducing artifacts and preserving a natural sound relative to conventional signal-processing methods.

Contribution

The paper presents a self-supervised learning approach that uses a lightweight diffusion model in mel space for artifact-resistant pitch shifting of singing voices.

Findings

01 Significantly reduces pitch-shift artifacts compared to classical baselines.

02 Uses self-supervised training with pitch-shift reversal to simulate realistic artifacts.

03 Achieves natural sound restoration while preserving melody and timing.

Abstract

Pitch shifting has been an essential feature in singing voice production. However, conventional signal processing approaches exhibit well-known trade-offs, such as formant shifts and robotic coloration, that become more severe at larger transposition jumps. This paper targets high-quality pitch shifting for singing by reframing it as a restoration problem: given an audio track that has been pitch-shifted (and thus contaminated by artifacts), we recover a natural-sounding performance while preserving its melody and timing. Specifically, we use a lightweight, mel-space diffusion model driven by frame-level acoustic features such as f0, volume, and content features. We construct training pairs in a self-supervised manner by applying pitch shifts and reversing them to simulate realistic artifacts while retaining ground truth. On a curated singing set, the proposed approach substantially…
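
The self-supervised pair construction described in the abstract (shift, then reverse the shift, and keep the untouched recording as ground truth) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `resample_linear` and `make_training_pair` are hypothetical, and plain linear-interpolation resampling stands in for a real pitch shifter. Unlike a real shifter, it changes duration during the forward step, so it only illustrates the round-trip idea on a toy signal.

```python
import numpy as np


def resample_linear(audio: np.ndarray, n_out: int) -> np.ndarray:
    """Resample a 1-D signal to n_out samples by linear interpolation."""
    src_positions = np.linspace(0.0, len(audio) - 1, n_out)
    return np.interp(src_positions, np.arange(len(audio)), audio)


def make_training_pair(clean: np.ndarray, semitones: float):
    """Return (degraded, clean): a round-trip-shifted input and its target.

    ASSUMPTION: naive resampling replaces the real pitch shifter, so the
    artifacts here are interpolation and aliasing errors only.
    """
    ratio = 2.0 ** (semitones / 12.0)  # frequency ratio for the shift
    n_mid = max(2, int(round(len(clean) / ratio)))
    # Forward step: fewer samples at the same rate sound higher (semitones > 0).
    shifted = resample_linear(clean, n_mid)
    # Reverse step: return to the original length, keeping the damage.
    degraded = resample_linear(shifted, len(clean))
    return degraded, clean


# Toy signal: one second of a 220 Hz tone with five harmonics at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
clean = sum(np.sin(2 * np.pi * 220 * k * t) / k for k in range(1, 6))

degraded, target = make_training_pair(clean, semitones=7)
residual_rms = float(np.sqrt(np.mean((degraded - target) ** 2)))
print(f"round-trip residual RMS: {residual_rms:.4f}")
```

Because the target is the original recording itself, no manual labels are needed: the degraded and clean signals are aligned sample for sample, so melody and timing are shared by construction.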

Taxonomy

Topics: Music and Audio Processing · Voice and Speech Disorders · Speech and Audio Processing