Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion
Yi-Syuan Liou, Wen-Chin Huang, Ming-Chi Yen, Shu-Wei Tsai, Yu-Huai Peng, Tomoki Toda, Yu Tsao, Hsin-Min Wang

TL;DR
This paper introduces a novel lip image-based method for time alignment in electrolaryngeal voice conversion, outperforming traditional audio-only methods by leveraging lip movements to improve speech quality.
Contribution
The study proposes using lip images for time alignment in voice conversion, addressing limitations of audio-only DTW methods for electrolaryngeal speech.
Findings
Lip image-based alignment outperforms audio-only DTW in objective metrics.
Subjective evaluations favor lip-based alignment for naturalness.
Proposed method is robust to EL speech characteristics.
Abstract
Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice produced by an electrolarynx device. In frame-based VC methods, time alignment must be performed before model training, and the dynamic time warping (DTW) algorithm is widely adopted to compute the best time alignment between each utterance pair. The validity of this alignment rests on the assumption that the same phonemes from different speakers have similar features and can be mapped by measuring a pre-defined distance between speech frames of the source and the target. However, the special characteristics of EL speech can break this assumption, resulting in a sub-optimal DTW alignment. In this work, we propose to use lip images for time alignment, as we assume that the lip movements of laryngectomees remain normal compared to those of healthy people. We…
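The DTW step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dtw_path`, the Euclidean frame distance, and the unconstrained step pattern are all assumptions made for illustration. The function accepts any per-frame feature sequence, so the same routine could in principle be run on lip-image features instead of acoustic features, assuming the two streams share a frame rate.

```python
import numpy as np


def dtw_path(source, target):
    """Align two feature sequences of shape (frames, dims) with plain DTW.

    Hypothetical helper for illustration; returns (total_cost, path), where
    path is a list of (source_frame, target_frame) index pairs.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    n, m = len(source), len(target)

    # Assumed local distance: Euclidean distance between every frame pair.
    dist = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=-1)

    # acc[i, j] = minimal cost of aligning the first i source frames
    # with the first j target frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = dist[i - 1, j - 1] + best_prev

    # Backtrack from the end of both sequences to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),  # diagonal step
            (acc[i - 1, j], i - 1, j),          # advance source only
            (acc[i, j - 1], i, j - 1),          # advance target only
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return float(acc[n, m]), path


if __name__ == "__main__":
    # Toy 1-D "features": the target holds the first frame twice.
    src = [[0.0], [1.0], [2.0]]
    tgt = [[0.0], [0.0], [1.0], [2.0]]
    cost, path = dtw_path(src, tgt)
    print(cost, path)
```

The returned path lists which source frame is paired with which target frame. In a frame-based VC pipeline, those index pairs would select the training pairs, so a poor local distance (the failure mode the abstract attributes to EL speech) leads directly to mismatched training frames.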
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Dynamic Time Warping
