Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2

Chun Xu; En-Wei Sun

arXiv:2407.14212·cs.SD·July 22, 2024

TL;DR

This paper presents a Chinese Braille-to-speech framework built by jointly fine-tuning CLIP and Fastspeech2, which improves speech synthesis quality and efficiency for visually impaired users when training data is limited.

Contribution

It introduces a novel Chinese Braille-to-speech model with joint fine-tuning of CLIP and Fastspeech2, addressing data limitations and language applicability issues.

Findings

01. Improved BLEU4, FAD, WER metrics on multiple datasets.

02. High-quality speech synthesis with limited data.

03. Effective joint training strategy validated.

Abstract

A growing number of people in China live with some degree of visual impairment, which has made cross-modal conversion between a single image or video frame in the visual field and audio conveying the same information an active research topic. Deep learning approaches such as OCR+Vocoder and Im2Wav enable English audio synthesis or image-to-sound matching in a self-supervised manner. However, the audio data available for training is limited, and English is not accessible to all visually impaired people, whose educational levels vary. To address these problems of data volume and language applicability, and so improve the reading efficiency of visually impaired people, an image-to-speech framework for the Chinese context, CLIP-KNN-Fastspeech2, was constructed. The framework integrates multiple basic models and adopts the strategy of independent pre-training…

Taxonomy

Topics: Speech and Audio Processing · Tactile and Sensory Interactions

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training