Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2
Chun Xu, En-Wei Sun

TL;DR
This paper presents a Chinese Braille-to-speech framework using joint fine-tuning of CLIP and Fastspeech2, improving speech synthesis quality and efficiency for visually impaired users with limited data.
Contribution
It introduces a novel Chinese Braille-to-speech model with joint fine-tuning of CLIP and Fastspeech2, addressing data limitations and language applicability issues.
Findings
Improved BLEU4, FAD, and WER scores on multiple datasets.
High-quality speech synthesis with limited data.
Effective joint training strategy validated.
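Of the metrics listed above, WER (word error rate) is the simplest to compute by hand: it is the edit distance between the reference and recognized token sequences, divided by the reference length. The sketch below is a minimal illustration, not the paper's evaluation code. The function name `word_error_rate` and the choice of tokens are assumptions; for Chinese text, tokens are often individual characters rather than space-separated words.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Levenshtein distance between token sequences, divided by reference length.

    Hypothetical helper for illustration; the tokenization is up to the caller.
    """
    if len(reference) == 0:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference tokens seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]  # deleting all i reference tokens
        for j, hyp_tok in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_tok == hyp_tok else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)


# Word-level tokens for English, character-level tokens for Chinese (assumed).
english = word_error_rate("the cat sat on the mat".split(),
                          "the cat sat on mat".split())
chinese = word_error_rate(list("今天天气很好"), list("今天天气真好"))
```

Here one dropped word out of six and one substituted character out of six both give a rate of 1/6. A lower WER means the synthesized speech was transcribed back more accurately.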
Abstract
A growing number of people in China live with some degree of visual impairment, which has made modality conversion between a single image or video frame in the visual field and audio expressing the same information a research hotspot. Deep learning approaches such as OCR+Vocoder and Im2Wav enable English audio synthesis or image-to-sound matching in a self-supervised manner. However, the audio data available for training is limited, and English is not accessible to all visually impaired people, whose educational levels vary. To address the problems of data volume and language applicability, and thereby improve the reading efficiency of visually impaired people, an image-to-speech framework for the Chinese context, CLIP-KNN-Fastspeech2, was constructed. The framework integrates multiple basic models and adopts the strategy of independent pre-training…
Taxonomy
Topics: Speech and Audio Processing · Tactile and Sensory Interactions
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
