Leveraging Self-Supervised Audio-Visual Pretrained Models to Improve Vocoded Speech Intelligibility in Cochlear Implant Simulation

Richard Lee Lai; Jen-Cheng Hou; I-Chun Chern; Kuo-Hsuan Hung; Yi-Ting Chen; Mandar Gogate; Tughrul Arslan; Amir Hussain; and Yu Tsao

arXiv:2307.07748·eess.AS·October 7, 2025·IEEE Trans. Biomed. Eng.

TL;DR

This paper introduces a novel self-supervised audio-visual speech enhancement framework that significantly improves vocoded speech intelligibility for cochlear implant simulations, especially with limited training data.

Contribution

The study proposes SSL-AVSE, a deep neural network combining visual cues and a Transformer-based SSL model to enhance speech intelligibility in cochlear implant simulations with limited data.

Findings

1. SSL-AVSE overcomes limited data issues using AV-HuBERT.
2. Significant PESQ and STOI improvements achieved.
3. Enhanced speech intelligibility in noisy environments for CI users.

Abstract

Individuals with hearing impairments face challenges in comprehending speech, particularly in noisy environments. The aim of this study is to explore the effectiveness of audio-visual speech enhancement (AVSE) in improving the intelligibility of vocoded speech in cochlear implant (CI) simulations. Notably, the study focuses on a challenging scenario in which only limited training data is available for the AVSE task. To address this problem, we propose a novel deep neural network framework termed Self-Supervised Learning-based AVSE (SSL-AVSE). The proposed SSL-AVSE combines visual cues, such as lip and mouth movements, from the target speakers with the corresponding audio signals. The contextually combined audio and visual data are then fed into a Transformer-based SSL AV-HuBERT model to extract features, which are further processed using a BLSTM-based SE model. The results…
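
The data flow described in the abstract (audio and visual inputs, a Transformer feature extractor, then a BLSTM-based SE model) can be sketched roughly as follows in PyTorch. This is a minimal illustration, not the authors' implementation: `StandInAVEncoder` is a hypothetical placeholder for AV-HuBERT (it does not load the real pretrained model), the mask-based output of `BLSTMEnhancer` is an assumption the abstract does not confirm, and all layer sizes and feature dimensions are made up. The sketch also assumes the lip features have already been aligned to the audio frame rate.

```python
import torch
import torch.nn as nn


class StandInAVEncoder(nn.Module):
    """Hypothetical placeholder for AV-HuBERT: fuses audio and lip features
    per frame and passes them through a small Transformer encoder."""

    def __init__(self, audio_dim=257, video_dim=512, feat_dim=768):
        super().__init__()
        self.fuse = nn.Linear(audio_dim + video_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, noisy_mag, lip_feats):
        # noisy_mag: (batch, frames, audio_dim)
        # lip_feats: (batch, frames, video_dim), assumed frame-aligned
        x = torch.cat([noisy_mag, lip_feats], dim=-1)
        return self.encoder(self.fuse(x))


class BLSTMEnhancer(nn.Module):
    """Assumed SE head: a BLSTM that predicts a [0, 1] mask applied to the
    noisy magnitude spectrogram (masking is an assumption, not from the paper)."""

    def __init__(self, feat_dim=768, spec_dim=257, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(
            feat_dim, hidden, num_layers=2,
            batch_first=True, bidirectional=True,
        )
        self.proj = nn.Linear(2 * hidden, spec_dim)

    def forward(self, av_feats, noisy_mag):
        h, _ = self.blstm(av_feats)          # (batch, frames, 2 * hidden)
        mask = torch.sigmoid(self.proj(h))   # (batch, frames, spec_dim)
        return mask * noisy_mag


def enhance(encoder, enhancer, noisy_mag, lip_feats):
    """Run the two stages in inference mode and return the enhanced magnitudes."""
    encoder.eval()
    enhancer.eval()
    with torch.no_grad():
        av_feats = encoder(noisy_mag, lip_feats)
        return enhancer(av_feats, noisy_mag)
```

With random tensors of shape `(batch, frames, 257)` for the noisy magnitudes and `(batch, frames, 512)` for the lip features, `enhance` returns a tensor with the same shape as the noisy input. In a CI simulation, the enhanced signal would then be passed through a vocoder before intelligibility is assessed; that stage is not shown here.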

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Subtitles and Audiovisual Media