Vision-Integrated High-Quality Neural Speech Coding

Yao Guo; Yang Ai; Rui-Chen Zheng; Hui-Peng Du; Xiao-Hang Jiang; Zhen-Hua Ling

arXiv:2505.23379 · eess.AS · May 30, 2025

Open Access

TL;DR

This paper introduces a vision-integrated neural speech codec that leverages lip image features to improve speech quality and noise robustness without increasing bitrate.

Contribution

The paper presents VNSC, a neural speech coding framework that integrates visual features extracted from lip images, improving decoded speech quality and noise robustness relative to coding without visual information.

Findings

1. Improved speech quality with visual feature integration
2. Enhanced noise robustness in speech decoding
3. No increase in bitrate required for visual integration

Abstract

This paper proposes a novel vision-integrated neural speech codec (VNSC), which aims to enhance speech coding quality by leveraging visual modality information. In VNSC, the image analysis-synthesis module extracts visual features from lip images, while the feature fusion module facilitates interaction between the image analysis-synthesis module and the speech coding module, transmitting visual information to assist the speech coding process. Depending on whether visual information is available during the inference stage, the feature fusion module integrates visual features into the speech coding module using either explicit integration or implicit distillation strategies. Experimental results confirm that integrating visual information effectively improves the quality of the decoded speech and enhances the noise robustness of the neural speech codec, without increasing the bitrate.
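
The two fusion strategies named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the use of concatenation for explicit integration, the mean-squared-error form of the distillation loss, and the mapping of each strategy to an inference condition (explicit when lip images are available at inference, implicit distillation when they are not), which is only the natural reading of the abstract.

```python
from typing import List

Vector = List[float]


def select_strategy(visual_available_at_inference: bool) -> str:
    """Pick a fusion strategy (hypothetical mapping, assumed from the abstract)."""
    if visual_available_at_inference:
        return "explicit_integration"
    return "implicit_distillation"


def fuse_explicit(speech_feat: Vector, visual_feat: Vector) -> Vector:
    """Explicit integration, assumed here to be simple concatenation.

    The visual features are combined with the speech features directly,
    so lip images must be present whenever this function is called.
    """
    return speech_feat + visual_feat


def distillation_loss(student_feat: Vector, teacher_feat: Vector) -> float:
    """Implicit distillation, assumed here to be a mean squared error.

    During training, an audio-only "student" representation is pulled
    toward a vision-aware "teacher" representation, so no lip images
    are needed once training is finished.
    """
    if len(student_feat) != len(teacher_feat):
        raise ValueError("student and teacher features must have equal length")
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    speech = [0.5, -1.0]   # toy speech features
    visual = [0.25]        # toy lip-image features
    print(select_strategy(True), fuse_explicit(speech, visual))
    print(select_strategy(False), distillation_loss([1.0, 2.0], [1.0, 4.0]))
```

In this sketch neither strategy adds anything to the transmitted representation, which is one way the "without increasing the bitrate" claim could hold; the paper's actual mechanism may differ.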

Taxonomy

Topics: Speech and Audio Processing · Advanced Data Compression Techniques · Face recognition and analysis