DCIM-AVSR: Efficient Audio-Visual Speech Recognition via Dual Conformer Interaction Module

Xinyu Wang; Haotian Jiang; Haolin Huang; Yu Fang; Mengjie Xu; Qian Wang

arXiv:2409.00481 · eess.AS · January 9, 2025

Open Access

TL;DR

This paper introduces an efficient audio-visual speech recognition model that uses a Dual Conformer Interaction Module and a selective pre-training method to reduce parameters and improve performance in noisy environments.

Contribution

The paper presents a novel Dual Conformer Interaction Module and a pre-training strategy that together enhance AVSR efficiency and accuracy while reducing model complexity.

Findings

01. Significant reduction in model parameters without sacrificing accuracy.

02. Improved AVSR performance in noisy conditions.

03. Enhanced efficiency through selective parameter updating.

Abstract

Speech recognition is the technology that enables machines to interpret and process human speech, converting spoken language into text or commands. This technology is essential for applications such as virtual assistants, transcription services, and communication tools. The Audio-Visual Speech Recognition (AVSR) model enhances traditional speech recognition, particularly in noisy environments, by incorporating visual modalities like lip movements and facial expressions. While traditional AVSR models trained on large-scale datasets with numerous parameters can achieve remarkable accuracy, often surpassing human performance, they also come with high training costs and deployment challenges. To address these issues, we introduce an efficient AVSR model that reduces the number of parameters through the integration of a Dual Conformer Interaction Module (DCIM). In addition, we propose a…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing