Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization

Yihan Wu; Yichen Lu; Yifan Peng; Xihua Wang; Ruihua Song; Shinji Watanabe

arXiv:2412.19005·eess.AS·December 30, 2024

TL;DR

This paper introduces BPO-AVASR, a novel optimization method that leverages input and output preferences to enhance audiovisual speech recognition accuracy in real-world scenarios, outperforming existing models.

Contribution

The paper proposes a bifocal preference optimization strategy that utilizes simulated error preferences on both input and output to improve AV-ASR models in unconstrained environments.

Findings

01. Significant accuracy improvements over state-of-the-art models
02. Effective handling of noisy and spontaneous speech scenarios
03. Robust performance across diverse real-world video datasets

Abstract

Audiovisual Automatic Speech Recognition (AV-ASR) aims to improve speech recognition accuracy by leveraging visual signals. It is particularly challenging in unconstrained real-world scenarios across various domains due to noisy acoustic environments, spontaneous speech, and the uncertain use of visual information. Most previous works fine-tune audio-only ASR models on audiovisual datasets, optimizing them for conventional ASR objectives. However, they often neglect visual features and common errors in unconstrained video scenarios. In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. First, we create preference data via simulating common errors that occurred in AV-ASR from two focals: manipulating the audio or vision input and rewriting the output transcript. Second, we propose BPO-AVASR, a Bifocal Preference…
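The two "focals" named in the abstract can be illustrated with a minimal sketch of how preference pairs might be assembled. Everything here is an assumption made for illustration and is not taken from the paper: the `PreferencePair` container, the helper names, the word-drop rewrite, the zeroed-span audio mask, and the way chosen and rejected sides are paired. Only the audio modality is shown; a vision manipulation would follow the same pattern.

```python
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical container for one preferred/dispreferred example."""
    focus: str      # "input" or "output"
    chosen: dict    # preferred (input, transcript) combination
    rejected: dict  # dispreferred combination carrying a simulated error


def corrupt_transcript(text, rng):
    """Illustrative output-side error: delete one randomly chosen word."""
    words = text.split()
    if len(words) < 2:
        raise ValueError("need at least two words to drop one")
    i = rng.randrange(len(words))
    return " ".join(words[:i] + words[i + 1:])


def mask_audio(samples, rng, frac=0.3):
    """Illustrative input-side manipulation: zero out one contiguous span."""
    if not samples:
        return []
    span = max(1, int(len(samples) * frac))
    start = rng.randrange(len(samples) - span + 1)
    return samples[:start] + [0.0] * span + samples[start + span:]


def build_pairs(audio, transcript, seed=0):
    """Build one output-focused and one input-focused preference pair."""
    rng = random.Random(seed)
    clean = {"audio": audio, "transcript": transcript}
    output_pair = PreferencePair(
        focus="output",
        chosen=clean,
        # Same input, transcript rewritten to contain an error.
        rejected={"audio": audio,
                  "transcript": corrupt_transcript(transcript, rng)},
    )
    input_pair = PreferencePair(
        focus="input",
        chosen=clean,
        # Same transcript, input degraded.
        rejected={"audio": mask_audio(audio, rng), "transcript": transcript},
    )
    return [output_pair, input_pair]


if __name__ == "__main__":
    for pair in build_pairs([0.1] * 10, "turn left at the bridge", seed=1):
        print(pair.focus, pair.rejected)
```

Pairs of this shape could then be passed to a preference-based training objective; the abstract as shown here is truncated before it describes the objective, so none is sketched.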

Code & Models

Repositories

espnet/espnet (PyTorch, official)

Videos

Enhancing Audiovisual Speech Recognition Through Bifocal Preference Optimization · Underline

Taxonomy

Topics: Speech and Audio Processing