Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization
Yihan Wu, Yichen Lu, Yifan Peng, Xihua Wang, Ruihua Song, Shinji Watanabe

TL;DR
This paper introduces BPO-AVASR, a novel optimization method that leverages input and output preferences to enhance audiovisual speech recognition accuracy in real-world scenarios, outperforming existing models.
Contribution
The paper proposes a bifocal preference optimization strategy that utilizes simulated error preferences on both input and output to improve AV-ASR models in unconstrained environments.
Findings
Significant accuracy improvements over state-of-the-art models
Effective handling of noisy and spontaneous speech scenarios
Robust performance across diverse real-world video datasets
Abstract
Audiovisual Automatic Speech Recognition (AV-ASR) aims to improve speech recognition accuracy by leveraging visual signals. It is particularly challenging in unconstrained real-world scenarios across various domains due to noisy acoustic environments, spontaneous speech, and the uncertain use of visual information. Most previous works fine-tune audio-only ASR models on audiovisual datasets, optimizing them for conventional ASR objectives. However, they often neglect visual features and common errors in unconstrained video scenarios. In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. First, we create preference data via simulating common errors that occurred in AV-ASR from two focals: manipulating the audio or vision input and rewriting the output transcript. Second, we propose BPO-AVASR, a Bifocal Preference…
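The two focals described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' implementation: the helper names (`add_noise`, `rewrite_transcript`, `dpo_style_loss`), the Gaussian-noise corruption, the word-confusion table, and the use of a DPO-style pairwise loss are all hypothetical, since the truncated abstract does not specify the exact objective.

```python
import math
import random


def add_noise(audio, scale, rng):
    """Input focal (hypothetical): corrupt audio samples with Gaussian noise.

    The corrupted input serves as the dispreferred side of an input-level pair.
    """
    return [x + rng.gauss(0.0, scale) for x in audio]


def rewrite_transcript(words, confusions):
    """Output focal (hypothetical): simulate recognition errors by swapping
    words for confusable alternatives from an assumed lookup table.
    """
    return [confusions.get(w, w) for w in words]


def dpo_style_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Assumed DPO-style loss for one preference pair.

    Computes -log(sigmoid(beta * margin)), where the margin compares how much
    the policy favors the chosen sample over the rejected one, relative to a
    frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return math.log1p(math.exp(-beta * margin))


# Build one pair per focal from a toy example.
rng = random.Random(0)
clean_audio = [0.0, 0.1, -0.2, 0.05]
reference = ["their", "cat", "sat"]

noisy_audio = add_noise(clean_audio, scale=0.5, rng=rng)        # rejected input
bad_transcript = rewrite_transcript(reference, {"their": "there"})  # rejected output

# Log-probabilities would come from the AV-ASR model; these are placeholders.
loss = dpo_style_loss(logp_chosen=-2.0, logp_rejected=-5.0,
                      ref_logp_chosen=-3.0, ref_logp_rejected=-3.0)
```

In this sketch the output-focal pair shares one input and differs in the transcript, while the input-focal pair shares one transcript and differs in the input. Whether the paper pairs samples exactly this way is not stated in the text above.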
Taxonomy
Topics: Speech and Audio Processing
