SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data

Yichen Lu; Jiaqi Song; Xuankai Chang; Hengwei Bian; Soumi Maiti; Shinji Watanabe

arXiv:2408.00624·eess.AS·August 2, 2024

TL;DR

SynesLM is a unified multimodal model for audio-visual speech recognition and visual-aided speech and machine translation. It uses general visual cues from entire frames together with synthetic image data, performs on par with dedicated AV-ASR models on How2, and sets a new zero-shot state of the art on VisSpeech.

Contribution

This work introduces SynesLM, a single model that handles three multimodal language tasks (AV-ASR, VST, and VMT) by combining general visual information from entire frames with synthetic image data.

Findings

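01  Achieved state-of-the-art zero-shot AV-ASR performance on the VisSpeech dataset, lowering WER from 43.4% to 39.4%.

02  Outperformed previous models in visual-aided speech translation (VST) and visual-aided machine translation (VMT).

03  Performed on par with state-of-the-art dedicated AV-ASR models on the How2 dataset while keeping a multitask framework.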

Abstract

In this work, we present SynesLM, a unified model that can perform three multimodal language understanding tasks: audio-visual automatic speech recognition (AV-ASR) and visual-aided speech/machine translation (VST/VMT). Unlike previous research that focused on lip motion as the visual cue for speech signals, our work explores more general visual information within entire frames, such as objects and actions. Additionally, we use synthetic image data to enhance the correlation between image and speech data. We benchmark SynesLM on the How2 dataset, demonstrating performance on par with state-of-the-art (SOTA) models dedicated to AV-ASR while maintaining our multitasking framework. Remarkably, for zero-shot AV-ASR, SynesLM achieves SOTA performance by lowering the Word Error Rate (WER) from 43.4% to 39.4% on the VisSpeech dataset. Furthermore, our results in VST and VMT outperform the…
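For readers unfamiliar with the metric quoted in the abstract, the sketch below shows one common way to compute WER (word-level edit distance divided by the number of reference words) and to express the reported drop from 43.4% to 39.4% as a relative reduction. The function names `word_error_rate` and `relative_reduction` are hypothetical helpers written for illustration; this is not the paper's or ESPnet's scoring code, and real evaluations typically apply text normalization that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Illustrative helper only: splits on whitespace, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline` (both in the same unit)."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    # Toy sentence pair (made up for illustration): one substitution in six words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    # WER figures quoted in the abstract for zero-shot AV-ASR on VisSpeech.
    print(f"{relative_reduction(43.4, 39.4):.1%}")
```

Applied to the abstract's numbers, the 4.0-point absolute drop corresponds to a relative WER reduction of roughly 9.2%.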

Code & Models

Repositories

espnet/espnet (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis