ViSpeR: Multilingual Audio-Visual Speech Recognition

Sanath Narayan; Yasser Abdelaziz Dahou Djilali; Ankit Singh; Eustache Le Bihan; Hakim Hacid

arXiv:2406.00038·cs.CL·June 4, 2024

Open Access

TL;DR

This paper introduces ViSpeR, an audio-visual speech recognition (AVSR) model trained in a multilingual setting on five languages: Chinese, Spanish, English, Arabic, and French. It reports competitive performance on newly established benchmarks for each language and releases datasets and models for future AVSR research.

Contribution

The paper presents a new multilingual AVSR model, ViSpeR, covering five languages, along with newly collected large-scale datasets for four of them (every language except English), and releases the code, datasets, and models to facilitate further research.

Findings

01. ViSpeR achieves competitive performance on newly established benchmarks for each of the five languages.

02. Large-scale datasets are collected and released for Chinese, Spanish, Arabic, and French.

03. Code, datasets, and models are publicly released to the community.

Abstract

This work presents an extensive and detailed study on Audio-Visual Speech Recognition (AVSR) for five widely spoken languages: Chinese, Spanish, English, Arabic, and French. We have collected large-scale datasets for each language except for English, and have engaged in the training of supervised learning models. Our model, ViSpeR, is trained in a multi-lingual setting, resulting in competitive performance on newly established benchmarks for each language. The datasets and models are released to the community with an aim to serve as a foundation for triggering and feeding further research work and exploration on Audio-Visual Speech Recognition, an increasingly important area of research. Code available at https://github.com/YasserdahouML/visper.

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing