Tailored Design of Audio-Visual Speech Recognition Models using Branchformers

David Gimeno-Gómez; Carlos-D. Martínez-Hinarejos

arXiv:2407.06606 · cs.CV · May 7, 2025

TL;DR

This paper introduces a flexible, parameter-efficient audio-visual speech recognition framework using Branchformer architectures, achieving state-of-the-art results with reduced complexity on English and Spanish benchmarks.

Contribution

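To the authors' knowledge, this is the first work to exploit the flexibility and interpretability of encoder architectures such as the Branchformer to design tailored AVSR systems. The resulting models are reported to reach state-of-the-art recognition accuracy with reduced model complexity.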

Findings

01. Achieved a word error rate (WER) of approximately 2.5% on the English AVSR benchmark.

02. Surpassed existing approaches for Spanish AVSR, establishing a new benchmark.

03. Reduced model complexity while maintaining state-of-the-art recognition rates.
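
For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. The sketch below only illustrates that definition; it is not the paper's evaluation code, and the function name word_error_rate and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: word-level edit distance / number of reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no text normalization), which real evaluation scripts often add.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a WER of about 2.5% corresponds to roughly 25 word errors per 1,000 reference words.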

Abstract

Recent advances in Audio-Visual Speech Recognition (AVSR) have led to unprecedented achievements in the field, improving the robustness of this type of system in adverse, noisy environments. In most cases, this task has been addressed through the design of models composed of two independent encoders, each dedicated to a specific modality. However, while recent works have explored unified audio-visual encoders, determining the optimal cross-modal architecture remains an ongoing challenge. Furthermore, such approaches often rely on models comprising vast amounts of parameters and high computational cost training processes. In this paper, we aim to bridge this research gap by introducing a novel audio-visual framework. Our proposed method constitutes, to the best of our knowledge, the first attempt to harness the flexibility and interpretability offered by encoder architectures, such as…

Code & Models

Repositories

david-gimeno/tailored-avsr
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing