Tailored Design of Audio-Visual Speech Recognition Models using Branchformers
David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos

TL;DR
This paper introduces a flexible, parameter-efficient audio-visual speech recognition framework using Branchformer architectures, achieving state-of-the-art results with reduced complexity on English and Spanish benchmarks.
Contribution
To the best of the authors' knowledge, it is the first work to use flexible encoder architectures such as Branchformer to design tailored, interpretable AVSR systems that outperform existing models in accuracy and efficiency.
Findings
Achieved a word error rate (WER) of approximately 2.5% on the English AVSR benchmark.
Surpassed existing approaches for Spanish AVSR, establishing a new benchmark.
Reduced model complexity while maintaining state-of-the-art recognition rates.
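For readers unfamiliar with the metric, WER is conventionally defined as the number of word-level substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of reference words. A WER of about 2.5% therefore corresponds to roughly one word error per 40 reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate` and the example sentences are illustrative assumptions, and this summary does not state which scoring tool or text normalization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (not from the paper): whitespace tokenization,
    no text normalization, unit cost for every edit operation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion of a reference word
                curr[j - 1] + 1,                   # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.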
Abstract
Recent advances in Audio-Visual Speech Recognition (AVSR) have led to unprecedented achievements in the field, improving the robustness of this type of system in adverse, noisy environments. In most cases, this task has been addressed through the design of models composed of two independent encoders, each dedicated to a specific modality. However, while recent works have explored unified audio-visual encoders, determining the optimal cross-modal architecture remains an ongoing challenge. Furthermore, such approaches often rely on models comprising vast amounts of parameters and high computational cost training processes. In this paper, we aim to bridge this research gap by introducing a novel audio-visual framework. Our proposed method constitutes, to the best of our knowledge, the first attempt to harness the flexibility and interpretability offered by encoder architectures, such as…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
