Scalable Frameworks for Real-World Audio-Visual Speech Recognition

Sungnyun Kim

arXiv:2512.14083 · eess.AS · December 17, 2025

Open Access

TL;DR

This paper proposes a hierarchical, scalable framework for robust audio-visual speech recognition in real-world environments, addressing challenges at the representation, architecture, and system levels to improve performance amidst noise and interference.

Contribution

It introduces a comprehensive, multi-level approach to enhance AVSR systems' robustness and scalability for real-world deployment, integrating novel methods at each system level.

Findings

01 Unified audio-visual features improve noise robustness

02 Adaptive model capacity enhances scalability

03 Integration with foundation models boosts accuracy

Abstract

The practical deployment of Audio-Visual Speech Recognition (AVSR) systems is fundamentally challenged by significant performance degradation in real-world environments, which are characterized by unpredictable acoustic noise and visual interference. This dissertation posits that a systematic, hierarchical approach is essential to overcome these challenges, achieving robust scalability at the representation, architecture, and system levels. At the representation level, we investigate methods for building a unified model that learns audio-visual features inherently robust to diverse real-world corruptions, thereby enabling generalization to new environments without specialized modules. To address architectural scalability, we explore how to efficiently expand model capacity while ensuring the adaptive and reliable use of multimodal inputs, developing a framework that intelligently allocates…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis