Scalable Frameworks for Real-World Audio-Visual Speech Recognition
Sungnyun Kim

TL;DR
This dissertation proposes a hierarchical, scalable framework for robust audio-visual speech recognition in real-world environments, addressing challenges at the representation, architecture, and system levels to improve performance under acoustic noise and visual interference.
Contribution
It introduces a multi-level approach that improves the robustness and scalability of AVSR systems for real-world deployment, with methods at the representation, architecture, and system levels.
Findings
Unified audio-visual features improve noise robustness
Adaptive model capacity enhances scalability
Integration with foundation models boosts accuracy
Abstract
The practical deployment of Audio-Visual Speech Recognition (AVSR) systems is fundamentally challenged by significant performance degradation in real-world environments, characterized by unpredictable acoustic noise and visual interference. This dissertation posits that a systematic, hierarchical approach is essential to overcome these challenges, achieving robust scalability at the representation, architecture, and system levels. At the representation level, we investigate methods for building a unified model that learns audio-visual features inherently robust to diverse real-world corruptions, thereby enabling generalization to new environments without specialized modules. To address architectural scalability, we explore how to efficiently expand model capacity while ensuring the adaptive and reliable use of multimodal inputs, developing a framework that intelligently allocates…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
