Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models

Umberto Cappellazzo; Xubo Liu; Pingchuan Ma; Stavros Petridis; Maja Pantic

arXiv:2511.07253 · eess.AS · January 28, 2026

TL;DR

Omni-AVSR introduces a unified multimodal speech recognition framework leveraging large language models, enabling efficient training and deployment across audio, visual, and combined modalities with competitive accuracy and robustness.

Contribution

The paper proposes Omni-AVSR, a novel unified LLM-based model for multimodal speech recognition that reduces training resources and enhances flexibility through multi-granularity training and parameter-efficient adaptation.

Findings

01. Achieves accuracy comparable to or better than state-of-the-art models.

02. Significantly reduces training and deployment resource use.

03. Maintains robustness under acoustic noise.

Abstract

Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models that raise computational and deployment resource use while missing potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy with efficiency. These limitations highlight the need for a unified framework that can support ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka…
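
The contrast the abstract draws between fixed-rate token compression and elastic inference can be illustrated with a toy sketch. The abstract does not say how tokens are compressed, so everything here is an assumption made for illustration: average pooling as the compression operator, the rates 1, 2, and 4, and the hypothetical names `compress_tokens` and `multi_granularity`. It is not the paper's method, only a picture of what serving one token sequence at several granularities might look like.

```python
from typing import Dict, List, Sequence

Token = List[float]  # one feature vector per audio or video frame (toy stand-in)


def compress_tokens(tokens: Sequence[Token], rate: int) -> List[Token]:
    """Average-pool each run of `rate` consecutive tokens into one token.

    Average pooling is an assumption; the abstract does not name the operator.
    """
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    pooled: List[Token] = []
    for start in range(0, len(tokens), rate):
        group = tokens[start:start + rate]  # the last group may be shorter
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def multi_granularity(tokens: Sequence[Token], rates: Sequence[int]) -> Dict[int, List[Token]]:
    """Build one compressed view of the same sequence per rate (hypothetical helper)."""
    return {rate: compress_tokens(tokens, rate) for rate in rates}


# Toy sequence of 8 two-dimensional tokens.
tokens = [[float(i), float(2 * i)] for i in range(8)]
views = multi_granularity(tokens, rates=(1, 2, 4))
lengths = {rate: len(seq) for rate, seq in views.items()}
print(lengths)
```

A fixed-rate system would commit to a single entry of `views` at training time. An elastic one could choose the rate at inference, trading a shorter sequence (less compute in the LLM) against coarser features.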

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing