Au-M-ol: A Unified Model for Medical Audio and Language Understanding

Meizhu Liu; Nistha Mitra; Paul Li; Amine Abdaoui; Adam Ledyard; Tao Sheng

arXiv:2604.23284·cs.CL·April 28, 2026

TL;DR

Au-M-ol is a new multimodal model that combines audio processing with large language models to enhance medical speech recognition and understanding, especially in challenging clinical environments.

Contribution

The paper introduces an architecture that couples an audio encoder to a pretrained LLM through an adaptation layer, with the goal of more accurate and more robust interpretation of medical audio.

Findings

01  Reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks.

02  Performs well under noisy conditions, domain-specific terminology, and speaker variability.

03  Demonstrates potential for real-world clinical audio applications.
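
As a reminder of what the headline metric measures, the following is a minimal sketch of WER as word-level edit distance divided by reference length, plus a helper for a relative reduction. The function names and the numeric values are illustrative assumptions. The page does not report baseline WER values, and it does not say whether the 56% figure is relative or absolute; the helper simply shows what a relative reading would mean.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Fraction by which model_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - model_wer) / baseline_wer


# One substituted word out of four reference words gives a WER of 0.25.
example_wer = word_error_rate(
    "patient denies chest pain",
    "patient denies chess pain",
)

# Hypothetical numbers, NOT from the paper: a baseline at 25% WER and a
# model at 11% WER would correspond to a relative reduction of about 56%.
example_reduction = relative_wer_reduction(0.25, 0.11)
```

Under the relative reading, a 56% reduction says nothing about the absolute error level, so the baseline WER is still needed to judge how usable the transcripts are.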

Abstract

In this work, we present Au-M-ol, a novel multimodal architecture that extends Large Language Models (LLMs) with audio processing. It is designed to improve performance on clinically relevant tasks such as Automatic Speech Recognition (ASR). Au-M-ol has three main components: (1) an audio encoder that extracts rich acoustic features from medical speech, (2) an adaptation layer that maps audio features into the LLM input space, and (3) a pretrained LLM that performs transcription and clinical language understanding. This design allows the model to interpret spoken medical content directly, improving both accuracy and robustness. In experiments, Au-M-ol reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks. The model also performs well in challenging conditions, including noisy environments, domain-specific terminology, and speaker…
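
The role of the adaptation layer described in the abstract can be illustrated with a small, dependency-free sketch. The abstract does not specify the layer's design, so everything here is an assumption made for illustration: the frame-stacking step, the single linear projection, the dimensions, and all function names are hypothetical and are not taken from the paper. The sketch only shows the general idea of turning encoder outputs into vectors of the LLM's embedding size and placing them alongside prompt embeddings.

```python
import random


def make_linear(in_dim: int, out_dim: int, seed: int = 0) -> list[list[float]]:
    """Random weight matrix of shape (out_dim, in_dim); stands in for trained weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def stack_frames(frames: list[list[float]], k: int) -> list[list[float]]:
    """Concatenate every k consecutive frames to shorten the sequence.

    Trailing frames that do not fill a complete group of k are dropped.
    (Assumed design choice, not described in the abstract.)
    """
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


def project(frames: list[list[float]], weight: list[list[float]]) -> list[list[float]]:
    """Apply a linear map to each frame: output[t] = weight @ frames[t]."""
    return [
        [sum(w * x for w, x in zip(row, frame)) for row in weight]
        for frame in frames
    ]


def build_llm_inputs(
    audio_features: list[list[float]],
    prompt_embeddings: list[list[float]],
    weight: list[list[float]],
    k: int,
) -> list[list[float]]:
    """Map audio features into the LLM embedding size and prepend them to the prompt."""
    audio_tokens = project(stack_frames(audio_features, k), weight)
    return audio_tokens + prompt_embeddings


# Toy dimensions, chosen only for the example.
AUDIO_DIM, LLM_DIM, STACK = 4, 6, 2

# Stand-ins for (1) audio encoder output and (3) the LLM's prompt embeddings.
audio_features = [[float(t)] * AUDIO_DIM for t in range(10)]
prompt_embeddings = [[0.0] * LLM_DIM for _ in range(3)]

# (2) The adaptation layer: stacked frames of size AUDIO_DIM * STACK -> LLM_DIM.
adapter_weight = make_linear(AUDIO_DIM * STACK, LLM_DIM)

llm_inputs = build_llm_inputs(audio_features, prompt_embeddings, adapter_weight, STACK)
```

With these toy sizes, ten audio frames become five projected vectors, and the LLM would receive eight input vectors in total, each of length six. A real system would train the projection and would likely use a tensor library rather than nested lists.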
