Au-M-ol: A Unified Model for Medical Audio and Language Understanding
Meizhu Liu, Nistha Mitra, Paul Li, Amine Abdaoui, Adam Ledyard, Tao Sheng

TL;DR
Au-M-ol is a new multimodal model that combines audio processing with large language models to enhance medical speech recognition and understanding, especially in challenging clinical environments.
Contribution
It introduces a novel architecture integrating audio encoders with LLMs for improved medical audio interpretation and robustness.
Findings
Reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks.
Performs well in noisy, domain-specific, and speaker-variable conditions.
Demonstrates potential for real-world clinical audio applications.
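For readers unfamiliar with the metric, the following is a minimal sketch of how Word Error Rate is conventionally computed (word-level edit distance divided by the number of reference words) and how a reduction between two systems can be expressed. The function names, the example sentences, and the numbers 0.25 and 0.11 are illustrative assumptions, not taken from the paper. The summary does not say whether the 56% figure is a relative or an absolute reduction; the sketch assumes relative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Fraction by which model_wer is lower than baseline_wer (assumed relative)."""
    return (baseline_wer - model_wer) / baseline_wer


# Illustrative values only: one substituted word out of four reference words.
example_wer = word_error_rate("the patient has hypertension",
                              "the patient has hypotension")
# Hypothetical baseline and model WERs chosen so the relative reduction is 56%.
example_reduction = relative_wer_reduction(0.25, 0.11)
```

Under the relative reading, a baseline WER of 0.25 dropping to 0.11 would count as a 56% reduction; under an absolute reading the same claim would mean a drop of 56 percentage points, which is a much stronger statement.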
Abstract
In this work, we present Au-M-ol, a novel multimodal architecture that extends Large Language Models (LLMs) with audio processing. It is designed to improve performance on clinically relevant tasks such as Automatic Speech Recognition (ASR). Au-M-ol has three main components: (1) an audio encoder that extracts rich acoustic features from medical speech, (2) an adaptation layer that maps audio features into the LLM input space, and (3) a pretrained LLM that performs transcription and clinical language understanding. This design allows the model to interpret spoken medical content directly, improving both accuracy and robustness. In experiments, Au-M-ol reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks. The model also performs well in challenging conditions, including noisy environments, domain-specific terminology, and speaker…
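The three-component design described in the abstract can be illustrated with a toy data-flow sketch. Everything below is a hypothetical stand-in: the summary does not specify the encoder, the form of the adaptation layer, the feature dimensions, or how audio and text inputs are combined. The sketch assumes a per-frame feature extractor, a single linear projection as the adaptation layer, and simple concatenation of projected audio embeddings with prompt embeddings before they would reach the LLM.

```python
import random


def encode_audio(waveform, frame_size):
    """Hypothetical stand-in for the audio encoder: one [mean, energy] vector per frame."""
    frames = [waveform[i:i + frame_size]
              for i in range(0, len(waveform) - frame_size + 1, frame_size)]
    return [[sum(f) / frame_size, sum(x * x for x in f) / frame_size]
            for f in frames]


def adapt(features, weights, bias):
    """Assumed adaptation layer: a linear map from audio feature width to LLM embedding width.

    weights has one row per output dimension, each row as long as an audio feature vector.
    """
    return [[sum(x * w for x, w in zip(feat, row)) + b
             for row, b in zip(weights, bias)]
            for feat in features]


def build_llm_input(audio_embeddings, prompt_embeddings):
    """Assumed fusion step: place audio embeddings ahead of the text prompt embeddings."""
    return audio_embeddings + prompt_embeddings


# Toy dimensions, chosen only for illustration.
D_AUDIO, D_LLM = 2, 8
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(D_AUDIO)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM

waveform = [rng.uniform(-1, 1) for _ in range(16)]      # 16 fake audio samples
features = encode_audio(waveform, frame_size=4)         # 4 frames x D_AUDIO
audio_embeddings = adapt(features, weights, bias)       # 4 frames x D_LLM
prompt_embeddings = [[0.0] * D_LLM for _ in range(3)]   # 3 placeholder prompt tokens
llm_input = build_llm_input(audio_embeddings, prompt_embeddings)  # 7 x D_LLM
```

The point of the sketch is the shape bookkeeping: the adaptation layer only has to make each audio frame look like a token embedding of the width the pretrained LLM expects, so the LLM itself can consume the combined sequence without changes to its input interface.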
