Patient-Level Multimodal Question Answering from Multi-Site Auscultation Recordings

Fan Wu; Tsai-Ning Wang; Nicolas Zumarraga; Ning Wang; Markus Kreft; Kevin O'Sullivan; Elgar Fleisch; Oliver Aalami; Paul Schmiedmayer; Robert Jakob; Patrick Langer

arXiv:2603.13362 · cs.SD · March 17, 2026

Open Access

TL;DR

This paper introduces a novel framework for patient-level multimodal auscultation analysis that aligns multi-site recordings with large language models, achieving state-of-the-art results and enabling holistic clinical assessment.

Contribution

The study aligns multi-site auscultation recordings with a frozen LLM embedding space using gated cross-attention, moving from isolated per-recording classification toward holistic, patient-level assessment.

Findings

01 Achieves 0.865 F1-macro and 0.952 BERTScore on the CaReSound benchmark.

02 Lightweight, domain-specific encoders perform comparably to large-scale ALMs.

03 Multi-site aggregation provides spatial redundancy that mitigates temporal truncation.
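
For readers unfamiliar with the headline metric, F1-macro is the unweighted mean of the per-class F1 scores, so rare classes count as much as common ones. The sketch below computes it in plain Python. The function name `macro_f1` and the labels `"normal"` / `"abnormal"` are illustrative assumptions and are not taken from the paper or from CaReSound.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)  # true positives for class c
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only.
y_true = ["normal", "normal", "abnormal", "abnormal"]
y_pred = ["normal", "abnormal", "abnormal", "abnormal"]
score = macro_f1(y_true, y_pred)  # per-class F1: normal 2/3, abnormal 4/5
```

On the toy labels the two per-class scores are 2/3 and 4/5, so the macro average is 11/15.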

Abstract

Auscultation is a vital diagnostic tool, yet its utility is often limited by subjective interpretation. While general-purpose Audio-Language Models (ALMs) excel in general domains, they struggle with the nuances of physiological signals. We propose a framework that aligns multi-site auscultation recordings directly with a frozen Large Language Model (LLM) embedding space via gated cross-attention. By leveraging the LLM's latent world knowledge, our approach moves beyond isolated classification toward holistic, patient-level assessment. On the CaReSound benchmark, our model achieves a state-of-the-art 0.865 F1-macro and 0.952 BERTScore. We demonstrate that lightweight, domain-specific encoders rival large-scale ALMs and that multi-site aggregation provides spatial redundancy that mitigates temporal truncation. This alignment of medical acoustics with text foundations offers a scalable…
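
The abstract names gated cross-attention as the alignment mechanism but this page gives no further detail. The following is a minimal sketch of the general idea, not the authors' implementation. It assumes a Flamingo-style `tanh` gate on a residual connection, a single attention head without learned projections, and multi-site aggregation by simple concatenation of per-site audio tokens. All function and parameter names (`gated_cross_attention`, `site_features`, `alpha`) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Single-head scaled dot-product attention.

    Simplification: keys and values are the memory vectors themselves,
    with no learned projection matrices.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in memory]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, memory)) for j in range(d)])
    return out


def gated_cross_attention(text_states, site_features, alpha):
    """Residual update: text + tanh(alpha) * Attn(text, audio).

    Assumption: tokens from every auscultation site are concatenated into
    one memory, so text positions can attend across all sites at once.
    """
    memory = [tok for site in site_features for tok in site]
    attended = cross_attention(text_states, memory)
    gate = math.tanh(alpha)  # alpha = 0 closes the gate entirely
    return [[t + gate * a for t, a in zip(ts, at)]
            for ts, at in zip(text_states, attended)]


# Hypothetical toy inputs: 2 text positions, 2 sites with 1 and 2 audio tokens.
text = [[1.0, 0.0], [0.0, 1.0]]
sites = [[[0.5, 0.5]], [[1.0, -1.0], [0.0, 2.0]]]
closed = gated_cross_attention(text, sites, alpha=0.0)
opened = gated_cross_attention(text, sites, alpha=1.0)
```

With `alpha = 0.0` the gate is closed and the text states pass through unchanged, which is why this kind of gate is often paired with a frozen language model: training can start from the model's original behaviour and open the gate gradually. Concatenating sites also means that if one recording is cut short, tokens from the remaining sites are still available to attend to.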

Taxonomy

Topics: Phonocardiography and Auscultation Techniques · Voice and Speech Disorders · COVID-19 diagnosis using AI