H-LDM: Hierarchical Latent Diffusion Models for Controllable and Interpretable PCG Synthesis from Clinical Metadata
Chenyang Xu, Siming Li, Hao Wang

TL;DR
H-LDM is a hierarchical latent diffusion model that generates controllable, clinically accurate PCG signals from metadata, improving data augmentation for cardiac diagnosis and enabling interpretability.
Contribution
It introduces a multi-scale VAE and a hierarchical pipeline with a novel Medical Attention module for physiologically-disentangled, controllable PCG synthesis from clinical metadata.
Findings
Achieves state-of-the-art Fréchet Audio Distance of 9.7
92% attribute disentanglement score
87.1% clinical validity confirmed by cardiologists
Abstract
Phonocardiogram (PCG) analysis is vital for cardiovascular disease diagnosis, yet the scarcity of labeled pathological data hinders the capability of AI systems. To bridge this, we introduce H-LDM, a Hierarchical Latent Diffusion Model for generating clinically accurate and controllable PCG signals from structured metadata. Our approach features: (1) a multi-scale VAE that learns a physiologically-disentangled latent space, separating rhythm, heart sounds, and murmurs; (2) a hierarchical text-to-biosignal pipeline that leverages rich clinical metadata for fine-grained control over 17 distinct conditions; and (3) an interpretable diffusion process guided by a novel Medical Attention module. Experiments on the PhysioNet CirCor dataset demonstrate state-of-the-art performance, achieving a Fréchet Audio Distance of 9.7, a 92% attribute disentanglement score, and 87.1% clinical validity…
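For readers unfamiliar with the headline metric, the following is a minimal sketch of how a Fréchet distance between two sets of audio embeddings is generally computed: fit a Gaussian to each set and compare means and covariances. It is not the authors' evaluation code. The function name `frechet_distance` and the random arrays standing in for embeddings are illustrative assumptions, and the embedding model used in the paper is not stated in this summary.

```python
import numpy as np

def frechet_distance(real_emb, gen_emb):
    """Frechet distance between Gaussians fitted to two embedding sets.

    Each input is an array of shape (n_clips, embedding_dim).
    Hypothetical helper for illustration only.
    """
    real_emb = np.asarray(real_emb, dtype=np.float64)
    gen_emb = np.asarray(gen_emb, dtype=np.float64)

    # Gaussian statistics of each set
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(real_emb, rowvar=False))
    cov_g = np.atleast_2d(np.cov(gen_emb, rowvar=False))

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

# Stand-in data: random vectors in place of real audio embeddings (assumption).
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
generated = rng.normal(loc=0.5, size=(200, 4))
print(frechet_distance(real, generated))
```

Lower values mean the generated embedding distribution is closer to the real one; identical sets give a distance of zero up to floating-point error.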
Taxonomy
Topics: Phonocardiography and Auscultation Techniques · Voice and Speech Disorders · COVID-19 diagnosis using AI
