Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis
Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu

TL;DR
Resp-Agent is a multimodal system that combines active learning, clinical text integration, and synthetic data generation to improve respiratory disease diagnosis, especially under data scarcity and class imbalance.
Contribution
The paper introduces Resp-Agent, a novel active adversarial curriculum framework with a modality-weaving Diagnoser and a flow matching Generator for enhanced respiratory sound analysis.
Findings
Outperforms prior methods in diagnostic accuracy
Improves robustness with limited data
Effectively handles class imbalance
Abstract
Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-ACA). Unlike static pipelines, Thinker-ACA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a modality-weaving Diagnoser that weaves clinical text with audio tokens via strategic global attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data…
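The attention pattern described in the abstract (global attention for clinical text, sparse global anchors among the audio tokens) can be pictured with a small mask-building sketch. This is an illustration under stated assumptions, not the paper's implementation: the function name `build_attention_mask`, the text-first token layout, the local `window`, and the `anchor_stride` are all hypothetical choices made for the example.

```python
def build_attention_mask(n_text, n_audio, window=2, anchor_stride=4):
    """Return (mask, is_global) for a Longformer-style sparse attention pattern.

    mask[i][j] is True when token i may attend to token j.
    Assumed layout: n_text text tokens first, then n_audio audio tokens.
    """
    n = n_text + n_audio
    # Assumption: every text token is global, and every `anchor_stride`-th
    # audio token acts as a sparse global anchor.
    is_global = [
        i < n_text or (i - n_text) % anchor_stride == 0
        for i in range(n)
    ]
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            local = abs(i - j) <= window  # sliding-window attention
            # Global tokens attend everywhere and are attended to by everyone.
            mask[i][j] = local or is_global[i] or is_global[j]
    return mask, is_global


mask, is_global = build_attention_mask(n_text=3, n_audio=8)
```

In this sketch, ordinary audio tokens only see their local neighbourhood (fine-grained transients) plus the global tokens (long-range clinical context), which keeps the number of attended pairs well below the full quadratic count for long recordings.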
Peer Reviews
Decision: ICLR 2026 Poster
The most practical and immediate strength is the size and richness of the new corpus. At 408 hours and 229,000 recordings, this scale is required for a specialized medical domain. Crucially, by linking the acoustic data to expert-level annotations distilled from Electronic Health Records (EHRs), the corpus provides the clinical context that is often missing from public datasets, bridging the gap between raw audio and real-world diagnostic complexity. The system is capable of generatin
The paper falls into the category of applying ML to healthcare. I find it interesting because it combines multiple concepts, but the contribution is incremental. The evaluation is weak. The choice of an LSTM for the text baseline is weak, and using an older attention mechanism for fusion is also suboptimal. I would suggest comparing against a Transformer-based text encoder (e.g., BERT, RoBERTa, or even a small version of the Longformer) for the text-only task. This isolates whether the performance gain is from the
The paper's strengths can be summarised as follows:
- Ambitious, unified scope (which includes dataset, generation, diagnosis, and agent loop). The paper introduces a cross-domain multimodal corpus (Resp-229k) with source-disjoint splits. This anchors the contribution in real distribution shift rather than in-domain evaluation.
- Clear architectural ideas on both sides of the loop. The Generator upgrades a compact LLM via modality injection (BEATs-derived style tokens) to autoregress discrete aco
The paper's weaknesses can be summarised as follows:
- Low macro-F1 in the natural (imbalanced) setting. Before synthetic balancing, macro-F1 is 0.2118 despite high accuracy, which tells me that there might be substantial minority under-diagnosis. However, the paper relies on its own generator to fix this; stronger baselines (e.g., cost-sensitive losses, reweighting, focal/LDAM, mixup/Manifold mixup, class-balanced sampling) should be compared under the same cross-domain split to show generation
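The gap between accuracy and macro-F1 that this review points to is easy to reproduce on a toy example. The sketch below uses made-up labels and counts (not data from the paper) and a hand-written `macro_f1` helper, assumed here for illustration: a classifier that always predicts the majority class scores high accuracy while every minority class gets an F1 of zero.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, over the classes present in y_true."""
    f1_scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes F1 = 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical imbalanced test set: 90 majority samples, 10 minority samples.
y_true = ["healthy"] * 90 + ["copd"] * 5 + ["pneumonia"] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = ["healthy"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

Because macro-F1 weights each class equally, the two ignored minority classes pull the score far below the accuracy, which is why the reviewer reads a low macro-F1 as a sign of minority under-diagnosis.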
1. The paper targets an important problem: the lack of paired multimodal datasets, especially when the text modality is missing.
2. The system design is comprehensive and its flow is easy to follow.
1. Despite the comprehensive, closed-loop system, the novelty lies in the assembly of recent advances such as Longformer and flow-matching models. The core contribution and innovation are hidden and ambiguous.
2. For the generator part, the Generator does make the system closed-loop: the “Diagnoser” classifies real respiratory sounds, then the “Generator” synthesizes new examples. It is not clear whether there is a real feedback loop: the Diagnoser doesn’t meaningfully inform or retrain the Generator;
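For readers trying to picture what a feedback loop from Diagnoser to Generator could mean in its simplest form, the sketch below splits a synthesis budget across classes in proportion to how poorly each class is currently recognised. It is a hypothetical illustration only: the function `allocate_synthesis_budget`, the use of per-class recall as the weakness signal, and the example numbers are assumptions, and nothing here is taken from the paper's Thinker-ACA controller.

```python
def allocate_synthesis_budget(per_class_recall, budget):
    """Split `budget` synthetic samples across classes, weighted by (1 - recall).

    per_class_recall: dict mapping class name -> recall in [0, 1], as measured
    on a validation set (assumed weakness signal for this sketch).
    Counts are rounded independently, so they may not sum exactly to `budget`.
    """
    weakness = {c: 1.0 - r for c, r in per_class_recall.items()}
    total = sum(weakness.values())
    if total == 0:
        # Every class is already perfectly recalled: nothing to synthesize.
        return {c: 0 for c in per_class_recall}
    return {c: round(budget * w / total) for c, w in weakness.items()}


# Hypothetical validation recalls for three classes.
recalls = {"healthy": 0.95, "copd": 0.40, "pneumonia": 0.25}
plan = allocate_synthesis_budget(recalls, budget=1400)
```

Repeating this after each retraining round (measure recall, allocate, synthesize, retrain) would make the Diagnoser's errors steer what gets generated, which is the kind of coupling the review asks the authors to demonstrate.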
Taxonomy
Topics: Phonocardiography and Auscultation Techniques · Machine Learning in Healthcare · COVID-19 diagnosis using AI
