Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

Pengfei Zhang; Tianxin Xie; Minghao Yang; Li Liu

arXiv:2602.15909·eess.AS·March 2, 2026

Open Access 3 Reviews

TL;DR

Resp-Agent is an innovative multimodal system that combines active learning, clinical text integration, and synthetic data generation to improve respiratory disease diagnosis, especially under data scarcity and class imbalance.

Contribution

The paper introduces Resp-Agent, a novel active adversarial curriculum framework with a modality-weaving Diagnoser and a flow matching Generator for enhanced respiratory sound analysis.

Findings

01

Outperforms prior methods in diagnostic accuracy

02

Improves robustness with limited data

03

Effectively handles class imbalance

Abstract

Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^{2}$CA). Unlike static pipelines, Thinker-A$^{2}$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a modality-weaving Diagnoser that weaves clinical text with audio tokens via strategic global attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data…
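The abstract's "strategic global attention and sparse audio anchors" can be pictured as a Longformer-style global-attention mask over a joint text-plus-audio sequence. The following is a minimal sketch of one plausible reading, not the authors' implementation: the function name `build_global_attention_mask`, the text-then-audio token layout, and the fixed `anchor_stride` are all assumptions made for illustration.

```python
def build_global_attention_mask(num_text_tokens, num_audio_tokens, anchor_stride):
    """Return a 0/1 list marking which positions get global attention.

    Assumed layout (hypothetical): [text tokens | audio tokens].
    Every text token is global; every `anchor_stride`-th audio token
    is treated as a sparse anchor. All other audio tokens would rely
    on local (windowed) attention only.
    """
    if anchor_stride <= 0:
        raise ValueError("anchor_stride must be positive")
    if num_text_tokens < 0 or num_audio_tokens < 0:
        raise ValueError("token counts must be non-negative")

    # Clinical text carries long-range context, so mark it all as global.
    mask = [1] * num_text_tokens
    # Audio anchors are placed at a regular stride (an assumption).
    mask += [1 if i % anchor_stride == 0 else 0 for i in range(num_audio_tokens)]
    return mask


if __name__ == "__main__":
    # 3 text tokens followed by 8 audio tokens, one anchor every 4 audio tokens.
    print(build_global_attention_mask(3, 8, 4))
```

With this layout, the number of globally attending positions grows with the text length plus roughly one position per `anchor_stride` audio tokens, which keeps attention cost manageable for long recordings while still letting text and audio exchange information.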

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The most practical and immediate strength is the size and richness of the new corpus. At 408 hours and 229,000 recordings, this scale is required for a specialized medical domain. Crucially, by linking the acoustic data to expert-level annotations distilled from Electronic Health Records (EHRs), this provides the necessary clinical context, which is often missing in public datasets, thus bridging the gap between raw audio and real-world diagnostic complexity. The system is capable of generatin

Weaknesses

The paper falls into the category of applying ML to healthcare. I find it interesting in that it combines multiple concepts, but it is incremental. The evaluation is weak. The choice of an LSTM for the text baseline is weak, and using an older attention mechanism for fusion is also suboptimal. I would suggest comparing against a Transformer-based text encoder (e.g., BERT, RoBERTa, or even a small version of the Longformer) for the text-only task. This isolates whether the performance gain is from the

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The paper's strengths can be summarised as follows:
- Ambitious, unified scope (which includes dataset, generation, diagnosis, and agent loop). The paper introduces a cross-domain multimodal corpus (Resp-229k) with source-disjoint splits. This anchors the contribution in real distribution shift rather than in-domain evaluation.
- Clear architectural ideas on both sides of the loop. The Generator upgrades a compact LLM via modality injection (BEATs-derived style tokens) to autoregress discrete aco

Weaknesses

The paper's weaknesses can be summarised as follows:
- Low macro-F1 in the natural (imbalanced) setting. Before synthetic balancing, macro-F1 is 0.2118 despite high accuracy... this tells me that there might be substantial minority under-diagnosis. However, the paper relies on its own generator to fix this; stronger baselines (e.g., cost-sensitive losses, reweighting, focal/LDAM, mixup/Manifold mixup, class-balanced sampling) should be compared under the same cross-domain split to show generation
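The gap between high accuracy and low macro-F1 that this review describes is easy to reproduce on toy data. The sketch below is illustrative only: the label counts are made up and are not the paper's data, and `inverse_frequency_weights` shows just one simple form of the reweighting baseline the reviewer mentions, with a normalisation chosen here as an assumption.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def inverse_frequency_weights(y_true):
    """Per-class loss weights n / (k * count_c); one common reweighting choice."""
    counts = Counter(y_true)
    n, k = len(y_true), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


if __name__ == "__main__":
    # Toy imbalanced set: 90 samples of class 0, 6 of class 1, 4 of class 2.
    y_true = [0] * 90 + [1] * 6 + [2] * 4
    # A degenerate classifier that always predicts the majority class.
    y_pred = [0] * 100

    print("accuracy:", accuracy(y_true, y_pred))
    print("macro-F1:", round(macro_f1(y_true, y_pred), 4))
    print("class weights:", inverse_frequency_weights(y_true))
```

In this toy case the majority-only classifier reaches 0.9 accuracy while its macro-F1 is roughly 0.316, because the two minority classes each contribute an F1 of zero. The inverse-frequency weights show how a cost-sensitive loss would penalise errors on the rare classes more heavily.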

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper targets an important question: the lack of paired multimodal datasets, especially when the text modality is missing.
2. The system design is comprehensive and its flow is clear to follow.

Weaknesses

1. Despite the comprehensive, closed-loop system, the novelty lies in the assembly of recent advances such as Longformer and flow-matching models. The core contribution and innovation are hidden and ambiguous.
2. For the generator part, the Generator does make the system closed-loop: the “Diagnoser” classifies real respiratory sounds, then the “Generator” synthesizes new examples. It is not clear whether there is a real feedback loop: the Diagnoser doesn’t meaningfully inform or retrain the Generator;

Taxonomy

Topics: Phonocardiography and Auscultation Techniques · Machine Learning in Healthcare · COVID-19 diagnosis using AI