TL;DR
Echo-α is a multimodal reasoning model that unifies lesion detection and clinical reasoning for ultrasound interpretation, achieving superior accuracy and interpretability across multiple benchmarks.
Contribution
It introduces an agentic framework that combines specialized detectors with global reasoning, trained via supervised curriculum and reinforcement learning.
Findings
Outperforms baselines on renal and breast ultrasound benchmarks.
Achieves 56.73%/43.78% on the grounding metric in cross-center tests.
Reaches 74.90%/49.20% accuracy in diagnosis for renal/breast ultrasound.
Abstract
Ultrasound interpretation requires both precise lesion localization and holistic clinical reasoning, yet existing methods typically excel at only one of these capabilities: specialized detectors offer strong localization but limited reasoning, whereas multimodal large language models (MLLMs) provide flexible reasoning but weak grounding in specialized medical domains. We present Echo-α, an agentic multimodal reasoning model for ultrasound interpretation that unifies these strengths within an invoke-and-reason framework. Echo-α is trained to coordinate organ-specific detector outputs, integrate them with global visual context, and convert the resulting evidence into grounded diagnostic decisions beyond detector-only inference. This behavior is established through a nine-task supervised curriculum and then refined by sequential reinforcement learning under different reward…
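The invoke-and-reason control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Detection` type, the `invoke_and_reason` function, the `min_score` threshold, and the toy detector and reasoner are all hypothetical names introduced here, and the real model performs the reasoning step with a trained MLLM rather than a hand-written rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical pixel coords


@dataclass
class Detection:
    box: Box
    label: str
    score: float


# Hypothetical interfaces: a detector maps an image to candidate lesions,
# and a reasoner maps (image, evidence) to a diagnostic statement.
Detector = Callable[[object], List[Detection]]
Reasoner = Callable[[object, List[Detection]], str]


def invoke_and_reason(
    image: object,
    organ: str,
    detectors: Dict[str, Detector],
    reason: Reasoner,
    min_score: float = 0.3,  # assumed confidence cutoff, not from the paper
) -> Tuple[List[Detection], str]:
    """Invoke the organ-specific detector, then reason over its evidence."""
    detector = detectors.get(organ)
    # With no registered detector, reasoning falls back to the image alone.
    candidates = detector(image) if detector is not None else []
    evidence = [d for d in candidates if d.score >= min_score]
    return evidence, reason(image, evidence)


def toy_renal_detector(image: object) -> List[Detection]:
    # Stand-in for a specialized detector; returns fixed boxes.
    return [
        Detection((10.0, 20.0, 60.0, 80.0), "cyst", 0.9),
        Detection((5.0, 5.0, 15.0, 15.0), "mass", 0.1),
    ]


def toy_reasoner(image: object, evidence: List[Detection]) -> str:
    # Stand-in for the MLLM reasoning step.
    if not evidence:
        return "no lesion found"
    top = max(evidence, key=lambda d: d.score)
    return f"suspected {top.label}"


registry = {"renal": toy_renal_detector}
evidence, diagnosis = invoke_and_reason(None, "renal", registry, toy_reasoner)
```

The point of the sketch is the separation of concerns: localization comes from a swappable per-organ detector, while the final decision is made by a separate step that sees both the image and the filtered detector evidence.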
