Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus

Johannes Moll; Jannik Lübberstedt; Christoph Nuernbergk; Jacob Stroh; Luisa Mertens; Anna Purcarea; Christopher Zirn; Zeineb Benchaaben; Fabian Drexel; Hartmut Häntze; Anirudh Narayanan; Friedrich Puttkammer; Andrei Zhukov; Jacqueline Lammert; Sebastian Ziegelmayer; Markus Graf; Marion Högner; Marcus Makowski; Florian Bassermann; Lisa C. Adams; Jiazhen Pan; Daniel Rueckert; Krischan Braitsch; and Keno K. Bressem

arXiv:2604.24473·cs.AI·April 28, 2026

TL;DR

This study evaluates an agentic reasoning system for synthesizing longitudinal myeloma records. The system reaches higher concordance with expert consensus than retrieval-augmented generation baselines, especially on complex cases, but prospective validation is still needed.

Contribution

The paper introduces an agentic reasoning approach that outperforms retrieval-augmented generation baselines in clinical record synthesis for myeloma, especially on complex and lengthy cases.

Findings

01 Agentic system achieved 79.6% concordance, exceeding baselines.

02 Gains increased with question complexity and record length.

03 System errors were often clinically significant, comparable to expert disagreement.

Abstract

Multiple myeloma is managed through sequential lines of therapy over years to decades, with each decision depending on cumulative disease history distributed across dozens to hundreds of heterogeneous clinical documents. Whether LLM-based systems can synthesise this evidence at a level approaching expert agreement has not been established. A retrospective evaluation was conducted on longitudinal clinical records of 811 myeloma patients treated at a tertiary centre (2001-2026), covering 44,962 documents and 1,334,677 laboratory values, with external validation on MIMIC-IV. An agentic reasoning system was compared against single-pass retrieval-augmented generation (RAG), iterative RAG, and full-context input on 469 patient-question pairs from 48 templates at three complexity levels. Reference labels came from double annotation by four oncologists with senior haematologist adjudication.…
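
The page does not define how concordance is computed. As a minimal sketch, assuming it is the fraction of patient-question pairs whose system answer matches the adjudicated reference label, the overall and per-complexity-level figures could be tallied as below. The function name `concordance_by_level`, the record layout, and the toy answers are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def concordance_by_level(records):
    """Return (overall, per_level) concordance.

    records: iterable of (complexity_level, system_answer, reference_answer).
    Assumption: concordance = exact matches / total pairs. The paper may
    use a different matching rule (e.g. clinician-judged equivalence).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for level, system_answer, reference_answer in records:
        totals[level] += 1
        if system_answer == reference_answer:
            hits[level] += 1
    per_level = {level: hits[level] / totals[level] for level in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_level


# Toy, made-up records for illustration only (not study data).
toy_records = [
    (1, "A", "A"),
    (1, "B", "B"),
    (2, "A", "A"),
    (2, "B", "C"),
    (3, "A", "C"),
]

overall, per_level = concordance_by_level(toy_records)
print(f"overall concordance: {overall:.1%}")
for level in sorted(per_level):
    print(f"level {level}: {per_level[level]:.1%}")
```

Grouping by complexity level mirrors the three-level template design described in the abstract, so the same tally could be repeated for each compared system to see where the gaps open up.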
