MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

Tim Cofala; Christian Kalfar; Jingge Xiao; Johanna Schrader; Michelle Tang; Wolfgang Nejdl

arXiv:2512.11682·cs.AI·December 15, 2025

TL;DR

This paper evaluates TxAgent, an agentic AI system for therapeutic decision-making, demonstrating how retrieval-augmented reasoning and tool integration improve safety and accuracy in clinical AI applications, as showcased in the NeurIPS CURE-Bench Challenge.

Contribution

It introduces a novel evaluation protocol for medical AI reasoning and tool usage, highlighting the impact of retrieval quality on therapeutic decision-making performance.

Findings

01 Retrieval quality significantly affects model accuracy.

02 Improved tool-retrieval strategies enhance reasoning performance.

03 TxAgent achieved the NeurIPS CURE-Bench Excellence Award.

Abstract

Therapeutic decision-making in clinical medicine is a high-stakes domain in which AI guidance must account for complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating the FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy…
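The iterative retrieve-then-call pattern described in the abstract can be illustrated with a toy sketch. This is not TxAgent or ToolUniverse code: the `Tool` class, the tool names, the stub outputs, and the word-overlap ranking are all hypothetical stand-ins chosen for illustration, and no language model is involved. A real system would presumably use a learned retriever and let the model decide which function to call and with what arguments.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tool:
    """Hypothetical tool record; not the ToolUniverse schema."""
    name: str
    description: str
    run: Callable[[str], str]


def retrieve_tools(query: str, tools: List[Tool], k: int = 1) -> List[Tool]:
    """Rank tools by word overlap between the query and each description.

    Word overlap is a simple stand-in for a learned tool retriever.
    """
    query_words = set(query.lower().split())

    def overlap(tool: Tool) -> int:
        return len(query_words & set(tool.description.lower().split()))

    return sorted(tools, key=overlap, reverse=True)[:k]


def run_agent(question: str, tools: List[Tool], max_steps: int = 3) -> List[str]:
    """Call the best-matching unused tool at each step and collect evidence."""
    evidence: List[str] = []
    remaining = list(tools)
    for _ in range(max_steps):
        ranked = retrieve_tools(question, remaining, k=1)
        if not ranked:
            break  # every tool has already been called
        tool = ranked[0]
        evidence.append(f"{tool.name}: {tool.run(question)}")
        remaining.remove(tool)
    return evidence


# Stub tools with made-up names and canned outputs (assumptions, not real APIs).
TOOLS = [
    Tool("drug_label_lookup",
         "look up drug label warnings and contraindications",
         lambda q: "stub label text"),
    Tool("target_disease_lookup",
         "find gene target associations for a disease",
         lambda q: "stub association"),
]

question = "what are the contraindications and warnings for this drug"
print(run_agent(question, TOOLS, max_steps=1))
```

In this toy setup, a poorly worded tool description or query changes which tool is ranked first, which is one way to see why retrieval quality could affect downstream answers.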

Taxonomy

Topics: Machine Learning in Healthcare · Biomedical Text Mining and Ontologies · Topic Modeling