VoxelPrompt: A Vision Agent for End-to-End Medical Image Analysis
Andrew Hoopes, Neel Dey, Victor Ion Butoi, John V. Guttag, Adrian V. Dalca

TL;DR
VoxelPrompt is an end-to-end medical image analysis agent that uses natural language prompts to automate complex radiological tasks, matching specialist accuracy across diverse neuroimaging applications.
Contribution
It introduces a novel framework combining language models and adaptable vision networks to automate and generalize medical image analysis workflows.
Findings
Accurately delineates anatomical and pathological features
Measures complex morphological properties
Performs open-language lesion analysis
Abstract
We present VoxelPrompt, an end-to-end image analysis agent that tackles free-form radiological tasks. Given any number of volumetric medical images and a natural language prompt, VoxelPrompt integrates a language model that generates executable code to invoke a jointly-trained, adaptable vision network. This code further carries out analytical steps to address practical quantitative aims, such as measuring the growth of a tumor across visits. The pipelines generated by VoxelPrompt automate analyses that currently require practitioners to painstakingly combine multiple specialized vision and statistical tools. We evaluate VoxelPrompt using diverse neuroimaging tasks and show that it can delineate hundreds of anatomical and pathological features, measure complex morphological properties, and perform open-language analysis of lesion characteristics. VoxelPrompt performs these objectives…
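As a rough illustration of the kind of pipeline the abstract describes (code that calls a vision model and then carries out a quantitative step such as measuring lesion growth across visits), the following is a minimal sketch. It is not VoxelPrompt's generated code or API: `segment_lesion` is a hypothetical threshold-based stand-in for the vision network, and `mask_volume_mm3`, `lesion_growth`, and the toy volumes are assumptions made purely for illustration.

```python
import numpy as np


def segment_lesion(volume, threshold=0.5):
    """Hypothetical stand-in for a vision network: returns a binary mask.

    A real system would run a learned segmentation model here; a plain
    intensity threshold is used only to keep the sketch self-contained.
    """
    return volume > threshold


def mask_volume_mm3(mask, voxel_size_mm):
    """Volume of a binary mask: voxel count times the volume of one voxel."""
    return float(mask.sum()) * float(np.prod(voxel_size_mm))


def lesion_growth(baseline, followup, voxel_size_mm=(1.0, 1.0, 1.0)):
    """Segment both visits and report absolute and relative volume change."""
    v0 = mask_volume_mm3(segment_lesion(baseline), voxel_size_mm)
    v1 = mask_volume_mm3(segment_lesion(followup), voxel_size_mm)
    percent = 100.0 * (v1 - v0) / v0 if v0 > 0 else float("nan")
    return {
        "baseline_mm3": v0,
        "followup_mm3": v1,
        "change_mm3": v1 - v0,
        "percent_change": percent,
    }


# Toy volumes (assumed data): a 2x2x2 "lesion" that grows to 3x3x3 voxels.
baseline = np.zeros((10, 10, 10))
baseline[4:6, 4:6, 4:6] = 1.0
followup = np.zeros((10, 10, 10))
followup[3:6, 3:6, 3:6] = 1.0

result = lesion_growth(baseline, followup, voxel_size_mm=(1.0, 1.0, 2.0))
print(result)
```

In this sketch the segmentation is an intermediate step and the reported quantity is the volume change between visits, which mirrors the role the abstract assigns to segmentation within a larger analysis.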
Peer Reviews
Decision·Submitted to ICLR 2026
The proposed method stands out for its originality, offering a fundamentally new approach in both functionality and design compared to existing medical image frameworks. Its flexibility and performance are highly compelling.
* The innovative use of code as an output of a language model, enabling segmentation to serve its true role as an intermediate step toward downstream clinical or analytical objectives.
* A thorough and convincing evaluation that clearly demonstrates the method’s effectivene
The primary limitation of the paper lies in the presentation of the proposed method. Its exact capabilities and constraints are not clearly defined. Crucial details regarding the architecture, datasets, and evaluation are relegated to the appendix, forcing the reader to consult supplementary material to grasp the approach.
* The use of the term “zero-shot” is misleading, as the model has been exposed to the same ROIs during training; the evaluation therefore reflects domain transfer rather than
The paper is well structured and easy for readers to follow.
1. An Inefficient and "Weak" Agent: The paper's most significant limitation is that its agent is trained from scratch on a curated, domain-specific dataset. This results in an agent that is, by definition, "weaker" and less capable in its reasoning, language understanding, and code-generation abilities than any modern, general-purpose foundation model (such as the Gemini or GPT series). The field has largely demonstrated that the emergent reasoning and planning capabilities of large-scale models
The joint training of language and vision components for code-based workflow generation is interesting. The code generation approach provides transparency and interpretability compared to black-box vision-language models, which is important for clinical applications.
1. "End-to-end" is claimed throughout, but most quantitative evaluations are on segmentation subtasks, not complete clinical workflows.
2. Training the language model from scratch on synthetic template-based prompts is a critical limitation explicitly acknowledged: "limits their utility when given entirely unseen prompts". The system can only use predefined library functions, limiting true flexibility.
3. Missing quantitative evaluation of the complex multi-step workflows shown in Figure 1. And
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
