VoxelPrompt: A Vision Agent for End-to-End Medical Image Analysis

Andrew Hoopes; Neel Dey; Victor Ion Butoi; John V. Guttag; Adrian V. Dalca

arXiv:2410.08397 · eess.IV · October 17, 2025 · 3 cites

TL;DR

VoxelPrompt is an end-to-end medical image analysis agent that uses natural language prompts to automate complex radiological tasks, matching specialist accuracy across diverse neuroimaging applications.

Contribution

The paper introduces a framework that combines a language model with an adaptable vision network to automate and generalize medical image analysis workflows.

Findings

01. Accurately delineates anatomical and pathological features
02. Measures complex morphological properties
03. Performs open-language lesion analysis

Abstract

We present VoxelPrompt, an end-to-end image analysis agent that tackles free-form radiological tasks. Given any number of volumetric medical images and a natural language prompt, VoxelPrompt integrates a language model that generates executable code to invoke a jointly-trained, adaptable vision network. This code further carries out analytical steps to address practical quantitative aims, such as measuring the growth of a tumor across visits. The pipelines generated by VoxelPrompt automate analyses that currently require practitioners to painstakingly combine multiple specialized vision and statistical tools. We evaluate VoxelPrompt using diverse neuroimaging tasks and show that it can delineate hundreds of anatomical and pathological features, measure complex morphological properties, and perform open-language analysis of lesion characteristics. VoxelPrompt performs these objectives…
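The abstract's example of measuring tumor growth across visits can be made concrete with a minimal sketch. This is not VoxelPrompt's generated code or its library API, neither of which appears on this page. The function names `segment_stub`, `mask_volume_mm3`, and `growth_between_visits` are hypothetical, and a simple intensity threshold stands in for the trained vision network. The sketch only illustrates the kind of pipeline described: segment each visit, convert voxel counts to physical volume, and compare.

```python
from math import prod


def segment_stub(volume, threshold=0.5):
    """Hypothetical stand-in for a vision network: threshold a 3D volume
    (nested lists of intensities) into a binary mask of the same shape."""
    return [[[1 if v > threshold else 0 for v in row] for row in sl]
            for sl in volume]


def mask_volume_mm3(mask, spacing_mm):
    """Physical volume of a binary mask: voxel count times single-voxel volume."""
    n_voxels = sum(v for sl in mask for row in sl for v in row)
    return n_voxels * prod(spacing_mm)


def growth_between_visits(baseline, followup, spacing_mm):
    """Segment both visits and report absolute and relative volume change."""
    vol_a = mask_volume_mm3(segment_stub(baseline), spacing_mm)
    vol_b = mask_volume_mm3(segment_stub(followup), spacing_mm)
    change = vol_b - vol_a
    percent = 100.0 * change / vol_a if vol_a else None
    return {
        "baseline_mm3": vol_a,
        "followup_mm3": vol_b,
        "change_mm3": change,
        "percent_change": percent,
    }


# Toy 2x2x2 volumes (illustrative values, not real data).
visit_1 = [[[0.9, 0.1], [0.2, 0.0]], [[0.8, 0.0], [0.0, 0.0]]]
visit_2 = [[[0.9, 0.7], [0.2, 0.0]], [[0.8, 0.6], [0.0, 0.0]]]

report = growth_between_visits(visit_1, visit_2, spacing_mm=(1.0, 1.0, 2.0))
print(report)
```

In the toy data the lesion grows from two voxels to four at 2 mm³ per voxel, so the reported volume doubles. A real pipeline would also need to check that both visits share the same voxel spacing, or resample them, before comparing.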

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 5

Strengths

The proposed method stands out for its originality, offering a fundamentally new approach in both functionality and design compared to existing medical image frameworks. Its flexibility and performance are highly compelling.
* The innovative use of code as an output of a language model, enabling segmentation to serve its true role as an intermediate step toward downstream clinical or analytical objectives.
* A thorough and convincing evaluation that clearly demonstrates the method’s effectivene

Weaknesses

The primary limitation of the paper lies in the presentation of the proposed method. Its exact capabilities and constraints are not clearly defined. Crucial details regarding the architecture, datasets, and evaluation are relegated to the appendix, forcing the reader to consult supplementary material to grasp the approach.
* The use of the term “zero-shot” is misleading, as the model has been exposed to the same ROIs during training; the evaluation therefore reflects domain transfer rather than

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The paper is well structured and easy to follow.

Weaknesses

1. An Inefficient and "Weak" Agent: The paper's most significant limitation is that its agent is trained from scratch on a curated, domain-specific dataset. This results in an agent that is, by definition, "weaker" and less capable in its reasoning, language understanding, and code-generation abilities than any modern, general-purpose foundation model (such as the Gemini or GPT series). The field has largely demonstrated that the emergent reasoning and planning capabilities of large-scale models

Reviewer 03 · Rating 4 · Confidence 5

Strengths

The joint training of language and vision components for code-based workflow generation is interesting. The code generation approach provides transparency and interpretability compared to black-box vision-language models, which is important for clinical applications.

Weaknesses

1. "End-to-end" is claimed throughout, but most quantitative evaluations are on segmentation subtasks, not complete clinical workflows.
2. Training the language model from scratch on synthetic template-based prompts is a critical limitation explicitly acknowledged: "limits their utility when given entirely unseen prompts". The system can only use predefined library functions, limiting true flexibility.
3. Missing quantitative evaluation of the complex multi-step workflows shown in Figure 1. And

Code & Models

Repositories

dalcalab/voxel · PyTorch · Official

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications