AutoMiSeg: Automatic Medical Image Segmentation via Test-Time Adaptation of Foundation Models

Xingjian Li; Qifeng Wu; Adithya S. Ubaradka; Yiran Ding; Colleen Que; Runmin Jiang; Jianhua Xing; Tianyang Wang; Min Xu

arXiv:2505.17931·cs.CV·October 7, 2025

TL;DR

AutoMiSeg introduces a zero-shot, fully automatic medical image segmentation pipeline that leverages foundation models and test-time adaptation, significantly improving accuracy across diverse datasets without requiring manual annotations.

Contribution

The paper presents a novel zero-shot segmentation method combining foundation models with test-time adaptation and Bayesian optimization, eliminating the need for extensive annotations or prompts.

Findings

01. Achieves 69% relative improvement in Dice score over previous methods.

02. Demonstrates strong performance across seven diverse medical imaging datasets.

03. Outperforms previous best methods in accuracy without manual annotations.

Abstract

Medical image segmentation is vital for clinical diagnosis, yet current deep learning methods often demand extensive expert effort, either by annotating large training datasets or by providing prompts at inference time for each new case. This paper introduces a zero-shot and automatic segmentation pipeline that combines off-the-shelf vision-language and segmentation foundation models. Given a medical image and a task definition (e.g., "segment the optic disc in an eye fundus image"), our method uses a grounding model to generate an initial bounding box, followed by a visual prompt boosting module that enhances the prompts, which are then processed by a promptable segmentation model to produce the final mask. To address the challenges of domain gap and result verification, we introduce a test-time adaptation framework featuring a set of learnable adaptors that align the medical…
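
The adaptation idea in the abstract (adjust the input with adaptors, segment, score the result without ground truth, keep the best setting) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the adaptor is a hypothetical gamma-and-gain transform, `toy_segment` stands in for the grounding, prompt boosting, and segmentation chain, `toy_validator` stands in for the proxy validator, and plain random search replaces the Bayesian optimization the paper uses.

```python
import random


def adapt(image, gamma, gain):
    """Hypothetical adaptor: gamma correction, then a gain, clipped at 1.0."""
    return [[min(1.0, gain * (p ** gamma)) for p in row] for row in image]


def toy_segment(image, threshold=0.5):
    """Stand-in for grounding + prompt boosting + promptable segmentation."""
    return [[1 if p > threshold else 0 for p in row] for row in image]


def toy_validator(image, mask):
    """Stand-in proxy score: mean foreground intensity minus mean background."""
    fg = [p for row, mrow in zip(image, mask) for p, m in zip(row, mrow) if m]
    bg = [p for row, mrow in zip(image, mask) for p, m in zip(row, mrow) if not m]
    if not fg or not bg:
        return 0.0  # a degenerate mask (all or nothing) gets the lowest score
    return sum(fg) / len(fg) - sum(bg) / len(bg)


def run_adaptation(image, n_trials=100, seed=0):
    """Random search over adaptor parameters (NOT Bayesian optimization)."""
    rng = random.Random(seed)
    best_score, best_params, best_mask = float("-inf"), None, None
    for _ in range(n_trials):
        params = {"gamma": rng.uniform(0.5, 2.0), "gain": rng.uniform(0.5, 1.5)}
        adapted = adapt(image, **params)
        mask = toy_segment(adapted)
        score = toy_validator(adapted, mask)  # no ground-truth mask is used
        if score > best_score:
            best_score, best_params, best_mask = score, params, mask
    return best_params, best_mask, best_score


# Toy 3x3 "image" with intensities in [0, 1] and a bright lower-right region.
image = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.8], [0.1, 0.8, 0.9]]
params, mask, score = run_adaptation(image, n_trials=20)
```

A real Bayesian optimizer would fit a surrogate model to the (parameters, score) pairs and pick the next trial from it rather than sampling uniformly; the loop structure and the reliance on a label-free score are the only parts this sketch is meant to convey.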

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01·Rating 6·Confidence 5

Strengths

1. The method is extensively evaluated on seven diverse medical imaging datasets, where it substantially outperforms all existing zero-shot baselines and achieves competitive performance with weakly-supervised interactive models.
2. The Bayesian Optimization (BO) framework for Test-Time Adaptation is practical and efficient. It is guided by the custom proxy validator, whose score is demonstrated to have a strong correlation with the true segmentation performance (as measured by the Dice score)
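
For readers unfamiliar with the two quantities this reviewer mentions, the following is a minimal sketch of how a Dice score and a correlation between proxy scores and Dice scores are commonly computed. The function names and the choice of Pearson correlation are assumptions for illustration; the excerpt does not state which correlation measure the paper reports.

```python
import math


def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as nested lists."""
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(p) + sum(t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)
```

In a check of the kind the reviewer describes, `xs` would hold proxy-validator scores and `ys` the Dice scores of the same candidate masks against ground truth; a value near 1 indicates that ranking by the proxy tends to agree with ranking by Dice.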

Weaknesses

1. A notable methodological limitation lies in the design of the Proxy Validator. The validation process relies on evaluating the segmented region in isolation by masking out the background. While this is an effective proxy in many scenarios, it overlooks a fundamental aspect of medical image interpretation: context can be critical for accurate identification. By isolating the candidate region, the validator discards this contextual information. Although the results demonstrate a strong correlat
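
The masking step this weakness refers to can be pictured with a small sketch. It assumes images and masks are nested lists of equal shape and that background pixels are replaced by a constant fill value; the helper name and the fill value are hypothetical, and the actual validator input in the paper may be prepared differently.

```python
def mask_out_background(image, mask, fill=0.0):
    """Keep pixels where mask is truthy; replace every other pixel with `fill`."""
    if len(image) != len(mask) or any(
        len(row) != len(mrow) for row, mrow in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same shape")
    return [
        [p if m else fill for p, m in zip(row, mrow)]
        for row, mrow in zip(image, mask)
    ]


# Toy example: only the masked 2x2 corner survives; surrounding context is gone.
image = [[0.3, 0.4, 0.2], [0.5, 0.9, 0.8], [0.4, 0.8, 0.9]]
mask = [[0, 0, 0], [0, 1, 1], [0, 1, 1]]
isolated = mask_out_background(image, mask)
```

After masking, the validator sees only the candidate region, so any cue that lives in the surrounding tissue (the 0.3 to 0.5 values in this toy example) is no longer available to it, which is the concern raised above.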

Reviewer 02·Rating 2·Confidence 4

Strengths

1. The proposed method is well-motivated in being fully automatic and training-free, thereby enabling zero-shot medical image segmentation.
2. The authors introduce Learnable Test-time Adaptors (LTAs) and a surrogate validation model to evaluate the segmentation outputs, whose feedback is utilized to optimize the LTAs.

Weaknesses

1. Limited Novelty: The core technical components of the pipeline are not novel. The paper primarily combines existing techniques: a grounding model (CogVLM/Grounding DINO), a feature-based prompt booster (inspired by CoVP), and a segmenter (SAM) are connected sequentially. The novelty lies in the specific integration and application to this problem, rather than in the invention of new core algorithms.
2. The evaluation of zero-shot baselines may not be fully comprehensive. Notably, the cited Sa

Reviewer 03·Rating 4·Confidence 4

Strengths

1. This work is likely the first to combine zero-shot segmentation based on foundation models with test-time adaptation, forming a fully automatic and training-free system.
2. The modular architecture is well-structured, achieving a stable compositional zero-shot inference process through a clear task decomposition (Grounding → Prompt Boosting → Segmentation).
3. The experiments cover 7 medical imaging datasets (including fundus, ultrasound, MRI, endoscopy, and dermoscopy), providing compr

Weaknesses

1. The core technical novelty is relatively limited, as the main contribution lies in system integration and pipeline optimization, while each module is built upon existing public methods. It is recommended to open-source the full zero-shot segmentation pipeline, which would significantly enhance the paper’s community value.
2. The Bayesian Optimization process involves 100 iterations per dataset, yet the inference time and computational cost are not sufficiently reported or discussed.

Reviewer 04·Rating 6·Confidence 4

Strengths

- Clear motivation and solid empirical support: The manuscript is generally well written and easy to follow. The identified problem is timely, and the proposed approach achieves competitive (often SOTA) performance according to the reported experiments.
- Ingenious combination of known components: Test-time augmentation, AutoML/Bayesian optimization, and prompt refinement are combined to address complementary facets of the problem in a thoughtful way.
- Automation and practicality: The pipeline

Weaknesses

- Limited exposition of module interactions: While each component is described, the paper would benefit from a clearer explanation—ideally a schematic—of how the modules interact end to end (prompting - refinement - TTA - Bayesian optimization). Empirical analysis of interdependencies is also missing. In particular, please assess robustness when upstream (pre-trained) elements underperform: How sensitive is the pipeline to weaker grounding models or to degradation/failures in the refinement bloc
