CoTBox-TTT: Grounding Medical VQA with Visual Chain-of-Thought Boxes During Test-time Training

Jiahe Qian; Yuhao Shen; Zhangtianyi Chen; Juexiao Zhou; Peisong Wang

arXiv:2511.12446 · cs.CV · November 18, 2025

TL;DR

CoTBox-TTT improves medical visual question answering under domain shift by adapting soft prompts at inference time, without labels, so that answers are grounded in question-relevant image regions.

Contribution

The paper introduces a test-time training method that adapts vision-language models using visual chain-of-thought signals, without additional labels.

Findings

01 Increases accuracy by 12.3% on PathVQA.

02 Remains effective under domain shift.

03 Plug-and-play with various backbones.

Abstract

Medical visual question answering could support clinical decision making, yet current systems often fail under domain shift and produce answers that are weakly grounded in image evidence. This reliability gap arises when models attend to spurious regions and when retraining or additional labels are impractical at deployment time. We address this setting with CoTBox-TTT, an evidence-first test-time training approach that adapts a vision-language model at inference while keeping all backbones frozen. The method updates only a small set of continuous soft prompts. It identifies question-relevant regions through a visual chain-of-thought signal and encourages answer consistency across the original image and a localized crop. The procedure is label-free and plug-and-play with diverse backbones. Experiments on medical VQA show that the approach is practical for real deployments. For…
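
The abstract describes an objective that encourages answer consistency between the original image and a localized crop. The sketch below illustrates one way such a label-free signal could be computed. It is not taken from the paper: the symmetric KL divergence, the half-open box convention, and the helper names (crop, symmetric_kl, consistency_loss) are all assumptions made for illustration, and the logits stand in for whatever answer scores a frozen backbone would produce.

```python
import math


def softmax(logits):
    """Convert raw answer scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def crop(image, box):
    """Cut a region out of an image stored as a list of pixel rows.

    box = (x0, y0, x1, y1) in pixel coordinates, half-open on the
    right and bottom edges (an assumed convention, not the paper's).
    """
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def symmetric_kl(p, q, eps=1e-12):
    """Average of KL(p || q) and KL(q || p); zero when p equals q."""
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def consistency_loss(full_logits, crop_logits):
    """Hypothetical label-free loss: penalize disagreement between the
    answer distribution for the full image and for the localized crop."""
    return symmetric_kl(softmax(full_logits), softmax(crop_logits))


if __name__ == "__main__":
    # Toy 4x4 "image" and a box around its central 2x2 region.
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    region = crop(image, (1, 1, 3, 3))
    print("cropped region:", region)

    # Made-up answer scores for a two-way (yes/no) question.
    agree = consistency_loss([2.0, -1.0], [2.0, -1.0])
    disagree = consistency_loss([2.0, -1.0], [-1.0, 2.0])
    print("loss when views agree:", agree)
    print("loss when views disagree:", disagree)
```

In a test-time training loop of the kind the abstract outlines, a loss like this would be minimized with respect to the soft prompt parameters only, leaving the backbone weights untouched. No ground-truth answer enters the computation, which is what makes the signal label-free.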

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling