Test-Time Computing for Referring Multimodal Large Language Models

Mingrui Wu; Hao Chen; Jiayi Ji; Xiaoshuai Sun; Zhiyuan Liu; Liujuan Cao; Ming-Ming Cheng; Rongrong Ji

arXiv:2602.19505 · cs.CV · February 24, 2026

TL;DR

ControlMLLM++ introduces a test-time adaptation method for multimodal large language models that uses visual prompts and attention steering to enable fine-grained visual reasoning without retraining.

Contribution

The paper presents a framework that injects learnable visual prompts into a frozen MLLM during inference, improving region-level reasoning and interpretability without fine-tuning the model.

Findings

01. Effective across diverse visual prompt types

02. Strong out-of-domain generalization

03. Enhanced interpretability of visual reasoning

Abstract

We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention towards user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain…
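
The core idea in the abstract, optimizing a latent modifier on the visual tokens at inference time so that attention moves toward a user-specified region, can be illustrated with a toy sketch. This is not the authors' implementation: the single-head attention, the energy E = -log(attention mass inside the region), the plain gradient-descent loop, and all names (`region_attention`, `steer`, `P`, `mask`) are assumptions made for illustration. Optim++ and PromptDebias are not modeled here.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array of logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def region_attention(q, V, P, mask):
    # Attention of one text query q over visual tokens V shifted by the
    # latent modifier P; also returns the attention mass inside the region.
    d = q.shape[0]
    a = softmax((V + P) @ q / np.sqrt(d))
    return a, a[mask].sum()

def steer(q, V, mask, steps=50, lr=0.5):
    # Gradient descent on the assumed energy E = -log(sum of attention in mask).
    # Only P is updated; q and V stand in for the frozen model's activations.
    P = np.zeros_like(V)
    d = q.shape[0]
    for _ in range(steps):
        a, s = region_attention(q, V, P, mask)
        # dE/dz_j = a_j - mask_j * a_j / s, where z are the attention logits.
        dE_dz = a - np.where(mask, a / s, 0.0)
        # Chain rule: dz_j/dP_j = q / sqrt(d).
        P -= lr * np.outer(dE_dz, q) / np.sqrt(d)
    return P

rng = np.random.default_rng(0)
V = rng.normal(size=(16, 8))        # 4x4 grid of visual tokens, dimension 8
q = rng.normal(size=8)              # one text-token query
mask = np.zeros(16, dtype=bool)
mask[[5, 6, 9, 10]] = True          # central 2x2 box as the referred region

_, before = region_attention(q, V, np.zeros_like(V), mask)
P = steer(q, V, mask)
_, after = region_attention(q, V, P, mask)
print(f"attention mass in region: before={before:.3f}, after={after:.3f}")
```

In this toy setting the attention mass inside the box grows as `P` is optimized, while the "model" itself (`q` and `V`) is never changed. A box, mask, scribble, or point would differ only in how `mask` is built.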

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis