ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

Mingrui Wu; Xinyue Cai; Jiayi Ji; Jiale Li; Oucheng Huang; Gen Luo; Hao Fei; Guannan Jiang; Xiaoshuai Sun; Rongrong Ji

arXiv:2407.21534 · cs.CV · January 8, 2025 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces a training-free visual prompt learning method for Multimodal Large Language Models, enabling detailed region referencing and reasoning without retraining, by optimizing a latent variable at test time.

Contribution

It proposes a novel test-time optimization approach to inject visual prompts into MLLMs, enhancing referring capabilities without additional training or fine-tuning.

Findings

01. Out-of-domain generalization demonstrated

02. Supports various referring modalities (box, mask, scribble, point)

03. Improves interpretability of MLLMs

Abstract

In this work, we propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs) through test-time optimization of a learnable latent variable. We observe that attention, as the core module of MLLMs, connects text prompt tokens and visual tokens, ultimately determining the final results. Our approach involves adjusting visual tokens from the MLP output at test time, controlling the attention response to ensure text prompt tokens attend to visual tokens in referring regions. We optimize a learnable latent variable based on an energy function, enhancing the strength of referring regions in the attention map. This enables detailed region description and reasoning without the need for substantial training costs or model retraining. Our method offers a promising direction for integrating referring abilities into MLLMs, and supports referring with box,…
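The mechanism the abstract describes can be illustrated with a toy sketch: a single text-query vector attends over a handful of visual tokens, a learnable latent variable is added to those tokens, and gradient descent on an energy function raises the attention mass inside the referred region. Everything here is an assumption made for illustration, not the authors' implementation: the function names, the single-head single-query setup, the energy `(1 - region_mass)^2`, the hand-derived gradient, and the step size are all hypothetical stand-ins for what a real MLLM would compute across layers and prompt tokens.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_attention(query, visual_tokens, latent, region):
    """Attention of one query over (visual_tokens + latent); returns the
    attention weights and the total weight falling inside `region`."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * (v + p) for q, v, p in zip(query, tok, lat)) / scale
        for tok, lat in zip(visual_tokens, latent)
    ]
    attn = softmax(scores)
    return attn, sum(attn[i] for i in region)


def optimize_latent(query, visual_tokens, region, steps=50, lr=1.0):
    """Gradient descent on the toy energy E = (1 - region_mass)^2.
    Only the latent changes; the 'model' (query, tokens) stays frozen."""
    n, d = len(visual_tokens), len(query)
    scale = math.sqrt(d)
    latent = [[0.0] * d for _ in range(n)]
    for _ in range(steps):
        attn, mass = region_attention(query, visual_tokens, latent, region)
        for j in range(n):
            in_region = 1.0 if j in region else 0.0
            # dE/ds_j for score s_j, via the softmax Jacobian:
            # d(mass)/ds_j = attn[j] * (in_region - mass)
            d_energy = -2.0 * (1.0 - mass) * attn[j] * (in_region - mass)
            for k in range(d):
                # ds_j/dlatent[j][k] = query[k] / sqrt(d)
                latent[j][k] -= lr * d_energy * query[k] / scale
    return latent


random.seed(0)
dim, num_tokens = 8, 16                  # e.g. a flattened 4x4 token grid
query = [random.gauss(0, 1) for _ in range(dim)]
visual_tokens = [[random.gauss(0, 1) for _ in range(dim)]
                 for _ in range(num_tokens)]
region = {5, 6, 9, 10}                   # token indices covered by a box

zero_latent = [[0.0] * dim for _ in range(num_tokens)]
_, mass_before = region_attention(query, visual_tokens, zero_latent, region)
latent = optimize_latent(query, visual_tokens, region)
_, mass_after = region_attention(query, visual_tokens, latent, region)
print(f"region attention mass: {mass_before:.3f} -> {mass_after:.3f}")
```

In this toy setting each update raises the scores of in-region tokens and lowers the rest, so the region's attention mass can only grow. The paper's setting differs in scale (real attention maps from the MLLM, visual tokens taken from the MLP output), but the control loop has the same shape: freeze the model, optimize only the latent at test time.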

Code & Models

Repositories

mrwu-mac/controlmllm
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Softmax · Attention Is All You Need