LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension

Amaia Cardiel; Eloi Zablocki; Elias Ramzi; Oriane Siméoni; Matthieu Cord

arXiv:2409.11919 · cs.CV · March 7, 2025

TL;DR

This paper introduces LLM-wrapper, a novel black-box method that leverages large language models to adapt vision-language models for referring expression comprehension without requiring access to their internal parameters.

Contribution

It presents a versatile approach that enables black-box adaptation of VLMs using LLMs, improving performance on REC tasks across various datasets and models.

Findings

1. Significant performance improvements on multiple datasets.
2. Versatile method applicable to any VLM and LLM.
3. Enables ensemble VLM adaptation without internal model access.

Abstract

Vision Language Models (VLMs) have demonstrated remarkable capabilities in various open-vocabulary tasks, yet their zero-shot performance lags behind task-specific fine-tuned models, particularly in complex tasks like Referring Expression Comprehension (REC). Fine-tuning usually requires 'white-box' access to the model's architecture and weights, which is not always feasible due to proprietary or privacy concerns. In this work, we propose LLM-wrapper, a method for 'black-box' adaptation of VLMs for the REC task using Large Language Models (LLMs). LLM-wrapper capitalizes on the reasoning abilities of LLMs, improved with a light fine-tuning, to select the most relevant bounding box matching the referring expression, from candidates generated by a zero-shot black-box VLM. Our approach offers several advantages: it enables the adaptation of closed-source models without needing access to…
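The selection step described in the abstract (an LLM picks one bounding box from the candidates proposed by a black-box VLM) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Box` class, the prompt wording, the `build_prompt` and `select_box` helpers, and the fallback to the highest-scoring box are all assumptions made for the example, and `llm` stands in for any callable that maps a text prompt to a text reply.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Box:
    """One candidate detection returned by the black-box VLM (hypothetical schema)."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


def build_prompt(expression: str, boxes: List[Box]) -> str:
    """Serialize the referring expression and the candidate boxes as plain text."""
    lines = [f"Referring expression: {expression}", "Candidate boxes:"]
    for i, b in enumerate(boxes):
        lines.append(
            f"{i}: {b.label} [{b.x1:.0f}, {b.y1:.0f}, {b.x2:.0f}, {b.y2:.0f}] "
            f"score={b.score:.2f}"
        )
    lines.append("Answer with the index of the best matching box.")
    return "\n".join(lines)


def select_box(expression: str, boxes: List[Box], llm: Callable[[str], str]) -> Box:
    """Ask the LLM for an index; fall back to the top-scoring box if the reply is unusable."""
    if not boxes:
        raise ValueError("the VLM returned no candidate boxes")
    reply = llm(build_prompt(expression, boxes))
    match = re.search(r"\d+", reply)  # first integer in the reply, if any
    index = int(match.group()) if match else -1
    if not 0 <= index < len(boxes):
        # Assumed fallback: trust the VLM's own confidence score.
        return max(boxes, key=lambda b: b.score)
    return boxes[index]


# Usage with a stub in place of a real (fine-tuned) LLM.
candidates = [
    Box("dog", 0, 0, 10, 10, 0.9),
    Box("dog", 20, 0, 30, 10, 0.5),
]
chosen = select_box("the dog on the right", candidates, lambda prompt: "1")
```

Because the VLM is only queried for its outputs and the LLM only sees text, neither step in this sketch needs access to the VLM's weights, which is the property the abstract calls black-box adaptation.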

Code & Models

Repositories

valeoai/LLM_wrapper (PyTorch, official)

Videos

LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques