ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models

Jingwei Yi; Junhao Yin; Ju Xu; Peng Bao; Yongliang Wang; Wei Fan; Hao Wang

arXiv:2501.12418 · cs.CV · January 23, 2025

Open Access · 1 Repo

TL;DR

This paper introduces ImageRef-VL, a fine-tuning method that improves the ability of open-source vision-language models to reference contextually relevant images from retrieved documents during conversations, a capability that current multimodal chatbots largely lack.

Contribution

The paper presents the first evaluation dataset and metrics for contextual image referencing, and proposes ImageRef-VL, a fine-tuning approach that enhances open-source VLMs' referencing capabilities.

Findings

01. ImageRef-VL outperforms proprietary models in referencing accuracy.

02. ImageRef-VL achieves an 88% performance improvement over state-of-the-art open-source VLMs.

03. The paper presents the first systematic evaluation of contextual image referencing in VLMs.

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding multimodal inputs and have been widely integrated into Retrieval-Augmented Generation (RAG) based conversational systems. While current VLM-powered chatbots can provide textual source references in their responses, they exhibit significant limitations in referencing contextually relevant images during conversations. In this paper, we introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context -- and systematically investigate VLMs' capability in this aspect. We conduct the first evaluation for contextual image referencing, comprising a dedicated testing dataset and evaluation metrics. Furthermore, we propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities…

Code & Models

Repositories

bytedance/imageref-vl (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques