ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models
Jingwei Yi, Junhao Yin, Ju Xu, Peng Bao, Yongliang Wang, Wei Fan, Hao Wang

TL;DR
This paper introduces ImageRef-VL, a method that significantly improves the ability of vision-language models to reference relevant images in conversations, addressing a key limitation in current multimodal chatbots.
Contribution
The paper presents the first evaluation dataset and metrics for contextual image referencing, and proposes ImageRef-VL, a fine-tuning approach that enhances open-source VLMs' referencing capabilities.
Findings
ImageRef-VL outperforms proprietary models in referencing accuracy.
ImageRef-VL achieves an 88% performance improvement over state-of-the-art open-source VLMs in contextual image referencing.
The paper provides the first systematic evaluation of contextual image referencing in VLMs.
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding multimodal inputs and have been widely integrated into Retrieval-Augmented Generation (RAG) based conversational systems. While current VLM-powered chatbots can provide textual source references in their responses, they exhibit significant limitations in referencing contextually relevant images during conversations. In this paper, we introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context -- and systematically investigate VLMs' capability in this aspect. We conduct the first evaluation for contextual image referencing, comprising a dedicated testing dataset and evaluation metrics. Furthermore, we propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
