SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information

Jiashuo Sun; Jihai Zhang; Yucheng Zhou; Zhaochen Su; Xiaoye Qu; Yu Cheng

arXiv:2409.14083·cs.CV·September 24, 2024

TL;DR

This paper introduces SURf, a self-refinement framework that trains large vision-language models to selectively utilize relevant retrieved information, significantly improving their accuracy and robustness across multiple tasks and datasets.

Contribution

The paper presents a novel self-refinement approach that teaches LVLMs to distinguish and use relevant references, addressing limitations of previous methods.

Findings

01. Enhanced LVLM performance across three tasks and seven datasets.
02. Improved robustness against irrelevant or misleading references.
03. Effective fine-tuning method for selective information utilization.

Abstract

Large Vision-Language Models (LVLMs) have become pivotal at the intersection of computer vision and natural language processing. However, the full potential of LVLMs' Retrieval-Augmented Generation (RAG) capabilities remains underutilized. Existing works either focus solely on the text modality or are limited to specific tasks. Moreover, most LVLMs struggle to selectively utilize retrieved information and are sensitive to irrelevant or misleading references. To address these challenges, we propose a self-refinement framework designed to teach LVLMs to Selectively Utilize Retrieved Information (SURf). Specifically, when given questions that are incorrectly answered by the LVLM backbone, we obtain references that help correct the answers (positive references) and those that do not (negative references). We then fine-tune the LVLM backbone using a combination of these positive and negative…
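
The reference-labeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `split_references`, the `answer_fn` callable, the exact-match correctness check, and the toy model are all hypothetical names and assumptions introduced here, and image inputs are omitted for brevity. The sketch covers only the split into positive and negative references, not the fine-tuning that follows.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface (an assumption, not from the paper):
# (question, optional reference text) -> answer string.
AnswerFn = Callable[[str, Optional[str]], str]


def split_references(
    answer_fn: AnswerFn,
    question: str,
    gold: str,
    references: List[str],
) -> Tuple[List[str], List[str]]:
    """Label candidate references as positive or negative for one question.

    Returns (positives, negatives). Both lists are empty when the model
    already answers correctly without any reference, since the abstract
    only considers questions the backbone gets wrong.
    """

    def is_correct(pred: str) -> bool:
        # Assumption: case-insensitive exact match stands in for
        # whatever correctness criterion is actually used.
        return pred.strip().lower() == gold.strip().lower()

    if is_correct(answer_fn(question, None)):
        return [], []

    positives: List[str] = []
    negatives: List[str] = []
    for ref in references:
        # A reference is positive if it corrects the answer, negative otherwise.
        if is_correct(answer_fn(question, ref)):
            positives.append(ref)
        else:
            negatives.append(ref)
    return positives, negatives


def toy_model(question: str, reference: Optional[str] = None) -> str:
    """Toy stand-in for an LVLM: only answers 'Paris' if the reference says so."""
    if reference and "Paris" in reference:
        return "Paris"
    return "Lyon"


pos, neg = split_references(
    toy_model,
    "What is the capital of France?",
    "Paris",
    ["Paris is the capital of France.", "Lyon is a city in France."],
)
```

In this toy run the first reference lands in `pos` and the second in `neg`; a real pipeline would replace `toy_model` with calls to the LVLM backbone and pass the image alongside the question.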

Code & Models

Repositories

gasolsun36/surf (PyTorch, official)

Videos

SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Focus