RAVEN: Multitask Retrieval Augmented Vision-Language Learning

Varun Nagaraj Rao; Siddharth Choudhary; Aditya Deshpande; Ravi Kumar Satzoda; Srikar Appalaraju

arXiv:2406.19150·cs.CV·June 28, 2024·1 cites

Open Access

TL;DR

RAVEN is a multitask retrieval-augmented vision-language framework that improves captioning and VQA performance over non-retrieval baselines through efficient fine-tuning, without adding retrieval-specific parameters.

Contribution

It presents a novel multitask retrieval-augmented VLM framework that enhances existing models through efficient fine-tuning without extra retrieval-specific parameters.

Findings

01 +1 CIDEr on MSCOCO

02 +4 CIDEr on NoCaps

03 Nearly +3% accuracy on specific VQA question types

Abstract

The scaling of large language models to encode all the world's knowledge in model parameters is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored. Existing methods focus on models designed for single tasks. Furthermore, they are limited by the need for resource-intensive pre-training, additional parameter requirements, unaddressed modality prioritization, and a lack of clear benefit over non-retrieval baselines. This paper introduces RAVEN, a multitask retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning. By integrating retrieval-augmented samples without the need for additional retrieval-specific parameters, we show that the model acquires retrieval properties that are effective across multiple tasks.…
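To make the idea of "integrating retrieval-augmented samples without additional retrieval-specific parameters" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that retrieval is a nearest-neighbor lookup over precomputed embeddings and that the retrieved captions are simply concatenated to the text input of the base model. The function names (`retrieve_top_k`, `build_augmented_input`), the toy two-dimensional embeddings, the `context:` prefix, and the separator are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(
    query: Sequence[float],
    memory: List[Tuple[Sequence[float], str]],
    k: int,
) -> List[str]:
    """Return the captions of the k memory entries most similar to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


def build_augmented_input(task_prompt: str, retrieved: List[str], sep: str = " | ") -> str:
    """Concatenate retrieved captions to the task prompt (hypothetical format).

    No new model parameters are involved: the retrieved text only changes
    the input that the base model is fine-tuned on.
    """
    if not retrieved:
        return task_prompt
    return f"{task_prompt} context: {sep.join(retrieved)}"


# Toy external memory of (embedding, caption) pairs; values are made up.
memory = [
    ([1.0, 0.0], "a dog running on a beach"),
    ([0.0, 1.0], "a plate of pasta on a table"),
    ([0.9, 0.1], "a puppy playing in the sand"),
]

query_embedding = [1.0, 0.05]  # stand-in for the embedding of the query image
captions = retrieve_top_k(query_embedding, memory, k=2)
augmented = build_augmented_input("describe the image", captions)
print(augmented)
```

In this sketch the retriever stays fixed and only the model input changes, which is one way to read the "no retrieval-specific parameters" claim; the actual retriever, memory, and input format used in the paper may differ.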

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Weight Decay · WordPiece · Softmax · Layer Normalization · Linear Warmup With Linear Decay · Byte Pair Encoding · Attention Dropout · Dropout