LIVE: Learnable In-Context Vector for Visual Question Answering

Yingzhe Peng; Chenduo Hao; Xu Yang; Jiawei Peng; Xinting Hu; Xin Geng

arXiv:2406.13185 · cs.CL · November 1, 2024

TL;DR

The paper introduces LIVE, a learnable in-context vector that distills task information from in-context demonstrations, reducing computational cost and improving accuracy on visual question answering with multimodal models.

Contribution

LIVE is a novel learnable in-context vector that effectively distills task information, addressing challenges of efficiency and performance in multimodal in-context learning.

Findings

01. LIVE reduces inference time compared to traditional ICL.

02. LIVE improves accuracy on VQA tasks over non-learnable ICV methods.

03. LIVE demonstrates effectiveness in complex multimodal tasks.

Abstract

As language models continue to scale, Large Language Models (LLMs) have exhibited emergent In-Context Learning (ICL) capabilities, enabling them to solve language tasks when a few in-context demonstrations (ICDs) are prefixed as context. Inspired by these advances, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, applying ICL usually faces two major challenges: 1) using more ICDs substantially increases inference time, and 2) performance is sensitive to the selection of ICDs. These challenges are further exacerbated in LMMs due to the integration of multiple data types and the combinatorial complexity of multimodal ICDs. Recently, to address these challenges, some NLP studies have introduced non-learnable In-Context Vectors (ICVs), which extract useful task information from ICDs into a single vector and then insert it…
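The two ideas in the abstract, that prompt cost grows with the number of ICDs and that task information can be compressed into a single vector, can be illustrated with a toy sketch. This is not the paper's method: the abstract is cut off before it says where the vector is inserted, and LIVE learns its vector rather than averaging. The function names, the mean-difference extraction, and the additive shift with strength `alpha` are all assumptions made for illustration.

```python
from statistics import fmean


def icl_prompt_length(query_tokens, icd_tokens):
    """Tokens processed when every demonstration is prefixed to the query."""
    return query_tokens + sum(icd_tokens)


def mean_shift_vector(with_demos, without_demos):
    """Toy non-learnable vector (assumed form): the average difference
    between paired hidden states computed with and without demonstrations."""
    diffs = [
        [a - b for a, b in zip(h_with, h_without)]
        for h_with, h_without in zip(with_demos, without_demos)
    ]
    # Average each dimension across all pairs.
    return [fmean(column) for column in zip(*diffs)]


def apply_shift(hidden, vector, alpha=1.0):
    """Add the vector to a query's hidden state (assumed insertion rule)."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Eight demonstrations of 100 tokens each dominate a 20-token query.
few_shot_cost = icl_prompt_length(20, [100] * 8)
vector_cost = icl_prompt_length(20, [])  # the vector replaces the prefix

# Hypothetical 2-dimensional hidden states for two demonstration pairs.
shift = mean_shift_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [1.0, 1.0]])
shifted_query = apply_shift([0.0, 0.0], shift, alpha=0.5)
```

In this sketch the few-shot prompt processes 820 tokens while the vector-based query processes 20, which is the efficiency argument in miniature; whether a fixed averaged vector carries enough task information is the gap a learnable vector is meant to close.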

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning