Video Enriched Retrieval Augmented Generation Using Aligned Video Captions

Kevin Dela Rosa

arXiv:2405.17706 · cs.AI · May 29, 2024

TL;DR

This paper introduces aligned visual captions as a textual representation of video content to improve retrieval augmented generation chat systems, enabling efficient and adaptable integration of video information into large language models.

Contribution

It proposes using aligned visual captions for video content, creating a new dataset, and establishing evaluation procedures for RAG tasks involving videos.

Findings

01

Aligned captions effectively summarize video content for LLMs.

02

Captions reduce the need for extensive multimedia data in prompts.

03

The dataset and evaluation methods facilitate future research in video RAG.

Abstract

In this work, we propose the use of "aligned visual captions" as a mechanism for integrating information contained within videos into retrieval augmented generation (RAG) based chat assistant systems. These captions describe the visual and audio content of videos in a large corpus in a textual format that is easy to reason about and to incorporate into large language model (LLM) prompts. They also typically require less multimedia content to be inserted into the multimodal LLM context window, which typical configurations can fill up aggressively by sampling video frames from the source video. Furthermore, visual captions can be adapted to specific use cases by prompting the original foundational model / captioner for particular visual details, or by fine tuning. In hopes of helping advance progress in this area, we curate a…
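To make the idea in the abstract concrete, here is a minimal Python sketch of how text-only aligned captions could be retrieved and placed into an LLM prompt. It is an illustration under stated assumptions, not the paper's implementation: the `AlignedCaption` record, its field names, the sample corpus, and the prompt wording are all hypothetical, and the word-overlap scorer is a deliberately simple stand-in for whatever retriever a real system would use.

```python
from dataclasses import dataclass


@dataclass
class AlignedCaption:
    """Hypothetical record: one video segment with its audio and visual text."""
    video_id: str
    start_s: float
    end_s: float
    transcript: str       # what is said in the segment
    visual_caption: str   # what a captioner says is shown in the segment

    def as_text(self) -> str:
        # Flatten the segment into one line of plain text for retrieval and prompting.
        return (f"[{self.video_id} {self.start_s:.0f}-{self.end_s:.0f}s] "
                f"AUDIO: {self.transcript} | VISUAL: {self.visual_caption}")


def overlap_score(query: str, doc: str) -> float:
    """Fraction of query words found in the document (stand-in for a real retriever)."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def retrieve(query: str, corpus: list[AlignedCaption], k: int = 2) -> list[AlignedCaption]:
    """Return the k segments whose flattened text best matches the query."""
    ranked = sorted(corpus, key=lambda c: overlap_score(query, c.as_text()), reverse=True)
    return ranked[:k]


def build_prompt(query: str, hits: list[AlignedCaption]) -> str:
    """Assemble a text-only prompt; no video frames are placed in the context window."""
    context = "\n".join(c.as_text() for c in hits)
    return ("Answer using only the video context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")


# Made-up sample data, for illustration only.
corpus = [
    AlignedCaption("vid_a", 0, 30, "now we knead the bread dough for ten minutes",
                   "hands pressing dough on a floured counter"),
    AlignedCaption("vid_b", 0, 30, "tighten the bike chain with a wrench",
                   "close-up of a bicycle rear wheel"),
    AlignedCaption("vid_c", 0, 30, "how to change a tire",
                   "car lifted on a jack"),
]

prompt = build_prompt("how to knead bread dough", retrieve("how to knead bread dough", corpus))
```

Because each segment is reduced to a line of text, the prompt grows with the number of retrieved segments rather than with the number of sampled frames, which is the context-window advantage the abstract describes.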

Code & Models

Repositories

kdr/videorag-mrr2024
Official

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · WordPiece · Linear Warmup With Linear Decay · Weight Decay · Attention Dropout · Linear Layer · Byte Pair Encoding · Adam · Residual Connection