Video Enriched Retrieval Augmented Generation Using Aligned Video Captions
Kevin Dela Rosa

TL;DR
This paper introduces aligned visual captions as a textual representation of video content to improve retrieval augmented generation chat systems, enabling efficient and adaptable integration of video information into large language models.
Contribution
The paper proposes aligned visual captions as a textual representation of video content, curates a new dataset, and establishes evaluation procedures for RAG tasks involving videos.
Findings
Aligned captions effectively summarize video content for LLMs.
Captions reduce the need for extensive multimedia data in prompts.
The dataset and evaluation methods facilitate future research in video RAG.
Abstract
In this work, we propose the use of "aligned visual captions" as a mechanism for integrating information contained within videos into retrieval augmented generation (RAG) based chat assistant systems. These captions describe the visual and audio content of videos in a large corpus, and their textual format is both easy to reason about and easy to incorporate into large language model (LLM) prompts. They also typically require less multimedia content to be inserted into the multimodal LLM context window, whereas typical configurations can aggressively fill up the context window by sampling video frames from the source video. Furthermore, visual captions can be adapted to specific use cases by prompting the original foundational model / captioner for particular visual details or by fine tuning. In hopes of helping advance progress in this area, we curate a…
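The idea in the abstract, retrieving time-aligned text about a video and placing it in an LLM prompt instead of sampled frames, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the paper's implementation: the `CaptionSegment` schema, the demo data, and the prompt wording are hypothetical, and simple keyword overlap stands in for whatever retriever a real system would use (typically embedding similarity).

```python
from dataclasses import dataclass


@dataclass
class CaptionSegment:
    """One time-aligned chunk of text describing a video (hypothetical schema)."""
    video_id: str
    start_s: float
    end_s: float
    visual_caption: str  # what is shown, e.g. produced by a captioning model
    transcript: str      # what is said, e.g. produced by speech recognition

    def as_text(self) -> str:
        # Flatten the segment into a single line that fits in a text prompt.
        return (f"[{self.video_id} {self.start_s:.0f}s-{self.end_s:.0f}s] "
                f"visual: {self.visual_caption} | audio: {self.transcript}")


def _tokens(text: str) -> set[str]:
    # Lowercased words with surrounding punctuation stripped.
    stripped = (w.strip(".,!?;:").lower() for w in text.split())
    return {w for w in stripped if w}


def retrieve(query: str, segments: list[CaptionSegment], k: int = 2) -> list[CaptionSegment]:
    # Assumption: keyword overlap is a stand-in for a real retriever.
    q = _tokens(query)
    ranked = sorted(segments, key=lambda s: len(q & _tokens(s.as_text())), reverse=True)
    return ranked[:k]


def build_prompt(query: str, segments: list[CaptionSegment], k: int = 2) -> str:
    # Only text enters the prompt; no video frames are inserted.
    context = "\n".join(s.as_text() for s in retrieve(query, segments, k))
    return ("Answer using only the video excerpts below.\n\n"
            f"{context}\n\nQuestion: {query}")


# Made-up demo data, not taken from the paper's dataset.
DEMO_SEGMENTS = [
    CaptionSegment("cook01", 0, 10, "a person whisks eggs in a bowl",
                   "first whisk the eggs until smooth"),
    CaptionSegment("bike02", 5, 15, "a mechanic removes a bicycle wheel",
                   "loosen the axle nuts"),
]

if __name__ == "__main__":
    print(build_prompt("how do I whisk eggs", DEMO_SEGMENTS, k=1))
```

Because each segment carries its video ID and time range, an answer built from this prompt can point back to the moment in the source video it came from.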
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · WordPiece · Linear Warmup With Linear Decay · Weight Decay · Attention Dropout · Linear Layer · Byte Pair Encoding · Adam · Residual Connection
