FastV-RAG: Towards Fast and Fine-Grained Video QA with Retrieval-Augmented Generation
Gen Li, Peiyu Liu

TL;DR
VideoSpeculateRAG is a retrieval-augmented video question answering framework that combines speculative decoding with similarity-based filtering of retrieved knowledge, giving roughly 2x faster inference with comparable or higher accuracy on knowledge-intensive tasks.
Contribution
The paper presents a new VLM-based RAG framework that reduces inference latency and improves answer accuracy through speculative decoding and entity filtering strategies.
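The draft-then-verify idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`speculate_then_verify`, `toy_draft`, `toy_verify`), the choice of one draft per retrieved document, and the word-overlap scoring are all assumptions made for illustration. In a real system `draft_fn` would call a lightweight VLM and `verify_fn` a heavyweight one, and the refinement step the paper mentions is omitted here.

```python
import re
from typing import Callable, List, Tuple


def speculate_then_verify(
    question: str,
    retrieved_docs: List[str],
    draft_fn: Callable[[str, str], str],
    verify_fn: Callable[[str, str, str], float],
) -> Tuple[str, float]:
    """Draft one candidate per retrieved document, then keep the best-scoring one.

    Assumption: one draft per document; the paper's exact scheme may differ.
    """
    if not retrieved_docs:
        raise ValueError("need at least one retrieved document")
    # Cheap step: the draft model proposes an answer for each document.
    candidates = [(draft_fn(question, doc), doc) for doc in retrieved_docs]
    # Expensive step: the verifier only scores candidates, it does not generate.
    scored = [(ans, verify_fn(question, ans, doc)) for ans, doc in candidates]
    return max(scored, key=lambda pair: pair[1])


def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def toy_draft(question: str, doc: str) -> str:
    # Hypothetical stand-in for a lightweight draft model: echo the first sentence.
    return doc.split(".")[0]


def toy_verify(question: str, answer: str, doc: str) -> float:
    # Hypothetical stand-in for a heavyweight verifier: question/answer word overlap.
    return float(len(_words(question) & _words(answer)))


docs = [
    "Leonardo da Vinci painted the Mona Lisa.",
    "The Louvre is a museum in Paris.",
]
best_answer, best_score = speculate_then_verify(
    "Who painted the Mona Lisa?", docs, toy_draft, toy_verify
)
print(best_answer, best_score)
```

The latency benefit in this pattern comes from the heavyweight model scoring short candidates rather than generating every answer token itself; the toy functions above only show the control flow, not the speedup.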
Findings
Achieves comparable or higher accuracy than standard RAG methods.
Speeds up inference by approximately 2x.
Effectively mitigates entity recognition errors in retrieved knowledge.
Abstract
Vision-Language Models (VLMs) excel at visual reasoning but still struggle with integrating external knowledge. Retrieval-Augmented Generation (RAG) is a promising solution, but current methods remain inefficient and often fail to maintain high answer quality. To address these challenges, we propose VideoSpeculateRAG, an efficient VLM-based RAG framework built on two key ideas. First, we introduce a speculative decoding pipeline: a lightweight draft model quickly generates multiple answer candidates, which are then verified and refined by a more accurate heavyweight model, substantially reducing inference latency without sacrificing correctness. Second, we identify a major source of error - incorrect entity recognition in retrieved knowledge - and mitigate it with a simple yet effective similarity-based filtering strategy that improves entity alignment and boosts overall answer…
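The abstract does not spell out how the similarity-based filter is computed, so the following is only a sketch under stated assumptions: each retrieved knowledge entry and the query are represented by embedding vectors, entries are scored by cosine similarity against the query, and entries below a threshold are dropped. The function names, the 0.5 threshold, and the two-dimensional toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    query_vec: Sequence[float],
    entries: List[Tuple[str, Sequence[float]]],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[str]:
    """Keep the names of entries whose embedding is close enough to the query."""
    return [name for name, vec in entries if cosine(query_vec, vec) >= threshold]


# Toy embeddings (hypothetical): the query is about the first entity.
query = [1.0, 0.0]
knowledge = [
    ("Eiffel Tower", [0.9, 0.1]),
    ("Tokyo Tower", [0.1, 0.9]),
]
kept = filter_by_similarity(query, knowledge)
print(kept)
```

With these toy vectors only the entry aligned with the query survives, which is the intended effect: knowledge retrieved for a misrecognized entity is discarded before it reaches the answer generator.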
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
