FastV-RAG: Towards Fast and Fine-Grained Video QA with Retrieval-Augmented Generation

Gen Li; Peiyu Liu

arXiv:2601.01513 · cs.CV · January 8, 2026

Open Access

TL;DR

VideoSpeculateRAG introduces a novel, efficient retrieval-augmented video question answering framework that combines speculative decoding with similarity filtering, significantly improving speed and accuracy in knowledge-intensive tasks.

Contribution

The paper presents a new VLM-based RAG framework that reduces inference latency and improves answer accuracy through speculative decoding and entity filtering strategies.

Findings

1. Achieves comparable or higher accuracy than standard RAG methods.

2. Speeds up inference by approximately 2x.

3. Effectively mitigates entity recognition errors in retrieved knowledge.
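
The third finding concerns filtering retrieved knowledge whose entity does not match the one recognised in the video. The listing does not say which similarity measure the paper uses, so the following is only a minimal sketch under stated assumptions: a bag-of-words cosine similarity stands in for the real measure, and the names `cosine_similarity`, `filter_retrieved`, the `"entity"` key, and the `0.5` threshold are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings.

    Assumption: a stand-in for whatever similarity the paper actually uses.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def filter_retrieved(
    entity: str, retrieved: list[dict], threshold: float = 0.5
) -> list[dict]:
    """Keep only retrieved entries whose entity name is close to `entity`.

    `threshold` is a hypothetical value chosen for illustration.
    """
    return [
        item
        for item in retrieved
        if cosine_similarity(entity, item["entity"]) >= threshold
    ]


if __name__ == "__main__":
    # Toy data, invented purely for the example.
    retrieved = [
        {"entity": "Golden Gate Bridge", "text": "A suspension bridge."},
        {"entity": "Brooklyn Bridge", "text": "A hybrid cable-stayed bridge."},
        {"entity": "Eiffel Tower", "text": "A wrought-iron lattice tower."},
    ]
    for item in filter_retrieved("golden gate bridge", retrieved):
        print(item["entity"])
```

In practice an embedding-based similarity would likely replace the word-overlap measure; the filtering step itself stays the same: score each retrieved entry against the recognised entity and drop entries below a cut-off.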

Abstract

Vision-Language Models (VLMs) excel at visual reasoning but still struggle with integrating external knowledge. Retrieval-Augmented Generation (RAG) is a promising solution, but current methods remain inefficient and often fail to maintain high answer quality. To address these challenges, we propose VideoSpeculateRAG, an efficient VLM-based RAG framework built on two key ideas. First, we introduce a speculative decoding pipeline: a lightweight draft model quickly generates multiple answer candidates, which are then verified and refined by a more accurate heavyweight model, substantially reducing inference latency without sacrificing correctness. Second, we identify a major source of error, incorrect entity recognition in retrieved knowledge, and mitigate it with a simple yet effective similarity-based filtering strategy that improves entity alignment and boosts overall answer…
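
The draft-then-verify idea described in the abstract can be sketched roughly as below. This is an illustration under assumptions, not the authors' implementation: the two models are passed in as plain callables, the `draft` and `score` signatures are hypothetical, one candidate is drafted per retrieved context, and the heavyweight model only scores candidates (the refinement step mentioned in the abstract is omitted).

```python
from typing import Callable, Sequence


def draft_then_verify(
    question: str,
    contexts: Sequence[str],
    draft: Callable[[str, str], str],
    score: Callable[[str, str, str], float],
) -> tuple[str, float]:
    """Draft candidates with a cheap model, then pick the best by verifier score.

    draft(question, context) -> candidate answer   (lightweight model, hypothetical)
    score(question, context, answer) -> confidence (heavyweight model, hypothetical)
    """
    if not contexts:
        raise ValueError("at least one retrieved context is required")

    # Cheap step: one candidate answer per retrieved context.
    candidates = [(ctx, draft(question, ctx)) for ctx in contexts]

    # Expensive step: the verifier scores existing candidates instead of
    # generating a full answer from scratch for every context.
    scored = [(score(question, ctx, ans), ans) for ctx, ans in candidates]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score


# Toy stand-ins so the sketch runs without any model; both are invented.
def toy_draft(question: str, context: str) -> str:
    # Pretend the draft model answers with the first sentence of the context.
    return context.split(".")[0]


def toy_score(question: str, context: str, answer: str) -> float:
    # Pretend verifier: fraction of answer words that also occur in the question.
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / max(len(a_words), 1)


if __name__ == "__main__":
    answer, confidence = draft_then_verify(
        "who directed the film",
        ["The film was directed by Jane Doe. It won awards.", "Weather was sunny."],
        toy_draft,
        toy_score,
    )
    print(answer, round(confidence, 3))
```

The latency benefit in such a design would come from the heavyweight model scoring short candidates rather than decoding long answers itself; whether the paper's verifier works exactly this way is not stated in this listing.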

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning