FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA
Nobin Sarwar

TL;DR
FilterRAG is a retrieval-augmented framework that significantly reduces hallucinations and enhances robustness in Visual Question Answering by grounding answers in external knowledge sources.
Contribution
The paper introduces FilterRAG, an approach that combines BLIP-VQA with retrieval-augmented generation to improve the factual accuracy of VQA models.
Findings
Achieves 36.5% accuracy on the OK-VQA dataset.
Reduces hallucinations in knowledge-driven VQA.
Improves robustness in Out-of-Distribution scenarios.
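For context on the accuracy figure: this summary does not say how accuracy is computed. VQA benchmarks are commonly scored with a soft accuracy in which a predicted answer earns full credit when at least three human annotators gave the same answer. The sketch below shows that common metric under the assumption that it applies here; the function names and the simple lowercase normalization are illustrative, not taken from the paper.

```python
def vqa_soft_accuracy(prediction, human_answers):
    """Common VQA soft accuracy: min(1, matching annotators / 3).

    Assumption: answers are compared after trimming and lowercasing only;
    official evaluation scripts apply more normalization than this.
    """
    pred = prediction.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == pred)
    return min(1.0, matches / 3.0)


def mean_accuracy(predictions, references):
    """Average the per-question soft accuracy over a dataset."""
    scores = [vqa_soft_accuracy(p, refs) for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical annotations for two questions (not from OK-VQA).
references = [
    ["tennis"] * 2 + ["badminton"] * 8,
    ["banana"] * 10,
]
predictions = ["tennis", "Banana"]
dataset_score = mean_accuracy(predictions, references)
```

Under this metric a partially agreed answer earns partial credit, so a dataset-level score is an average of per-question values between 0 and 1 rather than a strict exact-match rate.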
Abstract
Visual Question Answering requires models to generate accurate answers by integrating visual and textual understanding. However, VQA models still struggle with hallucinations, producing convincing but incorrect answers, particularly in knowledge-driven and Out-of-Distribution scenarios. We introduce FilterRAG, a retrieval-augmented framework that combines BLIP-VQA with Retrieval-Augmented Generation to ground answers in external knowledge sources like Wikipedia and DBpedia. FilterRAG achieves 36.5% accuracy on the OK-VQA dataset, demonstrating its effectiveness in reducing hallucinations and improving robustness in both in-domain and Out-of-Distribution settings. These findings highlight the potential of FilterRAG to improve Visual Question Answering systems for real-world deployment.
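The retrieve-then-answer idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the in-memory `KNOWLEDGE` list stands in for sources such as Wikipedia and DBpedia, the bag-of-words cosine retriever stands in for whatever retriever the paper uses, and `visual_hint` stands in for the output of a vision-language model such as BLIP-VQA. It is not the authors' implementation.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for an external knowledge source.
KNOWLEDGE = [
    "A fire hydrant is a connection point where firefighters tap into a water supply.",
    "The banana is an elongated edible fruit produced by plants in the genus Musa.",
    "A tennis racket is used to strike the ball in the sport of tennis.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=1):
    """Return the k passages most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        passages,
        key=lambda p: cosine(query_vec, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, visual_hint, passages):
    """Assemble a grounded prompt for a downstream answer generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"Context:\n{context}\n"
        f"Image content: {visual_hint}\n"
        f"Question: {question}\n"
        "Answer:"
    )


question = "What sport is this object used for?"
visual_hint = "tennis racket"  # assumed output of a vision-language model
top_passages = retrieve(f"{question} {visual_hint}", KNOWLEDGE, k=1)
prompt = build_prompt(question, visual_hint, top_passages)
print(prompt)
```

Placing the retrieved passage in the prompt lets the answer generator draw on stated facts instead of relying only on what the model memorized, which is the grounding mechanism the abstract credits for reducing hallucinations.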
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Graph Neural Networks
