Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG
Baiyu Chen, Wilson Wongso, Xiaoqian Hu, Yue Tan, Flora Salim

TL;DR
This paper introduces a multi-stage, verification-centric framework for multi-modal RAG systems that is designed to reduce hallucinations and improve factual accuracy; it placed third in the KDD Cup 2025 Meta CRAG-MM challenge.
Contribution
The paper presents a multi-stage framework that combines a lightweight query router, query-aware retrieval and summarization, dual-pathway generation, and post-hoc verification to mitigate hallucinations in multi-modal RAG systems.
Findings
Achieved 3rd place in KDD Cup 2025 Meta CRAG-MM challenge.
Demonstrated reduced hallucinations and improved factual accuracy.
Validated effectiveness of multi-stage verification in complex multi-modal tasks.
Abstract
This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM) challenge. The challenge aims to address a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate, especially when faced with egocentric imagery, long-tail entities, and complex, multi-hop questions. This issue is particularly problematic in real-world applications where users pose fact-seeking queries that demand high factual accuracy across diverse modalities. To tackle this, we propose a robust, multi-stage framework that prioritizes factual accuracy and truthfulness over completeness. Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, dual-pathway generation, and post-hoc verification. This conservative strategy is…
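The stages named in the abstract can be pictured as a short control flow. The sketch below is only an illustration under stated assumptions, not the CRUISE team's implementation: the function `answer_query`, the callable signatures, the toy stand-ins, and the abstention string are all hypothetical, and reading "dual-pathway" as one retrieval-grounded draft plus one direct draft is a guess rather than something the summary confirms. The part that follows the source is the ordering (router, retrieval, summarization, generation, verification) and the conservative preference for abstaining over giving an unverified answer.

```python
from typing import Callable, List, Optional, Tuple

# Assumed abstention string; the summary does not say how the system declines.
REFUSAL = "I don't know"


def answer_query(
    query: str,
    route: Callable[[str], bool],
    retrieve: Callable[[str], List[str]],
    summarize: Callable[[str, List[str]], str],
    generate: Callable[[str, Optional[str]], str],
    verify: Callable[[str, str, Optional[str]], bool],
) -> str:
    """Router -> retrieval -> summarization -> generation -> verification."""
    context: Optional[str] = None
    if route(query):  # the router decides whether retrieval is worth its cost
        passages = retrieve(query)
        if passages:
            context = summarize(query, passages)  # query-aware condensation

    # Two candidate drafts (assumed reading of "dual-pathway"):
    # a grounded draft when context exists, then a direct draft.
    candidates: List[Tuple[str, Optional[str]]] = []
    if context is not None:
        candidates.append((generate(query, context), context))
    candidates.append((generate(query, None), None))

    # Post-hoc verification: return the first draft that passes, else abstain.
    for draft, evidence in candidates:
        if verify(query, draft, evidence):
            return draft
    return REFUSAL


# Toy stand-ins (hypothetical) so the sketch runs without any model.
TOY_FACTS = {"acme tower": "Acme Tower is 100 metres tall."}


def toy_route(query: str) -> bool:
    return "how tall" in query.lower()


def toy_retrieve(query: str) -> List[str]:
    return [text for key, text in TOY_FACTS.items() if key in query.lower()]


def toy_summarize(query: str, passages: List[str]) -> str:
    return " ".join(passages)


def toy_generate(query: str, context: Optional[str]) -> str:
    # Without context the toy generator guesses, mimicking a hallucination.
    return context if context else "Probably about 50 metres."


def toy_verify(query: str, draft: str, evidence: Optional[str]) -> bool:
    # Accept only drafts that are literally supported by the evidence.
    return evidence is not None and draft in evidence


def toy_answer(query: str) -> str:
    return answer_query(
        query, toy_route, toy_retrieve, toy_summarize, toy_generate, toy_verify
    )
```

With these stand-ins, a question about the known toy entity returns the supported sentence, while a question about an unknown entity yields the abstention string because the unsupported guess fails verification. In a real system each callable would wrap a model or search call; the point here is only that verification sits last and can veto every draft.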
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
