RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations

I-Hsiang Chen; Yu-Wei Liu; Tse-Yu Wu; Yu-Chien Chiang; Jen-Chien Yang; Wei-Ting Chen

arXiv:2602.22013 · cs.CV · March 27, 2026

Open Access

TL;DR

RobustVisRAG is a causality-guided framework that makes vision-based retrieval-augmented generation (VisRAG) models more robust to visual degradations by separating semantics from distortions. It is validated on a new benchmark of degraded inputs.

Contribution

The paper introduces RobustVisRAG, a causality-aware dual-path approach with new training objectives and a large-scale degraded dataset, improving robustness in both visual retrieval and generation.

Findings

1. Improves retrieval, generation, and end-to-end performance under visual degradations.
2. Maintains accuracy on clean inputs.
3. Introduces the Distortion-VisRAG benchmark with diverse real-world distortions.

Abstract

Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion…
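
The abstract does not spell out how the unidirectional attention is built, so the following is only a minimal sketch of one way such a mask could work, not the authors' implementation. It assumes a set of patch tokens plus some extra "degradation" tokens: the extra tokens may read from the patches, but the patches never read from them, so information flows in one direction. The function names `unidirectional_mask` and `masked_attention`, the token layout, and the single-head pure-Python attention are all hypothetical and chosen for illustration.

```python
import math


def unidirectional_mask(n_patch, n_extra):
    """Return allow[i][j] = True if query token i may attend to key token j.

    Assumed layout (hypothetical): tokens 0..n_patch-1 are patch tokens and
    the remaining n_extra tokens are degradation tokens. Patch queries see
    only patch keys; degradation queries see every token.
    """
    n = n_patch + n_extra
    return [[(j < n_patch) or (i >= n_patch) for j in range(n)] for i in range(n)]


def masked_attention(q, k, v, allow):
    """Single-head scaled dot-product attention over lists of float vectors."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Disallowed positions get a score of -inf, i.e. zero weight after softmax.
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) if allow[i][j] else -math.inf
            for j, kj in enumerate(k)
        ]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] / z * v[j][c] for j in range(len(v))) for c in range(len(v[0]))])
    return out


# Two patch tokens followed by one degradation token.
allow = unidirectional_mask(2, 1)
q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_a = masked_attention(q, k, [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]], allow)
out_b = masked_attention(q, k, [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]], allow)
```

Under this mask, changing the degradation token's value alters only that token's own output; the patch outputs in `out_a` and `out_b` are identical. That is the property a one-way attention path is meant to provide in this sketch.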

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Digital Humanities and Scholarship