VISA: Retrieval Augmented Generation with Visual Source Attribution
Xueguang Ma, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Wenhu Chen, Jimmy Lin

TL;DR
VISA introduces a novel method combining answer generation with visual source attribution using vision-language models, enabling precise highlighting of evidence regions in documents for improved verifiability in retrieval-augmented generation systems.
Contribution
It proposes a new approach that integrates visual source attribution into RAG systems, utilizing large vision-language models to identify and highlight evidence regions in documents.
Findings
Demonstrates effective visual source attribution on Wikipedia and medical documents.
Highlights remaining challenges and areas for future improvement in visual evidence localization.
Contributes two curated datasets, Wiki-VISA and Paper-VISA, for evaluating visual source attribution in RAG systems.
Abstract
Generation with source attribution is important for enhancing the verifiability of retrieval-augmented generation (RAG) systems. However, existing approaches in RAG primarily link generated content to document-level references, making it challenging for users to locate evidence among multiple content-rich retrieved documents. To address this challenge, we propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution. Leveraging large vision-language models (VLMs), VISA identifies the evidence and highlights the exact regions that support the generated answers with bounding boxes in the retrieved document screenshots. To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the…
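To make the idea of box-level attribution concrete, here is a minimal sketch of how a predicted evidence box could be compared against a reference box on a document screenshot. It uses intersection-over-union (IoU), a common overlap measure for bounding boxes; the summary above does not state which metric the authors use, so treat the metric, the `Box` type, and the `iou` helper as illustrative assumptions rather than the paper's evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box in screenshot pixel coordinates.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner.
    """
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Clamp to zero so an inverted (empty) box has no area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(pred: Box, ref: Box) -> float:
    """Intersection-over-union of two boxes, in [0, 1]."""
    # The overlap region is itself a box; it is empty if the inputs are disjoint.
    overlap = Box(
        max(pred.x1, ref.x1),
        max(pred.y1, ref.y1),
        min(pred.x2, ref.x2),
        min(pred.y2, ref.y2),
    ).area()
    union = pred.area() + ref.area() - overlap
    return overlap / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = Box(5, 0, 15, 10)   # assumed model output
    reference = Box(0, 0, 10, 10)   # assumed annotated evidence region
    print(f"IoU = {iou(predicted, reference):.3f}")
```

A higher IoU means the highlighted region overlaps more closely with the annotated evidence; any threshold for counting a prediction as correct would be a separate design choice not specified here.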
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
Methods: Linear Layer · Multi-Head Attention · Residual Connection · Adam · Layer Normalization · Weight Decay · Softmax · WordPiece · Attention Dropout
