VISA: Retrieval Augmented Generation with Visual Source Attribution

Xueguang Ma; Shengyao Zhuang; Bevan Koopman; Guido Zuccon; Wenhu Chen; Jimmy Lin

arXiv:2412.14457·cs.IR·December 20, 2024

TL;DR

VISA introduces a novel method combining answer generation with visual source attribution using vision-language models, enabling precise highlighting of evidence regions in documents for improved verifiability in retrieval-augmented generation systems.

Contribution

The paper proposes an approach that integrates visual source attribution into RAG systems, using large vision-language models to identify and highlight the evidence regions in retrieved documents.

Findings

1. Effective visual source attribution demonstrated on Wikipedia and medical documents.

2. Highlights challenges and areas for future improvement in visual evidence localization.

3. Curated datasets for evaluating visual source attribution in RAG systems.

Abstract

Generation with source attribution is important for enhancing the verifiability of retrieval-augmented generation (RAG) systems. However, existing approaches in RAG primarily link generated content to document-level references, making it challenging for users to locate evidence among multiple content-rich retrieved documents. To address this challenge, we propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution. Leveraging large vision-language models (VLMs), VISA identifies the evidence and highlights the exact regions that support the generated answers with bounding boxes in the retrieved document screenshots. To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the…
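The idea of attributing an answer to a bounding box in a document screenshot can be illustrated with a small sketch. The `AttributedAnswer` record and the `iou` helper below are hypothetical names introduced here for illustration; they are not taken from the paper. Intersection-over-union is a common way to compare a predicted box with a reference box, and it is used here as an assumption, since the visible part of the abstract does not state which metric the authors use.

```python
from dataclasses import dataclass
from typing import Tuple

# Box layout is an assumption: (x_min, y_min, x_max, y_max) in screenshot pixels.
Box = Tuple[float, float, float, float]


@dataclass
class AttributedAnswer:
    """Hypothetical record pairing a generated answer with its visual evidence."""
    answer: str   # generated answer text
    doc_id: str   # identifier of the retrieved document screenshot
    box: Box      # region of the screenshot that supports the answer


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Illustrative values only, not data from the paper.
predicted = AttributedAnswer("example answer", "doc-1", (0.0, 0.0, 10.0, 10.0))
reference_box: Box = (5.0, 0.0, 15.0, 10.0)
overlap = iou(predicted.box, reference_box)
```

A score near 1 would mean the highlighted region closely matches the reference evidence region; a score of 0 means the two regions do not overlap at all.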

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: Linear Layer · Multi-Head Attention · Residual Connection · Adam · Layer Normalization · Weight Decay · Softmax · WordPiece · Attention Dropout