Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker

Rachna Saxena; Abhijeet Kumar; Suresh Shanmugam

arXiv:2507.12378 · cs.IR · July 17, 2025

TL;DR

This paper presents a scalable and efficient multimodal retrieval system for vision-augmented Q&A, combining hybrid search and late interaction re-ranking to improve speed and stability without sacrificing accuracy.

Contribution

It introduces a multi-step hybrid search approach with late interaction re-ranking to enhance large-scale multimodal retrieval for enterprise-grade Q&A systems.

Findings

01. Significant speed-up in the retrieval process.

02. Stable performance without quality degradation.

03. Suitable for production deployment in enterprises.

Abstract

Traditional information extraction systems built on text-only language models struggle because they ignore infographics (visual elements of information) such as tables, charts, and images, which are often used to convey complex information to readers. Multimodal LLMs (MLLMs) face a needle-in-a-haystack problem when the context is long or the search space contains a substantial number of documents. Late interaction mechanisms over visual language models have shown state-of-the-art performance in retrieval-based vision-augmented Q&A tasks. Yet a few challenges remain in using them for RAG-based multimodal Q&A. First, many popular and widely adopted vector databases do not support native multi-vector retrieval. Second, late interaction requires computation that inflates the space footprint and can hinder enterprise adoption. Lastly, the current state of late interaction…
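To make the retrieve-then-re-rank idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that each query and each page is represented by a small set of embedding vectors, that the first stage searches a single mean-pooled vector per page (one possible way to work with vector databases that lack multi-vector retrieval), and that the second stage applies a MaxSim-style late interaction score to the shortlisted pages only. The function names, the pooling choice, and the toy data are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    """Collapse a multi-vector embedding into one vector by averaging.
    (Assumed pooling choice; the abstract does not specify one.)"""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def maxsim(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late interaction score: for each query vector, take the similarity of
    its best-matching page vector, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def retrieve_then_rerank(
    query_vecs: Sequence[Vector],
    pages: Dict[str, Sequence[Vector]],
    first_stage_k: int,
    final_k: int,
) -> List[str]:
    """Two-step search: cheap pooled retrieval, then late interaction re-ranking."""
    # Stage 1: single-vector similarity over pooled embeddings.
    pooled_query = mean_pool(query_vecs)
    candidates = sorted(
        pages,
        key=lambda pid: dot(pooled_query, mean_pool(pages[pid])),
        reverse=True,
    )[:first_stage_k]
    # Stage 2: multi-vector MaxSim scoring on the shortlist only.
    reranked = sorted(
        candidates,
        key=lambda pid: maxsim(query_vecs, pages[pid]),
        reverse=True,
    )
    return reranked[:final_k]


# Toy 2-D data (hypothetical): page "b" wins the pooled stage,
# but page "a" matches the individual query vectors better.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[0.9, 0.9], [0.9, 0.9]],
    "c": [[-1.0, 0.0], [0.0, -1.0]],
}
top = retrieve_then_rerank(query, pages, first_stage_k=2, final_k=1)
```

In this toy case the pooled first stage ranks "b" ahead of "a", while the late interaction pass promotes "a" to the top. The expensive multi-vector comparison runs only on the `first_stage_k` shortlisted pages, which is where the speed-up in a design like this would come from.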

Code & Models

Repositories

abhijeet3922/vision-RAG (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Dropout · BERT · BART · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · RAG