Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker
Rachna Saxena, Abhijeet Kumar, Suresh Shanmugam

TL;DR
This paper presents a scalable and efficient multimodal retrieval system for vision-augmented Q&A, combining hybrid search and late interaction re-ranking to improve speed and stability without sacrificing accuracy.
Contribution
It introduces a multi-step hybrid search approach with late interaction re-ranking to enhance large-scale multimodal retrieval for enterprise-grade Q&A systems.
Findings
Significant speed-up in retrieval process.
Stable performance without quality degradation.
Suitable for production deployment in enterprises.
Abstract
Traditional information extraction systems built on text-only language models struggle because they ignore infographics (visual elements such as tables, charts, and images) that are often used to convey complex information to readers. Multimodal LLMs (MLLMs) face the needle-in-a-haystack problem, caused either by long context lengths or by a large number of documents in the search space. Late interaction mechanisms over visual language models have shown state-of-the-art performance in retrieval-based vision-augmented Q&A tasks. A few challenges remain, however, in using them for RAG-based multimodal Q&A. Firstly, many popular and widely adopted vector databases do not natively support multi-vector retrieval. Secondly, late interaction requires computation which inflates the space footprint and can hinder enterprise adoption. Lastly, the current state of late interaction…
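To make the retrieve-then-re-rank idea concrete, the following is a minimal sketch rather than the authors' implementation, since the abstract is truncated here and does not spell out the pipeline. It assumes that each query and each document page is represented by a small matrix of embedding vectors, that a cheap first stage compares mean-pooled single vectors (the pooling choice is an assumption made for illustration), and that the surviving candidates are re-ranked with the standard late interaction (MaxSim) score. The function names and toy data are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products act as cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction score: for every query vector, keep the similarity
    of its best-matching document vector, then sum over query vectors."""
    sim = query_vecs @ doc_vecs.T  # shape: (n_query_vecs, n_doc_vecs)
    return float(sim.max(axis=1).sum())


def two_stage_search(query_vecs, doc_multi_vecs, first_stage_k=2, final_k=1):
    """Stage 1: coarse search over mean-pooled single vectors (assumption).
    Stage 2: re-rank only the shortlisted documents with MaxSim."""
    pooled_query = l2_normalize(query_vecs.mean(axis=0))
    pooled_docs = l2_normalize(
        np.stack([d.mean(axis=0) for d in doc_multi_vecs])
    )
    coarse = pooled_docs @ pooled_query
    candidates = np.argsort(-coarse)[:first_stage_k].tolist()

    reranked = sorted(
        candidates,
        key=lambda i: maxsim_score(query_vecs, doc_multi_vecs[i]),
        reverse=True,
    )
    return reranked[:final_k]


# Toy data: 4-dimensional unit vectors standing in for real embeddings.
e = np.eye(4)
query = np.stack([e[0], e[1]])
docs = [
    np.stack([e[0], e[1]]),                    # matches both query vectors
    np.stack([e[2], e[3]]),                    # unrelated to the query
    np.stack([l2_normalize(e[0] + e[1])] * 2)  # partially matches each one
]

best = two_stage_search(query, docs, first_stage_k=2, final_k=1)
```

In this sketch only the first stage needs a vector index, and it operates on one vector per document, which is the kind of query that single-vector databases already support; the multi-vector MaxSim computation is confined to the shortlist.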
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Dropout · BERT · BART · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · RAG
