KIRA: Knowledge-Intensive Image Retrieval and Reasoning Architecture for Specialized Visual Domains
Parthaw Goswami, Jaynto Goswami Deep

TL;DR
KIRA is a framework for visual retrieval-augmented generation in specialized domains. It addresses multimodal knowledge base construction, multi-hop reasoning, and faithful answer grounding.
Contribution
The paper introduces a five-stage architecture with new techniques for semantic chunking, domain adaptation, cross-modal retrieval, multi-hop reasoning, and grounded generation, along with a new benchmark suite.
Findings
KIRA reports 0.97 retrieval precision and grounding scores of 1.0 across domains.
The framework is evaluated on medical, circuit, satellite, and histopathology images.
Ablation studies show how much each component contributes and what trade-offs it introduces.
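For readers unfamiliar with the retrieval metric, the following is a minimal sketch of precision at k, one common way to define "retrieval precision". The summary does not say which precision variant or cutoff the authors used, so the function name `precision_at_k`, the cutoff `k`, and the example IDs are all illustrative assumptions.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant.

    Assumption: this is a generic precision@k, not necessarily the
    exact metric reported in the paper.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    top_k = retrieved_ids[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Dividing by k (not len(top_k)) penalizes returning fewer than k items.
    return hits / k


# Hypothetical image IDs, for illustration only.
ranked = ["img_a", "img_b", "img_c", "img_d"]
gold = {"img_a", "img_c"}
score = precision_at_k(ranked, gold, k=4)
```

Here two of the four retrieved items are relevant, so `score` is 0.5.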
Abstract
Retrieval-augmented generation (RAG) has transformed text-based question answering, yet its extension to visual domains remains hindered by fundamental challenges: bridging the modality gap between image queries and text-heavy knowledge bases, constructing semantically meaningful visual knowledge bases, performing multi-hop reasoning over retrieved images, and verifying that generated answers are faithfully grounded in visual evidence. We present KIRA (Knowledge-Intensive Image Retrieval and Reasoning Architecture), a unified five-stage framework that addresses ten core problems in visual RAG for specialized domains. KIRA introduces: (1) hierarchical semantic chunking with DINO-based region detection for multi-granularity knowledge base construction, (2) domain-adaptive contrastive encoders with few-shot adaptation for rare visual concepts, (3) dual-path cross-modal retrieval with…
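The abstract mentions contrastive encoders but the visible text does not give their training objective. As a rough illustration of what a contrastive image-text objective can look like, here is a pure-Python sketch of a symmetric InfoNCE loss. This is a standard technique swapped in for illustration, not the authors' stated loss; `info_nce_loss`, the helper names, and the `temperature` default are assumptions.

```python
import math


def _normalize(vec):
    # L2-normalize so dot products become cosine similarities.
    # Assumes a non-zero vector.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    # Numerically stable -log softmax(logits)[target].
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def info_nce_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where row i of each list is a matched pair."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: scaled cosine similarity between image i and text j.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
           for img in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry.
    t2i = sum(_cross_entropy([sim[r][c] for r in range(n)], c)
              for c in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy embeddings: matched pairs should give a lower loss than mismatched ones.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy vectors the aligned batch yields a loss near zero while the swapped batch is penalized heavily, which is the behavior a contrastive encoder is trained to exploit.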
