KIRA: Knowledge-Intensive Image Retrieval and Reasoning Architecture for Specialized Visual Domains

Parthaw Goswami; Jaynto Goswami Deep

arXiv:2604.16915·cs.CV·April 27, 2026

TL;DR

KIRA is a five-stage framework for visual retrieval-augmented generation (RAG) in specialized domains. It addresses multimodal knowledge base construction, multi-hop reasoning, and faithful answer grounding.

Contribution

KIRA introduces a five-stage architecture with novel techniques for semantic chunking, domain adaptation, cross-modal retrieval, multi-hop reasoning, and grounded generation, plus a new benchmark suite.

Findings

01. Achieves 0.97 retrieval precision and 1.0 grounding scores across domains.

02. Demonstrates effectiveness on medical, circuit, satellite, and histopathology images.

03. Ablation studies reveal component contributions and tradeoffs.
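
This summary does not say how the retrieval precision in the first finding is computed. As a minimal sketch, assuming it is a precision@k style metric (the fraction of the top-k retrieved items that are relevant), the calculation could look like the following. The function name `precision_at_k` and the image identifiers are hypothetical and are not taken from the paper.

```python
from typing import Hashable, Iterable, Sequence


def precision_at_k(
    retrieved: Sequence[Hashable],
    relevant: Iterable[Hashable],
    k: int,
) -> float:
    """Fraction of the top-k retrieved items that appear in the relevant set.

    Assumption: this is the standard precision@k definition, which may
    differ from the exact metric used in the paper.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)
    top_k = retrieved[:k]
    # Count how many of the top-k results are labeled relevant.
    hits = sum(1 for item in top_k if item in relevant_set)
    # Divide by k (not len(top_k)) so short result lists are penalized.
    return hits / k


# Hypothetical ranked retrieval result and ground-truth labels.
ranked = ["img_07", "img_02", "img_11", "img_05"]
ground_truth = {"img_02", "img_05", "img_07"}

score_at_4 = precision_at_k(ranked, ground_truth, k=4)  # 3 of 4 are relevant
score_at_2 = precision_at_k(ranked, ground_truth, k=2)  # both top-2 are relevant
```

Under this definition, a reported value of 0.97 would mean that, averaged over queries, about 97% of the top-k retrieved items were relevant; whether the paper averages per query or per domain is not stated here.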

Abstract

Retrieval-augmented generation (RAG) has transformed text-based question answering, yet its extension to visual domains remains hindered by fundamental challenges: bridging the modality gap between image queries and text-heavy knowledge bases, constructing semantically meaningful visual knowledge bases, performing multi-hop reasoning over retrieved images, and verifying that generated answers are faithfully grounded in visual evidence. We present KIRA (Knowledge-Intensive Image Retrieval and Reasoning Architecture), a unified five-stage framework that addresses ten core problems in visual RAG for specialized domains. KIRA introduces: (1) hierarchical semantic chunking with DINO-based region detection for multi-granularity knowledge base construction, (2) domain-adaptive contrastive encoders with few-shot adaptation for rare visual concepts, (3) dual-path cross-modal retrieval with…
