AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning

Junxiao Xue; Quan Deng; Tingqi Hu; Meicong Si; Xinyi Yin; Yunyun Shi; Xuecheng Wu

arXiv:2604.17889 · cs.CV · April 21, 2026

TL;DR

AeroRAG is a scene-graph-guided multimodal retrieval-augmented generation framework that improves aerial visual question answering by converting each image into explicit structured visual knowledge for a text-based large language model to reason over.

Contribution

The paper presents AeroRAG, which converts an aerial image into structured visual knowledge (object categories, quantities, spatial locations, and semantic relations) and retrieves the query-relevant parts to build compact prompts for a text-based large language model.

Findings

01. Significant performance improvements on the AUG and VG-150 datasets.

02. The largest gains appear in dense aerial scenes and in relation-sensitive reasoning.

03. The framework remains compatible with standard visual reasoning benchmarks such as VQAv2.

Abstract

Despite recent progress in multimodal large language models (MLLMs), reliable visual question answering in aerial scenes remains challenging. In such scenes, task-critical evidence is often carried by small objects, explicit quantities, coarse locations, and inter-object relations, whereas conventional dense visual-token representations are not well aligned with these structured semantics. To address this interface mismatch, we propose AeroRAG, a scene-graph-guided multimodal retrieval-augmented generation framework for visual question answering. The framework first converts an input image into structured visual knowledge, including object categories, quantities, spatial locations, and semantic relations, and then retrieves query-relevant semantic chunks to construct compact prompts for a text-based large language model. Rather than relying on direct reasoning over dense visual tokens,…
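The pipeline described in the abstract (image to structured knowledge, retrieval of query-relevant chunks, compact text prompt) can be illustrated with a minimal sketch. This is not the authors' implementation: the `SceneObject` and `Relation` types, the chunk format, and the prompt template are all hypothetical, the scene graph is written by hand rather than produced by a detector, and the retriever is a crude word-overlap score standing in for whatever retrieval method the paper actually uses.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    obj_id: int
    category: str
    region: str  # coarse location label (hypothetical vocabulary)


@dataclass(frozen=True)
class Relation:
    subject_id: int
    predicate: str
    object_id: int


def scene_to_chunks(objects, relations):
    """Serialize a scene graph into short text chunks (hypothetical format)."""
    by_id = {o.obj_id: o for o in objects}
    chunks = []
    # Quantity chunks: one per object category, in alphabetical order.
    for category, n in sorted(Counter(o.category for o in objects).items()):
        chunks.append(f"count: {n} {category}")
    # Location chunks: one per object.
    for o in objects:
        chunks.append(f"location: {o.category} #{o.obj_id} in {o.region}")
    # Relation chunks: one per subject-predicate-object triple.
    for r in relations:
        s, t = by_id[r.subject_id], by_id[r.object_id]
        chunks.append(
            f"relation: {s.category} #{s.obj_id} {r.predicate} "
            f"{t.category} #{t.obj_id}"
        )
    return chunks


def tokenize(text):
    """Lowercase words with trailing 's' stripped (crude plural handling)."""
    return {w.rstrip("s") for w in re.findall(r"[a-z]+", text.lower())}


def retrieve(question, chunks, k=3):
    """Return the k chunks sharing the most words with the question.

    Assumption: word overlap is only a stand-in for a real retriever.
    """
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=3):
    """Assemble a compact text-only prompt from the retrieved chunks."""
    facts = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return f"Scene facts:\n{facts}\nQuestion: {question}\nAnswer:"


# Hand-written toy scene graph; a real system would derive it from the image.
objects = [
    SceneObject(1, "car", "top-left"),
    SceneObject(2, "car", "top-left"),
    SceneObject(3, "building", "center"),
]
relations = [Relation(1, "parked near", 3)]

chunks = scene_to_chunks(objects, relations)
question = "How many cars are parked near the building?"
prompt = build_prompt(question, chunks)
print(prompt)
```

The resulting prompt contains only a few short facts plus the question, which is the point of the design: the language model reasons over explicit counts, locations, and relations instead of dense visual tokens.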
