AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning
Junxiao Xue, Quan Deng, Tingqi Hu, Meicong Si, Xinyi Yin, Yunyun Shi, Xuecheng Wu

TL;DR
AeroRAG introduces a scene-graph-guided multimodal retrieval-augmented framework that enhances aerial visual question answering by explicitly structuring visual knowledge for better reasoning.
Contribution
The paper presents AeroRAG, which converts aerial images into structured visual knowledge (object categories, quantities, spatial locations, and semantic relations) and retrieves query-relevant chunks of that knowledge to build compact prompts for a text-based large language model.
Findings
AeroRAG reports significant performance improvements on the AUG and VG-150 datasets.
The largest gains appear in dense aerial scenes and in relation-sensitive reasoning.
The framework remains compatible with standard visual reasoning benchmarks such as VQAv2.
Abstract
Despite recent progress in multimodal large language models (MLLMs), reliable visual question answering in aerial scenes remains challenging. In such scenes, task-critical evidence is often carried by small objects, explicit quantities, coarse locations, and inter-object relations, whereas conventional dense visual-token representations are not well aligned with these structured semantics. To address this interface mismatch, we propose AeroRAG, a scene-graph-guided multimodal retrieval-augmented generation framework for visual question answering. The framework first converts an input image into structured visual knowledge, including object categories, quantities, spatial locations, and semantic relations, and then retrieves query-relevant semantic chunks to construct compact prompts for a text-based large language model. Rather than relying on direct reasoning over dense visual tokens,…
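The pipeline described in the abstract (structure the scene, retrieve query-relevant chunks, build a compact text prompt) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `SceneObject` and `Relation` types, the chunk text format, and the token-overlap retriever are hypothetical stand-ins and are not taken from the paper, which may use a learned scene-graph generator and a different retriever.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    obj_id: int
    category: str
    region: str  # coarse location, e.g. "top-left" (hypothetical encoding)


@dataclass(frozen=True)
class Relation:
    subject_id: int
    predicate: str
    object_id: int


def scene_to_chunks(objects, relations):
    """Serialize a scene graph into short text chunks (hypothetical format)."""
    by_id = {o.obj_id: o for o in objects}
    chunks = []
    # Quantity chunks: one per object category.
    for category, n in sorted(Counter(o.category for o in objects).items()):
        chunks.append(f"count: {n} {category}")
    # Location chunks: one per object.
    for o in objects:
        chunks.append(f"location: {o.category} #{o.obj_id} in {o.region}")
    # Relation chunks: one per (subject, predicate, object) triple.
    for r in relations:
        s, t = by_id[r.subject_id], by_id[r.object_id]
        chunks.append(
            f"relation: {s.category} #{s.obj_id} {r.predicate} {t.category} #{t.obj_id}"
        )
    return chunks


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return {w.strip("?:.,#").lower() for w in text.split()} - {""}


def retrieve(query, chunks, k=3):
    """Rank chunks by token overlap with the query.

    Assumption: a simple lexical stand-in for whatever retriever the paper uses.
    Ties are broken by original chunk order; zero-overlap chunks are dropped.
    """
    q = tokenize(query)
    scored = [(len(q & tokenize(c)), i, c) for i, c in enumerate(chunks)]
    top = sorted(scored, key=lambda x: (-x[0], x[1]))[:k]
    return [c for score, _, c in top if score > 0]


def build_prompt(query, chunks, k=3):
    """Assemble a compact text-only prompt from the retrieved chunks."""
    evidence = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Scene facts:\n{evidence}\n\nQuestion: {query}\nAnswer:"


# Toy scene: two cars and one building, with one relation between them.
demo_objects = [
    SceneObject(1, "car", "top-left"),
    SceneObject(2, "car", "top-left"),
    SceneObject(3, "building", "center"),
]
demo_relations = [Relation(1, "parked near", 3)]
demo_chunks = scene_to_chunks(demo_objects, demo_relations)
print(build_prompt("Which car is parked near the building?", demo_chunks, k=2))
```

The point of the sketch is the interface: the language model only ever sees a few short, explicit facts about counts, locations, and relations, rather than dense visual tokens. In a real system the overlap scorer would likely be replaced by an embedding-based retriever.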
