Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
Junxiao Xue, Quan Deng, Fei Yu, Yanhao Wang, Jun Wang, and Yuehua Li

TL;DR
This paper introduces a multimodal retrieval-augmented generation framework with scene graphs to improve object recognition, spatial understanding, and accuracy in visual question answering, especially in complex scenes.
Contribution
The novel framework integrates structured scene graphs into MLLMs, enhancing their ability to recognize objects and understand spatial relationships in challenging visual contexts.
Findings
Outperforms existing MLLMs in VQA accuracy
Improves object localization and counting in complex scenes
Achieves better results on VG-150 and AUG datasets
Abstract
Multimodal large language models (MLLMs), such as GPT-4o, Gemini, LLaVA, and Flamingo, have made significant progress in integrating visual and textual modalities, excelling in tasks like visual question answering (VQA), image captioning, and content retrieval. They can generate coherent and contextually relevant descriptions of images. However, they still face challenges in accurately identifying and counting objects and determining their spatial locations, particularly in complex scenes with overlapping or small objects. To address these limitations, we propose a novel framework based on multimodal retrieval-augmented generation (RAG), which introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images. Our framework improves the MLLM's capacity to handle tasks requiring precise visual descriptions, especially in…
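The abstract does not spell out how a scene graph reaches the model, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a scene graph is already available as labeled objects with bounding boxes plus subject-predicate-object relations, retrieves the relations that mention a word from the question, and serializes them with per-label object counts into a text context that could accompany the image in an MLLM prompt. All names here (`SceneObject`, `Relation`, `retrieve_relations`, `build_prompt`) and the keyword-overlap retrieval are hypothetical choices made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    obj_id: int
    label: str
    box: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), assumed pixel units


@dataclass(frozen=True)
class Relation:
    subject_id: int
    predicate: str
    object_id: int


def retrieve_relations(question, objects, relations):
    """Keep relations whose subject or object label occurs as a word in the question.

    Hypothetical stand-in for a real retriever: plain exact-word matching,
    so a plural such as "cups" does not match the label "cup".
    """
    words = set(question.lower().replace("?", "").split())
    by_id = {o.obj_id: o for o in objects}
    return [
        r for r in relations
        if by_id[r.subject_id].label in words or by_id[r.object_id].label in words
    ]


def build_prompt(question, objects, relations):
    """Serialize object counts and retrieved relations into a text context."""
    by_id = {o.obj_id: o for o in objects}
    counts = Counter(o.label for o in objects)
    lines = ["Objects (label: count):"]
    lines += [f"- {label}: {n}" for label, n in sorted(counts.items())]
    lines.append("Relations:")
    for r in retrieve_relations(question, objects, relations):
        s, o = by_id[r.subject_id], by_id[r.object_id]
        lines.append(f"- {s.label}#{s.obj_id} {r.predicate} {o.label}#{o.obj_id}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Toy scene invented for this example: two cups on a table, a lamp near a chair.
objects = [
    SceneObject(0, "cup", (10, 10, 50, 50)),
    SceneObject(1, "cup", (60, 10, 100, 50)),
    SceneObject(2, "table", (0, 40, 200, 120)),
    SceneObject(3, "lamp", (210, 0, 240, 90)),
    SceneObject(4, "chair", (220, 60, 300, 160)),
]
relations = [
    Relation(0, "on", 2),
    Relation(1, "on", 2),
    Relation(3, "near", 4),
]

prompt = build_prompt("How many cups are on the table?", objects, relations)
print(prompt)
```

Listing explicit counts and object identifiers in the context targets the counting and localization weaknesses the abstract describes. A real system would presumably replace the keyword match with a learned retriever and obtain the graph from a scene graph generation model.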
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
