Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering

Junxiao Xue; Quan Deng; Fei Yu; Yanhao Wang; Jun Wang; Yuehua Li

arXiv:2412.20927 · cs.CV · December 31, 2024 · 2 cites

Open Access

TL;DR

This paper introduces a multimodal retrieval-augmented generation framework with scene graphs to improve object recognition, spatial understanding, and accuracy in visual question answering, especially in complex scenes.

Contribution

The novel framework integrates structured scene graphs into MLLMs, enhancing their ability to recognize objects and understand spatial relationships in challenging visual contexts.

Findings

01. Outperforms existing MLLMs in VQA accuracy

02. Improves object localization and counting in complex scenes

03. Achieves better results on VG-150 and AUG datasets

Abstract

Multimodal large language models (MLLMs), such as GPT-4o, Gemini, LLaVA, and Flamingo, have made significant progress in integrating visual and textual modalities, excelling in tasks like visual question answering (VQA), image captioning, and content retrieval. They can generate coherent and contextually relevant descriptions of images. However, they still face challenges in accurately identifying and counting objects and determining their spatial locations, particularly in complex scenes with overlapping or small objects. To address these limitations, we propose a novel framework based on multimodal retrieval-augmented generation (RAG), which introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images. Our framework improves the MLLM's capacity to handle tasks requiring precise visual descriptions, especially in…
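To make the idea of augmenting an MLLM with structured scene graphs more concrete, here is a minimal Python sketch. It is not the authors' implementation: the class names (`SceneObject`, `Relation`), the helper functions, the toy data, and the keyword-overlap retrieval step are all hypothetical stand-ins, chosen only to illustrate how object and relationship facts from a scene graph could be retrieved for a question and placed in a text prompt. The sketch assumes an upstream detector has already produced the objects and relations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """One detected object (hypothetical schema, not from the paper)."""
    obj_id: int
    label: str
    box: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass(frozen=True)
class Relation:
    """A directed edge: subject --predicate--> object."""
    subject_id: int
    predicate: str
    object_id: int


def retrieve_facts(objects, relations, question):
    """Select objects whose label (or naive plural) appears in the question,
    plus every relation touching one of them. Keyword overlap is an assumed
    placeholder for a real retrieval step."""
    words = set(question.lower().replace("?", " ").split())
    by_id = {o.obj_id: o for o in objects}
    hits = [o for o in objects if o.label in words or o.label + "s" in words]
    hit_ids = {o.obj_id for o in hits}
    facts = [f"{o.label}#{o.obj_id} at box {o.box}" for o in hits]
    for r in relations:
        if r.subject_id in hit_ids or r.object_id in hit_ids:
            s, t = by_id[r.subject_id], by_id[r.object_id]
            facts.append(f"{s.label}#{s.obj_id} {r.predicate} {t.label}#{t.obj_id}")
    return facts


def build_prompt(facts, question):
    """Serialize retrieved facts as bullet lines ahead of the question."""
    context = "\n".join(f"- {f}" for f in facts)
    return f"Scene graph facts:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy scene: two cups on a table (made-up data for illustration).
objects = [
    SceneObject(0, "cup", (10, 20, 50, 60)),
    SceneObject(1, "cup", (70, 20, 110, 60)),
    SceneObject(2, "table", (0, 50, 200, 150)),
]
relations = [Relation(0, "on", 2), Relation(1, "on", 2)]

question = "How many cups are on the table?"
facts = retrieve_facts(objects, relations, question)
prompt = build_prompt(facts, question)
```

The resulting `prompt` string would be passed to an MLLM alongside the image. Listing each object with an identifier and bounding box is one plausible way to give the model explicit support for counting and localization questions.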

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning