Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes
Antonio Carlos Rivera, Anthony Moore, Steven Robinson

TL;DR
This paper introduces VRAP, a retrieval-augmented prompting framework that significantly improves large vision-language models' ability to understand complex scenes with fine-grained reasoning, reducing hallucinations and inference latency.
Contribution
The paper presents a novel pipeline integrating retrieval-augmented object tags into large vision-language models, enhancing reasoning and efficiency in complex scene understanding.
Findings
Achieves state-of-the-art results on VQAv2, GQA, VizWiz, and COCO benchmarks.
Reduces inference latency by 40% through retrieval elimination.
Highlights the importance of retrieval-augmented tags and contrastive learning.
Abstract
Object-aware reasoning in vision-language tasks poses significant challenges for current models, particularly in handling unseen objects, reducing hallucinations, and capturing fine-grained relationships in complex visual scenes. To address these limitations, we propose the Vision-Aware Retrieval-Augmented Prompting (VRAP) framework, a generative approach that enhances Large Vision-Language Models (LVLMs) by integrating retrieval-augmented object tags into their prompts. VRAP introduces a novel pipeline where structured tags, including objects, attributes, and relationships, are extracted using pretrained visual encoders and scene graph parsers. These tags are enriched with external knowledge and incorporated into the LLM's input, enabling detailed and accurate reasoning. We evaluate VRAP across multiple vision-language benchmarks, including VQAv2, GQA, VizWiz, and COCO, achieving…
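The tag-to-prompt step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `ObjectTag` structure, the `TOY_KNOWLEDGE` dictionary standing in for an external knowledge source, the prompt template, and all function names are assumptions made for illustration, and the tags are supplied by hand rather than produced by a visual encoder or scene graph parser.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectTag:
    """One structured tag: an object, its attributes, and its relationships."""
    name: str
    attributes: List[str] = field(default_factory=list)
    # Each relation is (predicate, other object), e.g. ("held by", "woman").
    relations: List[Tuple[str, str]] = field(default_factory=list)
    knowledge: str = ""  # filled in by enrich_tags


# Hypothetical stand-in for an external knowledge source (assumption).
TOY_KNOWLEDGE: Dict[str, str] = {
    "umbrella": "a handheld canopy used for protection from rain or sun",
}


def enrich_tags(tags: List[ObjectTag], knowledge_base: Dict[str, str]) -> List[ObjectTag]:
    """Attach a knowledge snippet to every tag whose name is in the knowledge base."""
    for tag in tags:
        tag.knowledge = knowledge_base.get(tag.name, "")
    return tags


def format_tag(tag: ObjectTag) -> str:
    """Render one tag as a single line of prompt text."""
    parts = [" ".join(tag.attributes + [tag.name])]
    parts += [f"{predicate} {other}" for predicate, other in tag.relations]
    line = "- " + ", ".join(parts)
    if tag.knowledge:
        line += f" ({tag.knowledge})"
    return line


def build_prompt(question: str, tags: List[ObjectTag]) -> str:
    """Prepend the formatted tags to the question (template is an assumption)."""
    tag_block = "\n".join(format_tag(tag) for tag in tags)
    return f"Scene tags:\n{tag_block}\n\nQuestion: {question}\nAnswer:"


# Hand-written tags standing in for detector and scene-graph output.
example_tags = enrich_tags(
    [
        ObjectTag("umbrella", ["red"], [("held by", "woman")]),
        ObjectTag("woman"),
    ],
    TOY_KNOWLEDGE,
)
example_prompt = build_prompt("What is the woman holding?", example_tags)
print(example_prompt)
```

Keeping the tags as plain text in the prompt means the language model needs no architectural change in this sketch; whether VRAP formats its tags this way is not stated in the summary above.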
Taxonomy
Topics: Multimodal Machine Learning Applications
