Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes

Antonio Carlos Rivera; Anthony Moore; Steven Robinson

arXiv:2412.11396 · cs.CV · December 17, 2024

TL;DR

This paper introduces VRAP, a retrieval-augmented prompting framework that significantly improves large vision-language models' ability to understand complex scenes with fine-grained reasoning, reducing hallucinations and inference latency.

Contribution

The paper presents a novel pipeline integrating retrieval-augmented object tags into large vision-language models, enhancing reasoning and efficiency in complex scene understanding.

Findings

1. Achieves state-of-the-art results on the VQAv2, GQA, VizWiz, and COCO benchmarks.
2. Eliminating retrieval reduces inference latency by 40%.
3. Highlights the importance of retrieval-augmented tags and contrastive learning.

Abstract

Object-aware reasoning in vision-language tasks poses significant challenges for current models, particularly in handling unseen objects, reducing hallucinations, and capturing fine-grained relationships in complex visual scenes. To address these limitations, we propose the Vision-Aware Retrieval-Augmented Prompting (VRAP) framework, a generative approach that enhances Large Vision-Language Models (LVLMs) by integrating retrieval-augmented object tags into their prompts. VRAP introduces a novel pipeline where structured tags, including objects, attributes, and relationships, are extracted using pretrained visual encoders and scene graph parsers. These tags are enriched with external knowledge and incorporated into the LLM's input, enabling detailed and accurate reasoning. We evaluate VRAP across multiple vision-language benchmarks, including VQAv2, GQA, VizWiz, and COCO, achieving…
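
The tag-to-prompt step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `ObjectTag` and `SceneTags` structures, the `enrich` and `build_prompt` helpers, the dictionary used as a stand-in for external knowledge, and the prompt layout are all hypothetical, chosen only to show how structured tags (objects, attributes, relationships) plus looked-up knowledge might be serialized into the text an LVLM receives. In the paper the tags come from pretrained visual encoders and scene graph parsers; here they are supplied by hand.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectTag:
    """One tagged object with optional attributes (hypothetical structure)."""
    name: str
    attributes: list[str] = field(default_factory=list)


@dataclass
class SceneTags:
    """Tags for one image: objects plus (subject, predicate, object) triples."""
    objects: list[ObjectTag]
    relationships: list[tuple[str, str, str]]


def enrich(tags: SceneTags, knowledge: dict[str, str]) -> dict[str, str]:
    """Look up knowledge for each tagged object; objects without an entry are skipped.

    The plain dict is an assumption standing in for an external knowledge source.
    """
    return {o.name: knowledge[o.name] for o in tags.objects if o.name in knowledge}


def build_prompt(tags: SceneTags, knowledge: dict[str, str], question: str) -> str:
    """Serialize tags and enriched knowledge into a text prompt (assumed layout)."""
    lines = ["Objects:"]
    for o in tags.objects:
        attrs = f" ({', '.join(o.attributes)})" if o.attributes else ""
        lines.append(f"- {o.name}{attrs}")

    lines.append("Relationships:")
    for subj, pred, obj in tags.relationships:
        lines.append(f"- {subj} {pred} {obj}")

    facts = enrich(tags, knowledge)
    if facts:
        lines.append("Background knowledge:")
        for name, fact in facts.items():
            lines.append(f"- {name}: {fact}")

    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    scene = SceneTags(
        objects=[ObjectTag("dog", ["brown"]), ObjectTag("frisbee")],
        relationships=[("dog", "catching", "frisbee")],
    )
    kb = {"frisbee": "a flying disc used in games"}
    print(build_prompt(scene, kb, "What is the dog doing?"))
```

Keeping the tags in a structured form until the final serialization step makes it straightforward to drop or reorder sections of the prompt, for example when comparing prompts with and without the enriched knowledge.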

Taxonomy

Topics: Multimodal Machine Learning Applications