Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

Quang-Hung Le; Long Hoang Dang; Ngan Le; Truyen Tran; Thao Minh Le

arXiv:2412.08125·cs.CV·December 20, 2024

TL;DR

This paper presents PromViL, a hierarchical framework that improves large vision-language models' ability to perform grounded compositional reasoning by progressively aligning multi-modal concepts from simple to complex.

Contribution

Introduction of a hierarchical multi-granular alignment framework and a novel dataset for enhancing compositional visual reasoning in LVLMs.

Findings

01 Significant improvements on visual grounding tasks

02 Enhanced performance on compositional question answering

03 Effective hierarchical alignment strategy

Abstract

Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular Vision-Language alignments (PromViL), a novel framework to enhance LVLMs' ability to perform grounded compositional visual reasoning tasks. Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning. To facilitate this learning process, we introduce a data generation process that creates a novel dataset derived from Visual Genome, providing a wide range of nested compositional vision-language pairs. Experimental…
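The idea of nested, simple-to-complex vision-language pairs can be illustrated with a small data-structure sketch. This is not the PromViL implementation or its dataset format: the `Concept` class, the `union_box` and `progressive_order` helpers, the example phrases, and the box coordinates are all hypothetical, chosen only to show how a composite phrase might be grounded on top of its simpler parts and then visited from the lowest level upward.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical pixel coordinates: (x1, y1, x2, y2).
Box = Tuple[int, int, int, int]


@dataclass
class Concept:
    """A text phrase paired with an image region; children are simpler sub-concepts."""
    phrase: str
    box: Box
    children: List["Concept"] = field(default_factory=list)

    def level(self) -> int:
        # Leaves are level 0; a composite sits one level above its deepest child.
        if not self.children:
            return 0
        return 1 + max(child.level() for child in self.children)


def union_box(boxes: List[Box]) -> Box:
    """Smallest box covering all given boxes (an assumed way to ground a composite)."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def progressive_order(root: Concept) -> List[Tuple[int, str, Box]]:
    """Flatten the hierarchy into (level, phrase, box) triples, simplest first."""
    triples = []
    stack = [root]
    while stack:
        node = stack.pop()
        triples.append((node.level(), node.phrase, node.box))
        stack.extend(node.children)
    # Sort by level, then phrase, so the order is deterministic.
    return sorted(triples, key=lambda t: (t[0], t[1]))


# Illustrative example: two simple concepts composed into one relation-level concept.
man = Concept("man", (10, 20, 110, 220))
horse = Concept("horse", (90, 80, 300, 260))
riding = Concept("man riding a horse", union_box([man.box, horse.box]), [man, horse])

pairs = progressive_order(riding)
for level, phrase, box in pairs:
    print(level, phrase, box)
```

In this sketch the simple concepts ("horse", "man") come out before the composite that contains them, which mirrors the abstract's description of lower levels informing higher-level reasoning. How the actual framework builds, orders, and trains on such pairs is described in the paper and its repository, not here.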

Code & Models

Repositories

lqh52/promvil (PyTorch, official)

Videos

Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling