Hierarchical Contextual Grounding LVLM: Enhancing Fine-Grained Visual-Language Understanding with Robust Grounding
Leilei Guo, Antonio Carlos Rivera, Peiyu Tang, Haoxuan Ren, Zheyu Song

TL;DR
The paper introduces HCG-LVLM, a hierarchical model that improves fine-grained visual-language understanding by mimicking human coarse-to-fine processing, yielding more accurate results with fewer hallucinations.
Contribution
It proposes a novel hierarchical architecture with dual-layered perception and grounding modules, enhancing robustness and precision in visual-language tasks.
Findings
Outperforms state-of-the-art models on GQA, A-OKVQA, and RefCOCO datasets.
Reduces hallucination and improves accuracy in fine-grained visual reasoning.
Demonstrates robustness across multiple challenging multimodal datasets.
Abstract
Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) have achieved remarkable progress in natural language processing and multimodal understanding. Despite their impressive generalization capabilities, current LVLMs often exhibit insufficient robustness, proneness to hallucination, and reasoning errors in complex real-world scenarios, particularly when precise image region localization and fine-grained visual reasoning are required. To address these limitations, we propose the Hierarchical Contextual Grounding LVLM (HCG-LVLM), a novel architecture that mimics human coarse-to-fine cognitive processing. HCG-LVLM employs a two-layered approach: a Global Contextual Perception layer for initial broad understanding and a Fine-grained Local Grounding layer. The latter incorporates a Local Detail Enhancement Module to extract high-resolution features and a Semantic Consistency…
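The coarse-to-fine idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`global_context`, `select_region`, `local_detail`, `coarse_to_fine`) are hypothetical, the "image" is a plain 2D list of numbers, and mean intensity stands in for whatever relevance signal the real model derives from the query. It only shows the general pattern of a broad low-resolution pass followed by a high-resolution look at one selected region.

```python
from typing import List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def global_context(image: Image, cell: int) -> List[List[float]]:
    """Coarse pass: average-pool the image into cell x cell blocks."""
    rows, cols = len(image), len(image[0])
    coarse = []
    for r in range(0, rows, cell):
        coarse_row = []
        for c in range(0, cols, cell):
            block = [image[i][j]
                     for i in range(r, min(r + cell, rows))
                     for j in range(c, min(c + cell, cols))]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse


def select_region(coarse: List[List[float]], cell: int) -> Box:
    """Pick the coarse cell with the highest score (assumed relevance proxy)
    and map it back to a full-resolution bounding box."""
    _, r, c = max((v, r, c)
                  for r, row in enumerate(coarse)
                  for c, v in enumerate(row))
    return (r * cell, c * cell, (r + 1) * cell, (c + 1) * cell)


def local_detail(image: Image, box: Box) -> Image:
    """Fine pass: crop the selected region at full resolution."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def coarse_to_fine(image: Image, cell: int = 2):
    """Run the coarse pass, choose a region, then extract its fine detail."""
    coarse = global_context(image, cell)
    box = select_region(coarse, cell)
    return coarse, box, local_detail(image, box)


if __name__ == "__main__":
    image = [
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 2],
        [0, 0, 3, 4],
    ]
    coarse, box, patch = coarse_to_fine(image, cell=2)
    print(coarse, box, patch)
```

In a real LVLM the coarse pass would be a vision encoder over a downsampled image and the region choice would be conditioned on the text query; the sketch keeps only the two-stage control flow.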
