Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection
Jingyao Wang, Yiming Chen, Lingyu Si, Changwen Zheng

TL;DR
This paper introduces a Hierarchical Coresets Selection mechanism that enhances Vision-Language Models' ability to understand complex wide-area scenes efficiently without additional fine-tuning.
Contribution
It proposes a theoretically grounded, plug-and-play selection method that improves VLM adaptation to unseen scenes at any scale with minimal regions.
Findings
HCS improves scene understanding accuracy across various tasks.
HCS enables rapid adaptation to unseen complex scenes.
HCS is compatible with any VLM and requires no extra fine-tuning.
Abstract
Scene understanding is one of the core tasks in computer vision, aiming to extract semantic information from images to identify objects, scene categories, and their interrelationships. Although advancements in Vision-Language Models (VLMs) have driven progress in this field, existing VLMs still face challenges in adapting to unseen complex wide-area scenes. To address these challenges, this paper proposes a Hierarchical Coresets Selection (HCS) mechanism to advance the adaptation of VLMs in complex wide-area scene understanding. It progressively refines the selected regions based on the proposed theoretically guaranteed importance function, which considers utility, representativeness, robustness, and synergy. Without requiring additional fine-tuning, HCS enables VLMs to achieve rapid understanding of unseen scenes at any scale using minimal interpretable regions while mitigating…
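The progressive, coarse-to-fine refinement described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the quadrant splitting, the `keep` and `levels` parameters, and the `mean_score` stand-in are all assumptions made for illustration. The paper's importance function (utility, representativeness, robustness, synergy) is not defined in this summary, so the sketch accepts any scoring callable in its place.

```python
from typing import Callable, List, Sequence, Tuple

# A region is (row0, col0, row1, col1) with end-exclusive bounds.
Region = Tuple[int, int, int, int]
Grid = Sequence[Sequence[float]]


def split_quadrants(region: Region) -> List[Region]:
    """Split a region into up to four non-empty quadrants."""
    r0, c0, r1, c1 = region
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    quads = [(r0, c0, rm, cm), (r0, cm, rm, c1),
             (rm, c0, r1, cm), (rm, cm, r1, c1)]
    return [q for q in quads if q[2] > q[0] and q[3] > q[1]]


def hierarchical_select(grid: Grid,
                        importance: Callable[[Grid, Region], float],
                        keep: int = 1,
                        levels: int = 2) -> List[Region]:
    """Keep the top-`keep` sub-regions at each level, then refine them.

    `importance` is a placeholder for the paper's importance function,
    which this summary does not specify.
    """
    selected: List[Region] = [(0, 0, len(grid), len(grid[0]))]
    for _ in range(levels):
        candidates = [q for r in selected for q in split_quadrants(r)]
        if not candidates:
            break
        candidates.sort(key=lambda q: importance(grid, q), reverse=True)
        selected = candidates[:keep]
    return selected


def mean_score(grid: Grid, region: Region) -> float:
    """Toy stand-in score: mean cell value inside the region (assumption)."""
    r0, c0, r1, c1 = region
    vals = [grid[i][j] for i in range(r0, r1) for j in range(c0, c1)]
    return sum(vals) / len(vals)


if __name__ == "__main__":
    # 4x4 toy "scene" with a single salient cell at row 3, column 2.
    scene = [[0.0] * 4 for _ in range(4)]
    scene[3][2] = 9.0
    print(hierarchical_select(scene, mean_score, keep=1, levels=2))
    # prints [(3, 2, 4, 3)]
```

In this toy run, the first level keeps the bottom-right quadrant and the second level narrows it to the single salient cell, so only a small part of the scene would need to be passed on to a downstream model.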
