Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection

Jingyao Wang; Yiming Chen; Lingyu Si; Changwen Zheng

arXiv:2507.13061·cs.CV·October 21, 2025

TL;DR

This paper introduces a Hierarchical Coresets Selection mechanism that enhances Vision-Language Models' ability to understand complex wide-area scenes efficiently without additional fine-tuning.

Contribution

The paper proposes a theoretically grounded, plug-and-play region selection method that helps VLMs adapt to unseen scenes at any scale using a minimal set of regions.

Findings

01. HCS improves scene understanding accuracy across various tasks.

02. HCS enables rapid adaptation to unseen complex scenes.

03. HCS is compatible with any VLM and requires no extra fine-tuning.

Abstract

Scene understanding is one of the core tasks in computer vision, aiming to extract semantic information from images to identify objects, scene categories, and their interrelationships. Although advances in Vision-Language Models (VLMs) have driven progress in this field, existing VLMs still struggle to adapt to unseen, complex wide-area scenes. To address these challenges, this paper proposes a Hierarchical Coresets Selection (HCS) mechanism to improve the adaptation of VLMs in complex wide-area scene understanding. It progressively refines the selected regions based on the proposed theoretically guaranteed importance function, which considers utility, representativeness, robustness, and synergy. Without requiring additional fine-tuning, HCS enables VLMs to achieve rapid understanding of unseen scenes at any scale using minimal interpretable regions while mitigating…
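The progressive refinement described in the abstract can be illustrated with a generic coarse-to-fine selection loop. The following is a minimal sketch under stated assumptions, not the paper's algorithm: the quadrant split, the `keep` and `levels` parameters, and every function name are hypothetical, and `mean_score` is a stand-in scoring rule that averages a saliency grid. It does not model the utility, representativeness, robustness, or synergy terms of the paper's importance function, whose definition is not given on this page.

```python
from typing import Callable, List, Sequence, Tuple

# A region is (row0, col0, row1, col1) with end-exclusive bounds.
Region = Tuple[int, int, int, int]


def split(region: Region) -> List[Region]:
    """Split a region into up to four quadrants, dropping empty ones."""
    r0, c0, r1, c1 = region
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    quads = [(r0, c0, rm, cm), (r0, cm, rm, c1), (rm, c0, r1, cm), (rm, cm, r1, c1)]
    return [q for q in quads if q[2] > q[0] and q[3] > q[1]]


def hierarchical_select(
    root: Region,
    importance: Callable[[Region], float],
    keep: int,
    levels: int,
) -> List[Region]:
    """Coarse-to-fine selection: split, score, keep the top `keep`, repeat."""
    selected = [root]
    for _ in range(levels):
        candidates = [q for region in selected for q in split(region)]
        # Highest score first; Python's sort is stable, so ties keep split order.
        candidates.sort(key=importance, reverse=True)
        selected = candidates[:keep]
    return selected


def mean_score(grid: Sequence[Sequence[float]]) -> Callable[[Region], float]:
    """Hypothetical importance: mean grid value inside the region."""
    def importance(region: Region) -> float:
        r0, c0, r1, c1 = region
        values = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        return sum(values) / len(values)
    return importance


# Toy 4x4 "saliency" grid with a single hot cell at row 3, column 0.
grid = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
regions = hierarchical_select((0, 0, 4, 4), mean_score(grid), keep=1, levels=2)
print(regions)
```

On this toy grid, two levels of refinement narrow the full 4x4 area down to the single hot cell. Passing the scoring rule in as a callable keeps the loop independent of how importance is defined, so a richer function could be substituted without changing the selection code.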
