LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment
Junyi Hu, Qiji Zhou, Lei Zhang, and Yue Zhang

TL;DR
LAGO is a novel framework for zero-shot visual-text alignment that adaptively focuses on relevant object regions guided by language, improving efficiency and robustness in fine-grained recognition tasks.
Contribution
LAGO introduces a two-stage process: object-centric candidate discovery, followed by adaptive language-guided refinement. This reduces inference cost and avoids the error amplification that arises when semantic guidance is introduced too early.
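To make the two-stage idea concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function names, the `objectness` scores, the top-k selection rule, the softmax weighting, and the `temperature` value are all assumptions chosen for illustration, and the embeddings stand in for the outputs of a CLIP-like image and text encoder. Stage 1 picks candidate regions without looking at any class text; stage 2 then uses the class descriptions only to reweight those candidates.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def select_candidates(objectness, k):
    """Stage 1 (assumed rule): keep the k regions with the highest
    language-agnostic objectness score. No class text is used here."""
    k = min(k, objectness.shape[0])
    return np.argsort(-objectness)[:k]


def language_guided_scores(region_emb, text_emb, temperature=0.1):
    """Stage 2 (assumed rule): for each class, softmax-weight the candidate
    regions by their similarity to that class description, then return the
    weighted similarity as the class score."""
    sims = l2_normalize(region_emb) @ l2_normalize(text_emb).T  # (regions, classes)
    logits = sims / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)
    return (weights * sims).sum(axis=0)  # (classes,)


def classify(region_emb, objectness, text_emb, k=4):
    """Run both stages and return the predicted class index and all scores."""
    keep = select_candidates(objectness, k)
    scores = language_guided_scores(region_emb[keep], text_emb)
    return int(np.argmax(scores)), scores


# Toy data: 3 regions, 3 classes, 3-dimensional embeddings (hypothetical values).
region_emb = np.array([[0.0, 1.0, 0.0],
                       [1.0, 1.0, 1.0],
                       [1.0, 0.0, 0.0]])
objectness = np.array([0.9, 0.8, 0.1])
text_emb = np.eye(3)

pred, scores = classify(region_emb, objectness, text_emb, k=2)
```

Because the candidate set in this sketch is fixed before any text is consulted, an early wrong class guess cannot steer which regions are examined, which is one simple way to keep language guidance out of the discovery step.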
Findings
LAGO achieves state-of-the-art results on zero-shot benchmarks.
LAGO requires fewer candidate regions during inference.
LAGO demonstrates robustness under distribution shifts.
Abstract
Zero-shot recognition aims to classify an image by selecting the most compatible label description from a set of candidate classes without any task-specific supervision. In fine-grained settings, however, the relevant evidence often lies in localized parts, attributes, or textures rather than in the full image, making whole-image alignment suboptimal. Recent localized visual-text alignment methods address this by comparing class descriptions with multiple image regions, but they typically rely on large sets of random crops, which raises inference cost and yields many redundant or weakly relevant candidates. Moreover, introducing semantic guidance too early can create an error-amplifying feedback process in which inaccurate intermediate predictions bias later localization and reinforce subsequent mistakes; we refer to this failure mode as the prediction loop. We…
