LGCA: Enhancing Semantic Representation via Progressive Expansion
Thanh Hieu Cao, Trung Khang Tran, Gia Thinh Pham, Tuong Nghiem Diep, and Thanh Binh Nguyen

TL;DR
LGCA is a novel framework that enhances semantic image-text alignment by capturing local features and expanding salient regions, improving zero-shot classification performance while maintaining efficiency.
Contribution
The paper introduces LGCA, a method that combines local feature capture and region expansion to improve vision-language model alignment without increasing computational complexity.
Findings
Significant improvement in zero-shot classification accuracy.
Outperforms state-of-the-art baselines on multiple datasets.
Maintains the same time complexity as the original model.
Abstract
Recent advancements in large-scale pretraining in natural language processing have enabled pretrained vision-language models such as CLIP to effectively align images and text, significantly improving performance in zero-shot image classification tasks. Subsequent studies have demonstrated that cropping images into smaller regions and using large language models to generate multiple descriptions for each caption can further enhance model performance. However, because CLIP is inherently sensitive to its input, random image crops can introduce misinformation and bias, as many images share similar features at small scales. To address this issue, we propose Localized-Globalized Cross-Alignment (LGCA), a framework that first captures the local features of an image and then repeatedly selects the most salient regions and expands them. The similarity score is designed to incorporate both the…
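The select-and-expand loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `keep`, `steps`, and `factor` parameters, the use of plain cosine similarity as the saliency measure, and the `toy_encode` stand-in (which replaces a real CLIP image encoder) are all assumptions made for illustration. The abstract is truncated before it defines the final similarity score, so that part is deliberately left out.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors (0 if either is all zeros)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def expand_box(box, factor, height, width):
    """Grow a (top, left, bottom, right) box about its centre, clipped to the image."""
    top, left, bottom, right = box
    cy, cx = (top + bottom) / 2, (left + right) / 2
    half_h = (bottom - top) * factor / 2
    half_w = (right - left) * factor / 2
    return (
        max(0, int(round(cy - half_h))),
        max(0, int(round(cx - half_w))),
        min(height, int(round(cy + half_h))),
        min(width, int(round(cx + half_w))),
    )

def select_and_expand(image, text_emb, encode, boxes, keep=2, steps=3, factor=1.5):
    """Hypothetical loop: score crops against the text, keep the best, enlarge them."""
    height, width = image.shape[:2]
    for _ in range(steps):
        # Saliency here is assumed to be crop-to-text cosine similarity.
        scores = [cosine(encode(image[t:b, l:r]), text_emb) for t, l, b, r in boxes]
        order = np.argsort(scores)[::-1][:keep]  # indices of the top-scoring crops
        boxes = [expand_box(boxes[i], factor, height, width) for i in order]
    return boxes

def toy_encode(crop):
    """Stand-in encoder (NOT CLIP): the mean colour of the crop."""
    return crop.reshape(-1, crop.shape[-1]).mean(axis=0)
```

In practice `encode` would wrap a real image encoder and `text_emb` would come from the matching text encoder; the toy encoder only exists so the sketch runs without model weights. Starting from small crops and growing only the ones that match the text is one way to read "progressive expansion", in contrast to scoring random crops directly.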
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
