Feature Projection Learning for Better Vision-Language Reasoning
Yi Zhang, Weicheng Lin, Liang-Jie Zhang

TL;DR
This paper introduces Feature Projection Learning (FPL), a simple and efficient method that enhances vision-language reasoning by projecting class features into image feature space, improving accuracy over existing methods.
Contribution
The paper proposes FPL, a projection-based approach that adapts CLIP to downstream tasks with better accuracy and efficiency than existing adaptation methods.
Findings
Empirical results show that FPL surpasses state-of-the-art methods in accuracy.
FPL effectively transforms classification into a feature projection problem.
Abstract
Vision-language pre-trained (VLP) models that use contrastive learning, notably CLIP, have proven highly adept at extracting generalizable visual features. To inherit the well-learned knowledge of VLP models for downstream tasks, several approaches aim to adapt them efficiently with limited supervision. However, these methods suffer from limited performance, excessive learnable parameters, or extended training times, all of which hinder their effectiveness in adapting the CLIP model to downstream tasks. In this work, we propose a simple yet efficient and effective method called Feature Projection Learning (FPL) to address these problems. Specifically, we develop a projection model that projects class prototype features into the query image feature space and reconstructs the query image feature map. The negative average squared reconstruction error…
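The projection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the projection is a closed-form ridge regression from each class's prototype features onto the query feature map, and it assumes the negative average squared reconstruction error is used directly as the class score. The function name `reconstruction_scores`, the regularizer `lam`, and all array shapes are hypothetical choices made for illustration.

```python
import numpy as np

def reconstruction_scores(query_map, class_prototypes, lam=1e-3):
    """Score each class by how well its prototypes reconstruct the query.

    query_map:        (r, d) array, r local features of the query image.
    class_prototypes: (C, k, d) array, k prototype features per class.
    lam:              ridge regularizer (assumed, not from the paper).
    Returns a (C,) array of negative mean squared reconstruction errors.
    """
    scores = []
    for S in class_prototypes:                      # S has shape (k, d)
        # Ridge solution W = Q S^T (S S^T + lam I)^{-1}, shape (r, k).
        gram = S @ S.T + lam * np.eye(S.shape[0])
        W = np.linalg.solve(gram, S @ query_map.T).T
        recon = W @ S                               # reconstructed query map
        err = np.mean(np.sum((query_map - recon) ** 2, axis=1))
        scores.append(-err)                         # higher means better fit
    return np.array(scores)

# Toy check with synthetic features: the query lies in the span of class 1.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(3, 4, 16))            # 3 classes, 4 prototypes, d=16
query = rng.normal(size=(5, 4)) @ prototypes[1]     # 5 query locations
scores = reconstruction_scores(query, prototypes)
predicted_class = int(np.argmax(scores))
```

In this toy setup the predicted class is the one whose prototypes span the query features, since its reconstruction error is close to zero while the other classes leave a large residual.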
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
