LPT: Less-overfitting Prompt Tuning for Vision-Language Model

Chenhao Ding; Xinyuan Gao; Songlin Dong; Jizhou Han; Qiang Wang; Zhengdong Zhou; Yuhang He; Yihong Gong

arXiv:2410.10247·cs.CV·May 12, 2026

TL;DR

This paper introduces LPT, a prompt tuning framework for vision-language models that reduces overfitting and enhances generalization by filtering visual information, preserving feature structure, and constraining output class information.

Contribution

LPT employs CLIP-guided filtering, structural preservation, and hierarchical logit constraints to effectively mitigate overfitting in prompt tuning for VLMs, improving transfer performance.
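As a rough illustration of what a feature-level structure-preserving constraint can look like, the sketch below compares the pairwise cosine-similarity matrix of prompt-tuned features with that of a reference feature set and returns their mean squared gap. This is a generic similarity-matching penalty, not the paper's SP constraint: the summary here does not give the actual formulation, and the function names, the choice of cosine similarity, and the use of a reference feature set are all assumptions made for illustration.

```python
import math


def cosine_similarity_matrix(features):
    """Return the pairwise cosine-similarity matrix for a list of feature vectors."""
    normed = []
    for vec in features:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            raise ValueError("zero-norm feature vector")
        normed.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]


def structure_gap(tuned_features, reference_features):
    """Hypothetical penalty: mean squared difference between the two
    similarity matrices. Zero means the relative geometry is unchanged."""
    if len(tuned_features) != len(reference_features):
        raise ValueError("both feature sets must contain the same number of vectors")
    s_tuned = cosine_similarity_matrix(tuned_features)
    s_ref = cosine_similarity_matrix(reference_features)
    n = len(s_tuned)
    total = sum(
        (s_tuned[i][j] - s_ref[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)


# Toy example (made-up vectors): the reference keeps two classes orthogonal,
# while the tuned features have collapsed onto a single direction.
reference = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]
gap = structure_gap(collapsed, reference)
```

A penalty of this kind is insensitive to feature scale, since it only looks at angles between vectors, so it constrains the relative layout of the feature space rather than the features themselves.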

Findings

01. LPT significantly outperforms state-of-the-art methods on various benchmarks.

02. The approach enhances generalization in cross-dataset and domain transfer tasks.

03. Structural preservation and hierarchical logit constraints effectively reduce overfitting.

Abstract

Vision-language models (VLMs) have demonstrated exceptional generalization capabilities for downstream tasks. Prompt learning has gradually become a more effective and efficient way to transfer VLMs to downstream tasks than traditional fine-tuning. However, during the transfer process, these models are prone to severe overfitting, leading to a significant decline in generalization ability. To address this issue, we propose LPT, a framework designed specifically for vision-language models. We use CLIP to filter out fine-grained foreground information that may lead to overfitting, thereby guiding the prompts with basic visual concepts. To further mitigate overfitting, we develop a Structural Preservation (SP) constraint at the feature level, which aligns the model's overall feature space structure with…
