Preserve and Sculpt: Manifold-Aligned Fine-tuning of Vision-Language Models for Few-Shot Learning

Dexia Chen; Qianjie Zhu; Weibing Li; Yue Yu; Tong Zhang; Ruixuan Wang

arXiv:2508.12877·cs.CV·August 19, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces MPS-Tuning, a novel fine-tuning method for vision-language models that preserves the geometric structure of data manifolds to improve few-shot learning performance.

Contribution

MPS-Tuning explicitly constrains the semantic manifold's geometry during fine-tuning, aligning feature structures and enhancing class separability in vision-language models.

Findings

01. Significant performance improvements in few-shot image classification.

02. Effective preservation of the semantic data manifold structure.

03. Theoretical connection to Gromov-Wasserstein distance.

Abstract

Pretrained vision-language models (VLMs), such as CLIP, have shown remarkable potential in few-shot image classification and led to numerous effective transfer learning strategies. These methods leverage the pretrained knowledge of VLMs to enable effective domain adaptation while mitigating overfitting through parameter-efficient tuning or instance-based consistency constraints. However, such regularizations often neglect the geometric structure of data distribution, which may lead to distortion of the overall semantic representation. To overcome this limitation, we propose a novel fine-tuning method, Manifold-Preserving and Sculpting Tuning (MPS-Tuning). Regarding the data distribution in feature space as a semantic manifold, MPS-Tuning explicitly constrains the intrinsic geometry of this manifold while further sculpting it to enhance class separability. Specifically, MPS-Tuning…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The idea of preserving knowledge via manifold structure regularization is intuitive and well-motivated.
- The proposed method achieves strong performance compared to state-of-the-art VLM few-shot learning methods, especially when the number of training samples is large (e.g., 16 shots).
- The paper is solid, with extensive experimental results supporting the major claims regarding the effectiveness of MAR and HMS.

Weaknesses

- The method has an intrinsic limitation when the number of training samples is small: the Gram matrix is small and insufficient to capture the geometry. As a consequence, the 1-shot and 2-shot accuracies of MPS-Tuning are lower than those of SOTA methods on several datasets. The authors are encouraged to discuss this limitation and potential solutions when analyzing the results in Figure 4.
- In Table 2, the ablation study is only conducted under the 16-shot setting, which is insufficient to show the eff

Reviewer 02 · Rating 8 · Confidence 4

Strengths

Novel and Well-Motivated Conceptual Framework: The paper's primary strength is its conceptual shift from point-wise consistency to manifold-level consistency. The idea that preserving relationships between samples (in the Gram matrix) is more important than preserving the exact feature vector of each sample is intuitive and powerful.

Strong Theoretical Grounding: The connection of the MAR loss to the Gromov-Wasserstein (GW) distance (Theorem 1, Appendix B) provides a solid theoretical foundatio
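The contrast between point-wise and manifold-level consistency described in this review can be illustrated with a toy example. The sketch below is not the paper's MAR loss: the function names, the cosine-similarity Gram matrix, and the squared-error penalties are all assumptions chosen for illustration. It shows that rotating every feature by the same orthogonal matrix changes each individual feature vector (large point-wise penalty) while leaving all pairwise relationships, and therefore the Gram matrix, unchanged.

```python
import numpy as np


def gram(features):
    """Cosine-similarity Gram matrix of row-wise feature vectors (assumed form)."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    return normed @ normed.T


def pointwise_loss(tuned, pretrained):
    """Hypothetical point-wise penalty: mean squared distance per sample."""
    return float(np.mean(np.sum((tuned - pretrained) ** 2, axis=1)))


def gram_alignment_loss(tuned, pretrained):
    """Hypothetical manifold-level penalty: mean squared Gram-matrix difference."""
    return float(np.mean((gram(tuned) - gram(pretrained)) ** 2))


rng = np.random.default_rng(0)
pretrained = rng.normal(size=(8, 4))  # 8 samples, 4-dimensional features

# A random orthogonal matrix (Q factor of a QR decomposition) acts as a
# rotation/reflection: it moves every point but preserves all inner products.
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
rotated = pretrained @ q

# Independent noise per sample changes the pairwise relationships as well.
perturbed = pretrained + 0.5 * rng.normal(size=(8, 4))

print("rotated:   point-wise", pointwise_loss(rotated, pretrained),
      "gram", gram_alignment_loss(rotated, pretrained))
print("perturbed: point-wise", pointwise_loss(perturbed, pretrained),
      "gram", gram_alignment_loss(perturbed, pretrained))
```

Under these assumptions, a point-wise constraint would penalize the rotated features even though their relational structure is intact, whereas a Gram-matrix constraint only penalizes changes to that structure.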

Weaknesses

The paper is very strong, and my points are primarily requests for clarification rather than major criticisms.

1. Justification of "Pseudo Forward" Mechanism: A key component of the "Hierarchical" sculpting is the pseudo-forward projection (Fig. 3, Eq. 10), which projects intermediate features to the output space by skipping the Attention modules but keeping the $V_{Proj}$ and $FFN$ layers. This is a very specific and unusual design.

Q1: Could the authors provide more intuition or justificatio

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. The paper attempts to align the semantic spaces of images and texts from a manifold perspective, and designs two complementary modules, Manifold Alignment Regularization (MAR) and Hierarchical Manifold Sculpting (HMS), to jointly balance knowledge preservation and adaptation to new domains. It also provides a theoretical foundation for semantic alignment under the Gromov–Wasserstein (GW) constraint.
2. The paper conducts extensive experiments across multiple datasets, along with comprehens

Weaknesses

1. Although the paper claims to constrain manifold alignment through the Gromov–Wasserstein (GW) distance, the metrics and formulations it employs do not appear to have an explicit mathematical correspondence to manifold geometry. Instead, the approach seems more accurately described as aligning semantic feature distributions rather than true manifolds. The use of the manifold concept thus appears somewhat overstretched.
2. Since the proposed GM alignment serves as an upper-bound approximation
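For readers unfamiliar with why a Gram-matrix penalty can act as an upper bound on a GW distance, the following is a short sketch under simplifying assumptions: both feature sets contain the same $n$ samples with uniform weights, and the squared-loss GW discrepancy is used. The symbols $G$, $G'$, $T$, and $n$ are notation introduced here for illustration; the paper's Theorem 1 may be stated differently.

```latex
% Assumed notation (not taken from the paper): G and G' are the n x n Gram
% matrices of the pretrained and fine-tuned features; each sample has weight 1/n.
\[
\mathrm{GW}_2^2(G, G') \;=\;
\min_{T \in \Pi\left(\frac{1}{n}\mathbf{1},\, \frac{1}{n}\mathbf{1}\right)}
\sum_{i,j,k,l} \bigl(G_{ik} - G'_{jl}\bigr)^2 \, T_{ij}\, T_{kl}
\]
% The identity coupling T = (1/n) I has the required uniform marginals, so it is
% feasible; substituting it collapses the sum to the terms with j = i and l = k.
\[
\mathrm{GW}_2^2(G, G') \;\le\;
\frac{1}{n^2} \sum_{i,k} \bigl(G_{ik} - G'_{ik}\bigr)^2
\;=\; \frac{1}{n^2}\,\lVert G - G' \rVert_F^2 .
\]
```

The bound holds because the minimum over all couplings cannot exceed the value at one particular coupling; how tight it is depends on how far the optimal coupling is from the identity.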

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications