CPT: Consistent Proxy Tuning for Black-box Optimization
Yuanyang He, Zitong Huang, Xinxing Xu, Rick Siow Mong Goh, Salman Khan, Wangmeng Zuo, Yong Liu, Chun-Mei Feng

TL;DR
CPT introduces a consistent black-box tuning method that aligns training and testing objectives by leveraging frozen models, significantly improving performance in language and vision-language models across diverse datasets.
Contribution
This paper proposes CPT, a novel black-box tuning approach that ensures training-test consistency using frozen models, enhancing performance over existing proxy-tuning methods.
Findings
CPT outperforms proxy-tuning in LLMs and VLMs across multiple datasets.
The method is model-agnostic and logit-level, applicable to various tasks.
Experimental results demonstrate significant performance improvements.
Abstract
Black-box tuning has attracted recent attention because the structure and inner parameters of advanced proprietary models are not accessible. Proxy-tuning provides a test-time output adjustment for tuning black-box language models. It applies the difference of the output logits before and after tuning a smaller white-box "proxy" model to improve the black-box model. However, this technique serves only as a decoding-time algorithm, leading to an inconsistency between training and testing which potentially limits overall performance. To address this problem, we introduce Consistent Proxy Tuning (CPT), a simple yet effective black-box tuning method. Unlike Proxy-tuning, CPT additionally exploits the frozen large black-box model and another frozen small white-box model, ensuring consistency between the training-stage optimization objective and the test-time proxies. This consistency…
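The logit arithmetic described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the models are stood in for by toy logit arrays, and the exact form of the CPT loss is an assumption inferred from the abstract and from the reviewers' mention of $\alpha_{\text{train}}$ and $\alpha_{\text{test}}$. The sketch assumes the test-time output is the tuned small model's logits plus a weighted offset (frozen large model minus frozen small model), and that CPT applies the cross-entropy loss to that same combination during training, whereas vanilla proxy-tuning trains the small model on its own logits.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the integer labels.
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(len(labels)), labels].mean()

def ensemble_logits(large_frozen, small_tuned, small_frozen, alpha):
    # Assumed test-time combination: tuned small-model logits plus the
    # alpha-weighted offset between the two frozen models.
    # With alpha = 1 this equals large + (small_tuned - small_frozen).
    return small_tuned + alpha * (large_frozen - small_frozen)

def vanilla_loss(small_tuned, labels):
    # Vanilla proxy-tuning (as described in the abstract): the small
    # model is trained alone; the offset is applied only at decoding.
    return cross_entropy(small_tuned, labels)

def consistent_loss(large_frozen, small_tuned, small_frozen, labels, alpha_train):
    # Assumed CPT-style objective: the loss is computed on the same
    # ensemble that is used at test time.
    combined = ensemble_logits(large_frozen, small_tuned, small_frozen, alpha_train)
    return cross_entropy(combined, labels)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, vocab = 4, 5
    large = rng.normal(size=(batch, vocab))         # frozen black-box logits (toy)
    small_frozen = rng.normal(size=(batch, vocab))  # frozen small-model logits (toy)
    small_tuned = small_frozen + 0.1 * rng.normal(size=(batch, vocab))
    labels = rng.integers(0, vocab, size=batch)

    print("vanilla training loss:   ", vanilla_loss(small_tuned, labels))
    print("consistent training loss:", consistent_loss(
        large, small_tuned, small_frozen, labels, alpha_train=1.0))
```

In a real setup only the tuned small model would receive gradients; the two frozen models contribute fixed logits, which is why the black-box model must be queried for each training example.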
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
1. The proposed method is elegant, simple yet effective and beneficial to practitioners. 2. The presentation is clean and easy to follow. 3. The proposed method is extended to VL models to further demonstrate its effectiveness.
1. Lack of analysis over why proxy tuning works: This is a big ask, but I hope the authors can give some intuitions here. In addition, if the underlying assumption is that both the small and base models can access the output logits for the entire vocabulary, then how is the inconsistency defined? Can you elaborate more? How about in CLIP's case? 2. Increased actual cost: Although not requiring more computation during training, there is a need to compute logits using the black box model for each trai
1. The paper is well-written, with clear logic that is easy to follow. --- 2. The proposed method is clear, well-motivated, and demonstrates simplicity while being effective. --- 3. The method uses the output logits of the large model as a reference. By incorporating these logits into the loss function of the small model, the training objective becomes closer to the ensemble effect used during inference. This acts as an implicit regularization on the small model during training, ensuring its o
1. Although the method is effective, it requires significant computational resources, especially during training when additional model outputs are generated. This could limit its practicality for larger-scale models without high-resource infrastructure. --- 2. The impact of parameters like $\alpha_{\text{train}} $ and $ \alpha_{\text{test}}$ on performance, shown in Figure 2, indicates that careful tuning is necessary, which may complicate implementation. --- 3. The motivation states, “Such in
1. The paper is easy to follow. 2. The proposed method is straightforward and easy to implement. 3. Extensive experiments are performed with language models and vision-language models. Actual training time is also reported.
1. The novelty of the proposed method is limited. It just directly optimizes the small proxy model in proxy-tuning to achieve more accuracy, which is a trivial remedy and even hurts some advantages of proxy-tuning, i.e., it introduces additional memory/computational cost and requires the availability/accessibility to the black-box model during optimization. 2. The gains of consistent proxy tuning in accuracy seem too little against the vanilla proxy-tuning in most cases, relative to zero-shot an
* The considered setting of black-box tuning is important and is of interest to a broad community given that currently the most capable models are closed-weight models.
* The methodological contribution is a very minor addition to the original method [1]. The obtained quantitative results also demonstrate minor improvements upon the original Proxy Tuning.
* A large part of the experiments for vision tasks is done for CLIP-like models. The considered setting of black-box fine-tuning of CLIP-like models is not well-justified since to the best of my knowledge there are no closed-source CLIP-like VLMs. I strongly encourage authors to consider the setting that is close
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Digital Filter Design and Implementation · Embedded Systems Design Techniques
Methods: Softmax · Attention Is All You Need
