Model Diffusion for Certifiable Few-shot Transfer Learning
Fady Rezk, Royson Lee, Henry Gouk, Timothy Hospedales, Minyoung Kim

TL;DR
This paper introduces a transfer learning method that uses a diffusion model to generate a finite set of candidate parameter-efficient fine-tuning (PEFT) parameters, enabling certifiable generalization guarantees in low-data scenarios.
Contribution
It develops a diffusion-based approach for transfer learning that provides non-vacuous theoretical generalization guarantees in low-shot settings, which standard PEFT fine-tuning does not offer.
Findings
Provides tighter risk bounds compared to existing approaches.
Demonstrates non-trivial generalization guarantees in low-shot transfer learning.
Uses a finite set of PEFT samples for certifiable learning.
Abstract
In contemporary deep learning, a prevalent and effective workflow for solving low-data problems is adapting powerful pre-trained foundation models (FMs) to new tasks via parameter-efficient fine-tuning (PEFT). However, while empirically effective, the resulting solutions lack generalisation guarantees to certify their accuracy - which may be required for ethical or legal reasons prior to deployment in high-importance applications. In this paper we develop a novel transfer learning approach that is designed to facilitate non-vacuous learning theoretic generalisation guarantees for downstream tasks, even in the low-shot regime. Specifically, we first use upstream tasks to train a distribution over PEFT parameters. We then learn the downstream task by a sample-and-evaluate procedure -- sampling plausible PEFTs from the trained diffusion model and selecting the one with the highest…
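The sample-and-evaluate idea described in the abstract can be illustrated with a minimal sketch. It assumes a classical finite-hypothesis bound (Hoeffding plus a union bound) for a loss in [0, 1]; the paper's actual bound may be different and tighter. The names `finite_hypothesis_bound`, `sample_and_evaluate`, and `zero_one_loss` are hypothetical, and the Gaussian sampler in the demo merely stands in for a trained diffusion model over PEFT parameters. The guarantee relies on the candidates being drawn without looking at the downstream support set.

```python
import math
import random


def finite_hypothesis_bound(emp_risk, n, k, delta=0.05):
    """Upper bound on the true risk for a loss in [0, 1].

    Holds simultaneously for all k candidates with probability >= 1 - delta
    (Hoeffding's inequality plus a union bound). This is an assumed,
    textbook bound, not necessarily the one used in the paper.
    """
    return emp_risk + math.sqrt((math.log(k) + math.log(1.0 / delta)) / (2.0 * n))


def sample_and_evaluate(candidates, support, loss_fn, delta=0.05):
    """Pick the candidate with the lowest empirical risk and certify it.

    `candidates` must be generated independently of `support`, so the
    hypothesis set is fixed before the downstream data is seen.
    """
    n, k = len(support), len(candidates)
    risks = [sum(loss_fn(theta, x, y) for x, y in support) / n for theta in candidates]
    best = min(range(k), key=risks.__getitem__)
    bound = finite_hypothesis_bound(risks[best], n, k, delta)
    return candidates[best], risks[best], bound


def zero_one_loss(theta, x, y):
    """Toy 0-1 loss for a threshold classifier that predicts x > theta."""
    return int((x > theta) != bool(y))


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for sampling PEFT parameters from a trained diffusion model.
    candidates = [rng.gauss(0.5, 0.2) for _ in range(100)]
    # Toy downstream task: label is 1 when x > 0.5.
    support = [(x, int(x > 0.5)) for x in (rng.random() for _ in range(50))]
    theta, risk, bound = sample_and_evaluate(candidates, support, zero_one_loss)
    print(f"theta={theta:.3f} empirical_risk={risk:.3f} bound={bound:.3f}")
```

Under this assumed bound, the complexity term depends only on the number of candidates `k`, the support size `n`, and `delta`, so selecting the best candidate does not loosen the certificate. With 100 candidates and 50 support examples the slack is roughly 0.28; with very few examples the bound can exceed 1 and become vacuous.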
Peer Reviews
Decision: Submitted to ICLR 2026
The paper is well written and straightforward to follow, and STEEL itself is sensible: it uses a diffusion model as a weight-candidate generator for test-time PEFT in order to provide tighter (and actual) generalization bounds. The combination of a PEFT hypernetwork and generalization bounds is, to the best of my knowledge, novel and sensible, with convincing results in 5.1 and 5.2 on both LLM and vision-model adaptation.
I do think this paper provides a very interesting use of PEFT hypernetworks for generalization bounds, which to me comes with one major question / issue: The whole approach hinges on L. 193: "We expect that Θ is rich enough to represent the true task distribution ptrue(T) faithfully, and the adapted (“selected”) θ will generalize well on unseen samples from T", which is a very, very strong statement to make. By default, STEEL is likely much more limited when it comes to adapting to larger distr
- A neat combination of weight-space generative modeling (DDPM) to form a finite hypothesis set and evaluate-then-select to keep the complexity term fixed while minimizing empirical risk, yielding non-vacuous certificates in few-shot regimes.
- Experimental breadth & rigor. Evaluations span multiple LaMP tasks and vision datasets under standard meta-learning protocols, reporting not only accuracy but also % non-vacuous, bound statistics, and gap.
- The paper visualizes how certification varies wi
(W1) The paper fixes LoRA-XS (~2.6K params) and CoOp (1,024-dim prompt) without varying LoRA rank or token count. Given both diffusion learnability and certification can depend on $\dim(\theta)$, a $\theta$-size sweep would strengthen the claims.
(W2) Fairness to model-zoo under matched search. Hierarchical search is an inference strategy, not unique to STEEL. A controlled comparison where model-zoo also uses the same k-means/medoid + top-15 pipeline would isolate the benefit of diffusion sampl
The paper provides some theoretical analysis, and the experiments indicate improved bounds and non-trivial generalization guarantees.
1. The title in the submitted PDF differs from the one on OpenReview, which raises concerns.
2. The main contribution appears to be learning a parameter diffusion model to generate PEFTs according to the task distribution. However, the architecture and training of the diffusion model are unclear. Section 3.3 discusses its role only at a high level, stating it is trained on PEFT parameters {θi}, but omits key architectural details and training objectives. The paper seems to assume readers are alr
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning
Methods: Diffusion · Sparse Evolutionary Training
