Task Priors: Enhancing Model Evaluation by Considering the Entire Space of Downstream Tasks
Niket Patel, Randall Balestriero

TL;DR
This paper introduces a probabilistic framework called Task Priors for evaluating AI models across the entire space of possible downstream tasks, moving beyond fixed benchmark sets to provide a more comprehensive performance assessment.
Contribution
It proposes a novel evaluation method using Task Priors that considers all possible downstream tasks, addressing limitations of fixed benchmarks in AI research.
Findings
First framework to evaluate models over all possible tasks with defined probabilities.
Enables calculation of average performance and performance variance across tasks.
Aims to set a new standard for model evaluation in SSL.
Abstract
The grand goal of AI research, and particularly Self-Supervised Learning (SSL), is to produce systems that can successfully solve any possible task. In contrast, current evaluation methods available to AI researchers typically rely on a fixed collection of hand-picked downstream benchmarks. Hence, a large amount of effort is put into designing and searching for a large collection of evaluation tasks that can serve as a proxy for our grand goal. We argue that such a rigid evaluation protocol creates a silent bottleneck in AI research. To remedy that, we define a probabilistic space of downstream tasks obtained by adopting a distribution of tasks and by defining Task Priors. Under this view, one can evaluate a model's performance over the set of all possible downstream tasks. Our framework is the first to provide answers to key questions such as (i) what is the average performance of my…
Peer Reviews
Decision: Submitted to ICLR 2026
The proposed method connects supervised, self-supervised, and kernel-alignment evaluations under a single theoretical lens via Theorem 2.3. The authors provide efficient methods for evaluating the mean and variance of downstream performance with respect to the proposed task prior. The paper provides good empirical coverage of experiments on diverse architectures (CLIP, SigLIP, BLIP, DinoV2). The experiments show a strong correlation between the calculated metric and the linear probe, and with
The core mathematics (e.g., Gibbs measure, kernel alignment, HSIC) builds directly on established theory, albeit with a fresh interpretation for model evaluation. The effect of using different temperatures or selecting the optimal temperature is not described in detail. It would be great to provide a sequence of assumptions that motivate the effectiveness of using a Gibbs distribution to define the task prior. Currently, the task prior is very heuristic, as it is motivated by a metric that shoul
1. Significance: This work, if the clarity issues mentioned below are resolved, can have a broad impact on representation learning, so that the standard linear probing evaluation protocol becomes unnecessary.
2. Originality is hard to evaluate due to limited domain expertise; see clarity and quality concerns below.
1. Motivation: I wonder why training linear probes for evaluation is a computational bottleneck that we have to avoid, since linear probes are quite cheap to train (compared to the time needed to build the representation) and are often trained in a few-shot manner.
2. There are a few clarity issues about the methodology that make it hard to evaluate the quality of this paper; see questions below.
- The paper redefines model evaluation as a probabilistic process over all possible tasks, providing a fresh and mathematically principled alternative to the static-benchmark paradigm.
- Strong theory that establishes a connection between supervised and self-supervised objectives via kernel alignment and trace formulations.
- The framework unifies representation-based and task-based evaluation, potentially becoming a standard for fair model comparison.
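To make the "kernel alignment and trace formulations" mentioned above concrete, here is a minimal sketch of the standard centered kernel alignment between a feature kernel and a same-class label kernel. It is not taken from the paper, and the quantity the authors actually use may differ. The toy features, the helper names `linear_kernel` and `centered_alignment`, and the `normalize` flag are all assumptions made for illustration.

```python
import numpy as np

def linear_kernel(features, normalize=False):
    """Gram matrix X X^T; optionally L2-normalize each row first (illustrative flag)."""
    X = np.asarray(features, dtype=float)
    if normalize:
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

def centered_alignment(K, L):
    """Centered kernel alignment: tr(Kc Lc) / (||Kc||_F ||Lc||_F)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Kc = H @ K @ H
    Lc = H @ L @ H
    # For symmetric matrices, sum(Kc * Lc) equals trace(Kc @ Lc).
    return np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc))

# Toy data (hypothetical): two classes, class-dependent direction, very different norms.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 1, 1])
directions = np.where(labels[:, None] == 0, [1.0, 0.0], [0.0, 1.0])
scales = np.array([1.0, 5.0, 0.2, 1.0, 5.0, 0.2])[:, None]
features = scales * (directions + 0.05 * rng.standard_normal((6, 2)))

# Label kernel: 1 where two samples share a class, 0 otherwise.
label_kernel = (labels[:, None] == labels[None, :]).astype(float)

raw_alignment = centered_alignment(linear_kernel(features), label_kernel)
normalized_alignment = centered_alignment(
    linear_kernel(features, normalize=True), label_kernel
)
print("raw features:       ", raw_alignment)
print("normalized features:", normalized_alignment)
```

On this toy data the score depends noticeably on whether features are normalized, which is one reason such preprocessing choices are worth reporting alongside any alignment-based metric.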
- Although the method is domain-agnostic in principle, experiments and formulations are focused solely on classification; it’s unclear how Task Priors extend to retrieval, regression, or generative tasks.
- The framework heavily relies on the definition of kernel similarity. Sensitivity to kernel type, temperature $T$, and feature normalization could affect reliability.
- No guarantee is provided that Task Priors predict performance in unseen domains.
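The concern about temperature sensitivity can be illustrated with a generic Gibbs distribution over a small, finite set of candidate tasks. This is only a sketch under stated assumptions: the scores are made-up numbers standing in for whatever task-level quantity the prior is built from, the sign convention (higher score means higher probability) is chosen for convenience, and `gibbs_probabilities` is a hypothetical helper, not the paper's code.

```python
import numpy as np

def gibbs_probabilities(scores, temperature):
    """Gibbs weights p_i proportional to exp(score_i / T) over a finite candidate set."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = np.asarray(scores, dtype=float) / temperature
    logits = logits - logits.max()   # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical scores for four candidate tasks (illustrative values only).
scores = [2.0, 1.0, 0.5, 0.0]

p_low = gibbs_probabilities(scores, temperature=0.1)    # sharply peaked
p_mid = gibbs_probabilities(scores, temperature=1.0)
p_high = gibbs_probabilities(scores, temperature=10.0)  # close to uniform

for name, p in [("T=0.1", p_low), ("T=1.0", p_mid), ("T=10", p_high)]:
    print(name, np.round(p, 3))
```

A low temperature concentrates nearly all probability mass on the highest-scoring task, while a high temperature approaches a uniform prior, so any mean or variance computed under such a prior inherits this dependence on $T$.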
Taxonomy
Topics: Simulation Techniques and Applications · Reinforcement Learning in Robotics
