Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency
Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis, Nikos Komodakis, Konstantinos Karantzalos, Yannis Avrithis, Giorgos Tolias

TL;DR
This paper introduces efficient probing (EP), a lightweight multi-query cross-attention method that improves model evaluation by balancing accuracy and parameter efficiency, outperforming existing attentive probing techniques.
Contribution
The paper provides the first comprehensive analysis of attentive probing methods and proposes EP, a novel, efficient attention mechanism that enhances probing performance while reducing complexity.
Findings
EP outperforms linear and previous attentive probing methods across benchmarks.
EP maintains effectiveness when combined with parameter-efficient fine-tuning.
Analysis reveals complementary attention maps, suggesting new probing applications.
Abstract
As fine-tuning becomes impractical at scale, probing is emerging as the preferred evaluation protocol. However, standard linear probing can understate the capability of models whose pre-training optimizes local representations rather than an explicit global representation. This motivates attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite growing adoption, attentive probing is still underexplored: existing approaches are often over-parameterized and computationally inefficient. In this work, we revisit attentive probing through the lens of the accuracy vs. parameter-efficiency trade-off. We present the first comprehensive study of existing methods, analyzing their design choices and benchmarking their performance. Building on these insights, we propose efficient probing (EP), a lightweight yet effective multi-query cross-attention…
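To make the idea of "multi-query cross-attention" over patch-level features concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a set of learnable query vectors that attend directly to frozen patch features (no key projection, consistent with the reviewers' note that EP lacks $W_K$), and a value projection whose output is split evenly across queries and concatenated. The function name `multi_query_pool`, the `1/sqrt(D)` scaling, and the way the value projection is split are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_query_pool(patches, queries, w_v):
    """Pool patch features with several learnable queries (illustrative sketch).

    patches: (N, D) frozen patch-level features from the backbone
    queries: (M, D) learnable query vectors (hypothetical parameterization)
    w_v:     (D, D_out) value projection, D_out divisible by M (assumption)
    Returns the pooled descriptor (D_out,) and the attention maps (N, M).
    """
    n, d = patches.shape
    m = queries.shape[0]
    d_out = w_v.shape[1]
    if d_out % m != 0:
        raise ValueError("D_out must be divisible by the number of queries")

    # Keys are the raw patch features: no key projection is applied here.
    # Each column is one query's attention distribution over the N patches.
    attn = softmax(patches @ queries.T / np.sqrt(d), axis=0)  # (N, M)

    # Project values once, then give each query its own slice of channels.
    values = (patches @ w_v).reshape(n, m, d_out // m)  # (N, M, D_out/M)

    # Weighted sum over patches, separately for every query.
    pooled = np.einsum("nm,nmc->mc", attn, values)  # (M, D_out/M)

    # Concatenate the per-query outputs into one global descriptor.
    return pooled.reshape(d_out), attn


rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 64))   # e.g. 14x14 patches, 64-dim features
queries = rng.normal(size=(4, 64))     # 4 learnable queries
w_v = rng.normal(size=(64, 32))        # value projection to 32 dims
descriptor, attn = multi_query_pool(patches, queries, w_v)
```

In a probing setup, `descriptor` would feed a linear classifier while the backbone that produced `patches` stays frozen; only `queries`, `w_v`, and the classifier would be trained. Each column of `attn` can be reshaped to the patch grid to inspect which regions a given query attends to.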
Peer Reviews
Decision: ICLR 2026 Poster
1. The proposed method, EP, is a "simple multi-query cross-attention mechanism" that is derived by "eliminat[ing] redundant projections" from standard MHCA. This simplification is well-illustrated in Figure 1. 2. This work provides the "first comprehensive study" of attentive probing methods. The evaluation is extensive, covering "diverse pre-training paradigms" (MIM, JEA, VLMs, generative) and multiple datasets. 3. EP consistently achieves a state-of-the-art "accuracy-parameter trade-off",
1. **Placement of Key Ablations**: The justification for EP's specific design is partially buried in Section 4.3. For instance, EP's design is "transformation-free", notably lacking the $W_K$ projection. The empirical justification for this comes from an ablation showing that while removing $W_K$ from multi-head AIM hurts performance ($75.1\% \rightarrow 72.9\%$), EP's design performs well without it. This key comparison, which validates EP's specific design choice, should be more central to the
1. The biggest strength, in my view, is the systematic benchmark. The field of attentive probing has been a bit all over the place, with different papers (AIM, V-JEPA, etc.) all proposing their own one-off solutions. 2. EP is a nice piece of engineering. It's not flashy, but it's simple, well-motivated (why have redundant projections?), and it just plain works. It consistently lands on the Pareto frontier for both parameter count and GFLOPs, which is exactly what you want from something called
1. While EP is effective, its novelty is a bit thin. The core idea is essentially an ablation study on Multi-Head Cross-Attention (MHCA), 2. The paper frames this entire problem as "probing for evaluation." But what they're doing—freezing the backbone and training a few extra parameters—is exactly what Parameter-Efficient Fine-Tuning (PEFT) is. The authors even acknowledge EP fits into the PEFT family in Appendix A.2. So, why is there no comparison to mainstream PEFT methods like LoRA, Adapters,
1. The paper correctly identifies and addresses the misalignment between standard LP and modern pre-training paradigms (MIM, auto-regressive, diffusion) where discriminative information is distributed across patch tokens. Attentive probing is established as the necessary alternative. 2. This is presented as the "first comprehensive study" of attentive probing, offering a unified framework that categorizes existing methods (including those from unrelated tasks) and thoroughly benchmarks their p
1. The paper positions its contribution against a backdrop where attentive probing is described as "underexplored" and existing methods "suffer from excessive parameterization and poor computational efficiency." While EP solves these issues, the baseline comparison set may be inherently weak due to the newness of the protocol, potentially overstating the relative accuracy gain compared to a hypothetical future efficient method. 2. The motivation for removing the key transformation ($W_{Kj}$) in
Taxonomy
Topics: Digital Games and Media
Methods: Softmax · Attention Is All You Need · Mutual Information Machine/Mask Image Modeling
