TL;DR
A-TPT introduces angular diversity in test-time prompt tuning for vision-language models, promoting uniformity in textual features to improve calibration and reliability across diverse datasets and tasks.
Contribution
It proposes a novel angular diversity framework that maximizes the minimum pairwise angular distance between class-wise textual features to improve calibration in TPT of VLMs, addressing the limitations of earlier dispersion-based and orthogonality-based methods.
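To make the objective concrete, here is a minimal NumPy sketch of a "maximize the minimum pairwise angle" penalty. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `eps` clipping, and the choice to return the negated minimum angle as a penalty are all hypothetical, and in practice the term would be added (scaled by a weight such as λ) to the TPT objective inside an autodiff framework.

```python
import numpy as np


def min_pairwise_angle(features: np.ndarray, eps: float = 1e-7) -> float:
    """Smallest angle (in radians) between any two distinct rows of `features`.

    `features` has shape (num_classes, dim); rows need not be normalized.
    """
    # Project every class feature onto the unit sphere.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Cosine of every pairwise angle; clip to keep arccos numerically safe.
    cos = np.clip(unit @ unit.T, -1.0 + eps, 1.0 - eps)
    angles = np.arccos(cos)
    # Ignore the diagonal (each vector compared with itself).
    np.fill_diagonal(angles, np.inf)
    return float(angles.min())


def angular_diversity_penalty(features: np.ndarray) -> float:
    """Hypothetical penalty: minimizing it maximizes the minimum pairwise angle."""
    return -min_pairwise_angle(features)


# Three mutually orthogonal class features: the closest pair is 90 degrees apart.
orthogonal = np.eye(3)
# Two nearly collapsed features: the closest pair is only 45 degrees apart.
crowded = np.array([[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])
print(min_pairwise_angle(orthogonal), min_pairwise_angle(crowded))
```

Focusing on the minimum rather than the average angle means the penalty is driven by the closest pair of class features, which is the failure case that an average-dispersion objective can leave unaddressed.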
Findings
Outperforms state-of-the-art TPT methods in calibration error reduction.
Maintains comparable accuracy while improving calibration.
Shows strong zero-shot calibration on natural shifts and medical datasets.
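The calibration error these findings refer to is usually reported as expected calibration error (ECE). The sketch below shows the standard binned estimate; the function name, the equal-width binning, and the default of 10 bins are assumptions for illustration and may differ from the paper's evaluation setup.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned ECE: weighted average of |accuracy - confidence| over confidence bins.

    `confidences` holds the predicted probability of the chosen class and
    `correct` holds 1 where the prediction was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b covers (edges[b], edges[b + 1]]; a confidence of 0 falls in bin 0.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)


# Overconfident model: 90% confidence but only 75% accuracy gives an ECE near 0.15.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A method can leave accuracy unchanged and still lower this number, because ECE only measures how closely confidence tracks accuracy.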
Abstract
Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding…
Peer Reviews
Decision·ICLR 2026 Poster
- Rather than optimizing L2 distance or cosine similarity, the paper optimizes the **angle itself**, which better captures geometric separation on the unit sphere and compensates for the shortcomings of previous work. The paper demonstrates the limitations of previous work well.
- The paper includes extensive analyses that illuminate the method’s behavior from multiple perspectives, aiding interpretation and practical use.
- It explicitly examines the calibration differences between N > |D| and N ≤ |D|
- When we increase λ, we understand this as trading some accuracy for improved ECE (better calibration). This trend aligns with Flowers102, but Food101 shows a contrasting pattern. Could you provide insight into why the two datasets behave differently? Also, are these curves averaged over multiple seeds, and how large is the variance across runs?
- How did you choose the λ term?
- In the main performance table, could you report results separately or make them explicitly distinguishable for the N>
* The paper clearly identifies the shortcomings of prior text-feature dispersion approaches and uses Figure 2 to illustrate them effectively.
* The reported ECE gains over baselines such as C-TPT and O-TPT are also encouraging.
* The authors show that the method works not only on standard benchmarks used to evaluate CLIP performance, but also on 'calibration critical applications' such as the medical domain in Table 4.
* Although the paper proposes angular diversity regularization as a new metric, the method still operates within the existing C-TPT and O-TPT test-time adaptation paradigm, so the contribution feels more incremental than fundamentally novel in terms of theory or technique.
* Could the proposed method be complementary to previous methods (e.g., C-TPT or O-TPT)? That is, could we for example enforce the proposed angular diversity on top of textual dispersion proposed by C-TPT or the orthogonalit
Derivations are interesting and clear. Equations and derivations seem correct. The point about \arccos normalizing gradient magnitudes is quite interesting. I find multi-letter variable names aesthetically displeasing in general, but the use of "Cos" as a variable name does not impair legibility or correctness in this case. Results show significant consistent reduction in calibration error, with small and inconsistent changes in accuracy, across 15 datasets, in comparison to TPT, C-TPT, and
Minor: Fig. 3 clearly shows that the prompts with the highest ECE ("the nearest shape in this image is" and TPT) are clustered in the center, while other prompts are distributed. This does not show, however, that the prompts with high ECE have low angular diversity, because t-SNE does not show the angles of vectors: it only shows their cluster structure.
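One way to read the reviewer's remark about \arccos normalizing gradient magnitudes is the following short derivation. It is a sketch under assumed notation that does not come from the source (unit vectors u and v, c = u^T v, θ = arccos(c), and P_u = I - u u^T the projection onto the tangent space of the sphere at u), and it may not match the paper's exact argument.

```latex
% Assumed notation: \|u\| = \|v\| = 1, \; c = u^\top v, \; \theta = \arccos(c), \; P_u = I - u u^\top,
% with 0 < \theta < \pi so that \sqrt{1 - c^2} = \sin\theta > 0.
\begin{align*}
  \text{cosine objective:} \quad
    & P_u \nabla_u c = v - c\,u,
    & \lVert P_u \nabla_u c \rVert &= \sqrt{1 - c^2} = \sin\theta, \\
  \text{angle objective:} \quad
    & P_u \nabla_u \theta = -\frac{v - c\,u}{\sqrt{1 - c^2}},
    & \lVert P_u \nabla_u \theta \rVert &= 1 .
\end{align*}
```

Under these assumptions, the gradient of the cosine shrinks toward zero as two features approach each other (θ → 0), which is exactly when separation is most needed, while the gradient of the angle keeps unit magnitude for any 0 < θ < π.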
