Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction
Aditya Sarkar, Yi Li, Jiacheng Cheng, Shlok Mishra, Nuno Vasconcelos

TL;DR
This paper introduces a memory-augmented, training-free selective prediction method for visual language models that improves confidence calibration and stability, applicable across various open and closed set tasks.
Contribution
It proposes MA-PaPSP, a novel memory-augmented approach that enhances plug-and-play selective prediction for foundation models by reducing embedding variance and calibrating similarity scores.
Findings
MA-PaPSP outperforms baseline methods in multiple tasks
Memory augmentation reduces embedding variance
Contrastive normalization improves score calibration
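The last two findings can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The function names (`memory_augment`, `contrastive_score`), the choice to average a query embedding with its k nearest memory embeddings, the softmax over negative texts, and the `temperature` value are all assumptions made for illustration. Plain vectors stand in for CLIP-style embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def memory_augment(query, memory, k=8):
    """Assumed form of memory augmentation: average the query with its
    k nearest memory embeddings (by cosine similarity) to get a proxy
    embedding that is less sensitive to noise in any single embedding."""
    q = l2_normalize(query)
    m = l2_normalize(memory)
    sims = m @ q                       # cosine similarity to every memory entry
    top = np.argsort(-sims)[:k]        # indices of the k most similar entries
    proxy = np.vstack([q[None, :], m[top]]).mean(axis=0)
    return l2_normalize(proxy)


def contrastive_score(image_emb, text_emb, negative_text_embs, temperature=0.05):
    """Assumed form of contrastive normalization: the softmax probability of
    the predicted text against a set of negative texts, which maps a raw
    similarity to a value in (0, 1)."""
    candidates = np.vstack([text_emb[None, :], negative_text_embs])
    logits = l2_normalize(candidates) @ l2_normalize(image_emb) / temperature
    logits -= logits.max()             # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[0])
```

In this sketch the score of a prediction depends on how it compares with alternatives rather than on its raw cosine similarity alone, which is one common way to make similarity scores easier to threshold.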
Abstract
Selective prediction aims to endow predictors with a reject option, to avoid low-confidence predictions. However, existing literature has primarily focused on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for visual language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek low-complexity, training-free approaches applicable to any foundation model, and consider methods based on external vision-language model embeddings, like CLIP. This is denoted as Plug-and-Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual-language representations, leading to high variance in image-text embeddings, and (2) poor calibration of similarity scores.…
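The reject option described in the abstract, and the risk/coverage trade-off it induces, can be sketched as follows. This is the standard threshold-based formulation of selective prediction, not code from the paper. The names `selective_predict` and `risk_coverage` are hypothetical, and the confidences are assumed to come from some external scorer such as an image-text similarity.

```python
import numpy as np


def selective_predict(predictions, confidences, threshold):
    """Keep a prediction when its confidence reaches the threshold,
    otherwise reject it (represented here by None)."""
    return [p if c >= threshold else None
            for p, c in zip(predictions, confidences)]


def risk_coverage(confidences, correct):
    """Sweep the threshold over the sorted confidences and return, for each
    cut-off, the coverage (fraction of inputs answered) and the risk
    (error rate among the answered inputs)."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    order = np.argsort(-conf, kind="stable")   # most confident first
    errors = 1.0 - corr[order]
    answered = np.arange(1, len(conf) + 1)
    coverage = answered / len(conf)
    risk = np.cumsum(errors) / answered
    return coverage, risk
```

A well-calibrated confidence score yields low risk at low coverage, with risk rising only as less confident predictions are admitted.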
Peer Reviews
Decision: ICLR 2026 Poster
- The focus on creating a training-free, plug-and-play solution that is applicable to open-set tasks like captioning is a valuable direction for improving the safety and reliability of modern VLMs.
- While the constituent components (selective prediction, memory augmentation, contrastive scoring) are not new in isolation, their combination to address open-set SP for VLMs in a training-free manner is a simple and well-motivated approach.
- The proposed method is technically sound and is validated
1. **Omission of Inference Cost Analysis:** The claim of being "light-weight" is not substantiated with empirical evidence. The method introduces a potentially computationally expensive k-NN search over a large-scale dataset (e.g., CC12M) and the storage requirements for the embeddings. The paper lacks analysis of latency, throughput, or memory footprint, which is a critical omission for a method proposed for practical application.
2. **Methodological Limitations and Dependencies:**
   1. **Per
1. Problem Formulation and Framework: The formulation of the Plug-and-Play Selective Prediction (PaPSP) problem for a taxonomy of VLM tasks, especially open-set scenarios like captioning, is timely and meaningful.
2. Thorough Analysis: The paper provides a comprehensive analysis that successfully identifies and validates the core challenges of representation instability and score miscalibration in baseline methods.
3. Strong Generalization: The proposed MA-PaPSP method demonstrates impressive ge
1. The paper's claim of being a "lightweight" solution is somewhat challenged by its reliance on a massive external memory bank (e.g., 15M image-text pairs). The computational overhead and storage cost of performing nearest-neighbor retrieval from such a large database during inference should be more thoroughly discussed. This overhead could impact the method's practical deployment in latency-sensitive or resource-constrained environments, and a comparison of inference time against baselines lik
The authors address the selective prediction problem, which aims to balance risk and coverage. This problem has been relatively underexplored in the context of vision-language models, particularly for open-set tasks such as captioning. They propose a novel formulation that constructs neighborhood proxies using retrieval datasets, independent of the input or output modality. In addition, they present a unified framework that can be applied across different types of VLM tasks. Through experiment
The readability of this paper is unsatisfactory. Figure 1 is quite confusing; while it seems intended to illustrate the differences between classification and captioning in the proposed framework, the lines and arrows instead create misunderstanding. Figure 2 is difficult to interpret and contributes little to improving comprehension of the paper. Figure 4 is quite complex and lacks clear organization. The text in Figure 5 is too small, and the purpose of the figure itself is not clear enough.
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
