Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction

Aditya Sarkar; Yi Li; Jiacheng Cheng; Shlok Mishra; Nuno Vasconcelos

arXiv:2601.22570·cs.CV·February 2, 2026

Open Access 3 Reviews

TL;DR

This paper introduces a memory-augmented, training-free selective prediction method for visual language models that improves confidence calibration and stability, applicable across various open and closed set tasks.

Contribution

It proposes MA-PaPSP, a novel memory-augmented approach that enhances plug-and-play selective prediction for foundation models by reducing embedding variance and calibrating similarity scores.
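The summary above names two ingredients, memory augmentation to reduce embedding variance and calibration of similarity scores, without giving details. The sketch below is only an illustration of those two ideas under stated assumptions, not the paper's algorithm. It assumes that the memory is a list of precomputed embeddings, that a lower-variance proxy is formed by averaging the `k` nearest memory entries, and that calibration is a softmax of one similarity against a set of negative similarities. The helper names (`knn_proxy`, `contrastive_score`) and the `temperature` parameter are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def knn_proxy(query: Vector, memory: Sequence[Vector], k: int) -> List[float]:
    """Assumed proxy: unit-norm mean of the k memory embeddings closest to the query."""
    q = normalize(query)
    # Rank memory entries by cosine similarity to the query (brute force).
    ranked = sorted(memory, key=lambda m: dot(q, normalize(m)), reverse=True)
    top = [normalize(m) for m in ranked[:k]]
    mean = [sum(col) / len(top) for col in zip(*top)]
    return normalize(mean)


def contrastive_score(sim_pos: float, sims_neg: Sequence[float],
                      temperature: float = 0.1) -> float:
    """Assumed calibration: softmax weight of sim_pos against negative similarities."""
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    return exps[0] / sum(exps)
```

The brute-force ranking in `knn_proxy` is linear in the memory size. At the scale the reviewers mention, an approximate nearest-neighbor index would presumably replace it.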

Findings

1. MA-PaPSP outperforms baseline methods in multiple tasks
2. Memory augmentation reduces embedding variance
3. Contrastive normalization improves score calibration

Abstract

Selective prediction aims to endow predictors with a reject option, to avoid low-confidence predictions. However, existing literature has primarily focused on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for visual language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek low-complexity, training-free approaches applicable to any foundation model, and consider methods based on external vision-language model embeddings, like CLIP. This is denoted as Plug-and-Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual-language representations, leading to high variance in image-text embeddings, and (2) poor calibration of similarity scores.…
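As a minimal sketch of the reject option described in the abstract, the code below treats the cosine similarity between an image embedding and the embedding of the predicted text as a confidence score and abstains when it falls below a threshold. The embeddings are assumed to come from an external encoder such as CLIP and are passed in as plain lists. The function names, the threshold value, and the toy vectors are all hypothetical, and this baseline does not include the memory augmentation the paper proposes.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def selective_predict(image_emb: Sequence[float], text_emb: Sequence[float],
                      prediction: str, threshold: float) -> Optional[str]:
    """Return the prediction if its confidence clears the threshold, else None (reject)."""
    confidence = cosine(image_emb, text_emb)
    return prediction if confidence >= threshold else None


def risk_and_coverage(confidences: Sequence[float], correct: Sequence[bool],
                      threshold: float) -> Tuple[float, float]:
    """Coverage is the fraction accepted; risk is the error rate among accepted items."""
    accepted: List[bool] = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    coverage = len(accepted) / len(confidences)
    risk = 0.0 if not accepted else 1.0 - sum(accepted) / len(accepted)
    return risk, coverage


# Toy usage with made-up 2-D embeddings.
print(selective_predict([1.0, 0.0], [1.0, 0.0], "a dog", threshold=0.5))  # accepted
print(selective_predict([1.0, 0.0], [0.0, 1.0], "a dog", threshold=0.5))  # rejected
```

Sweeping `threshold` and recording the output of `risk_and_coverage` at each value traces a risk-coverage curve, the usual way to compare selective predictors.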

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The focus on creating a training-free, plug-and-play solution that is applicable to open-set tasks like captioning is a valuable direction for improving the safety and reliability of modern VLMs.
- While the constituent components (selective prediction, memory augmentation, contrastive scoring) are not new in isolation, their combination to address open-set SP for VLMs in a training-free manner is a simple and well-motivated approach.
- The proposed method is technically sound and is validated

Weaknesses

1. **Omission of Inference Cost Analysis:** The claim of being "light-weight" is not substantiated with empirical evidence. The method introduces a potentially computationally expensive k-NN search over a large-scale dataset (e.g., CC12M) and the storage requirements for the embeddings. The paper lacks analysis of latency, throughput, or memory footprint, which is a critical omission for a method proposed for practical application.
2. **Methodological Limitations and Dependencies:**
   1. **Per

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Problem Formulation and Framework: The formulation of the Plug-and-Play Selective Prediction (PaPSP) problem for a taxonomy of VLM tasks, especially open-set scenarios like captioning, is timely and meaningful.
2. Thorough Analysis: The paper provides a comprehensive analysis that successfully identifies and validates the core challenges of representation instability and score miscalibration in baseline methods.
3. Strong Generalization: The proposed MA-PaPSP method demonstrates impressive ge

Weaknesses

1. The paper's claim of being a "lightweight" solution is somewhat challenged by its reliance on a massive external memory bank (e.g., 15M image-text pairs). The computational overhead and storage cost of performing nearest-neighbor retrieval from such a large database during inference should be more thoroughly discussed. This overhead could impact the method's practical deployment in latency-sensitive or resource-constrained environments, and a comparison of inference time against baselines lik

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The authors address the selective prediction problem, which aims to balance risk and coverage. This problem has been relatively underexplored in the context of vision-language models, particularly for open-set tasks such as captioning. They propose a novel formulation that constructs neighborhood proxies using retrieval datasets, independent of the input or output modality. In addition, they present a unified framework that can be applied across different types of VLM tasks. Through experiment

Weaknesses

The readability of this paper is unsatisfactory. Figure 1 is quite confusing; while it seems intended to illustrate the differences between classification and captioning in the proposed framework, the lines and arrows instead create misunderstanding. Figure 2 is difficult to interpret and contributes little to improving comprehension of the paper. Figure 4 is quite complex and lacks clear organization. The text in Figure 5 is too small, and the purpose of the figure itself is not clear enough.

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling