Nearest Neighbor Normalization Improves Multimodal Retrieval
Neil Chowdhury, Franklin Wang, Sumedh Shenoy, Douwe Kiela, Sarah Schwettmann, Tristan Thrush

TL;DR
Nearest Neighbor Normalization (NNN) is a simple, training-free method that improves multimodal image-text retrieval accuracy across multiple models and datasets by correcting errors using reference databases.
Contribution
The paper introduces NNN, a novel, efficient correction technique for contrastive retrieval models that enhances performance without additional training or model modifications.
Findings
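NNN improves both text retrieval and image retrieval metrics for every contrastive model tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) on both datasets used (MS-COCO and Flickr30k).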
NNN can increase accuracy even after model fine-tuning.
The method requires only a reference database, no extra training.
Abstract
Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training, called Nearest Neighbor Normalization (NNN). We show an improvement on retrieval metrics in both text retrieval and image retrieval for all of the contrastive models that we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets that we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.
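The abstract does not give the formula for the correction, so the following is only a minimal sketch under stated assumptions. It assumes that each retrieval candidate gets a bias equal to a scaled mean of its k highest similarities to a set of reference queries, and that this bias is subtracted from the raw score. The function name `nnn_scores`, the parameters `k` and `alpha`, and the toy embeddings are all hypothetical and are not taken from the authors' code.

```python
import numpy as np


def nnn_scores(query_emb, cand_emb, ref_query_emb, k=2, alpha=0.5):
    """Bias-corrected retrieval scores (illustrative helper, an assumption)."""
    # Raw similarity between each query and each retrieval candidate.
    raw = query_emb @ cand_emb.T                  # (n_queries, n_candidates)
    # Similarity of each candidate to every reference query.
    ref = cand_emb @ ref_query_emb.T              # (n_candidates, n_refs)
    # Mean of the k largest reference similarities for each candidate.
    topk = np.partition(ref, -k, axis=1)[:, -k:]
    bias = alpha * topk.mean(axis=1)              # (n_candidates,)
    # Candidates that score highly against many reference queries are penalized.
    return raw - bias[None, :]


# Toy data (made up): candidate 0 is close to most reference queries,
# candidate 1 is not.
queries = np.array([[1.0, 0.0]])
candidates = np.array([[0.8, 0.6], [0.7, -0.7]])
reference_queries = np.array([[0.8, 0.6], [0.6, 0.8], [0.0, 1.0]])

raw = queries @ candidates.T
corrected = nnn_scores(queries, candidates, reference_queries, k=2, alpha=0.5)

print("raw top-1:", raw.argmax(axis=1))
print("corrected top-1:", corrected.argmax(axis=1))
```

In this toy case the candidate that is close to most reference queries wins on raw similarity but loses after the bias is subtracted. No model weights are updated at any point, which matches the training-free claim in the abstract. Only the reference similarities need to be computed, and they can be cached per candidate.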
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: BLIP (Bootstrapping Language-Image Pre-training) · ALBEF
