Nearest Neighbor Normalization Improves Multimodal Retrieval

Neil Chowdhury; Franklin Wang; Sumedh Shenoy; Douwe Kiela; Sarah; Schwettmann; Tristan Thrush

arXiv:2410.24114·cs.CV·November 1, 2024

Nearest Neighbor Normalization Improves Multimodal Retrieval

Neil Chowdhury, Franklin Wang, Sumedh Shenoy, Douwe Kiela, Sarah, Schwettmann, Tristan Thrush

PDF

Open Access 1 Repo 1 Video

TL;DR

Nearest Neighbor Normalization (NNN) is a simple, training-free method that improves multimodal image-text retrieval accuracy across multiple models and datasets by correcting errors using reference databases.

Contribution

The paper introduces NNN, a novel, efficient correction technique for contrastive retrieval models that enhances performance without additional training or model modifications.

Findings

01

NNN improves retrieval metrics on multiple models and datasets.

02

NNN can increase accuracy even after model fine-tuning.

03

The method requires only a reference database, no extra training.

Abstract

Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training, called Nearest Neighbor Normalization (NNN). We show an improvement on retrieval metrics in both text retrieval and image retrieval for all of the contrastive models that we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets that we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

multimodal-interpretability/nnn
pytorchOfficial

Videos

Nearest Neighbor Normalization Improves Multimodal Retrieval· underline

Taxonomy

TopicsAdvanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

MethodsBLIP: Bootstrapping Language-Image Pre-training · ALBEF