LEMoN: Label Error Detection using Multimodal Neighbors
Haoran Zhang, Aparna Balagopalan, Nassim Oufattole, Hyewon Jeong, Yan Wu, Jiacheng Zhu, Marzyeh Ghassemi

TL;DR
LEMoN is a novel method that uses multimodal neighborhood information in pretrained models to detect label errors in image-caption datasets, improving data quality and downstream captioning performance.
Contribution
This paper introduces LEMoN, a new approach for label error detection in multimodal datasets, with theoretical justification and extensive empirical validation.
Findings
LEMoN outperforms existing baselines by over 3% in label error detection.
Filtering datasets with LEMoN improves downstream captioning by over 2 BLEU points.
Empirical validation across eight datasets confirms LEMoN's effectiveness.
Abstract
Large repositories of image-caption pairs are essential for the development of vision-language models. However, these datasets are often extracted from noisy data scraped from the web, and contain many mislabeled instances. In order to improve the reliability of downstream models, it is important to identify and filter images with incorrect captions. However, beyond filtering based on image-caption embedding similarity, no prior works have proposed other methods to filter noisy multimodal data, or concretely assessed the impact of noisy captioning data on downstream training. In this work, we propose, theoretically justify, and empirically validate LEMoN, a method to identify label errors in image-caption datasets. Our method leverages the multimodal neighborhood of image-caption pairs in the latent space of contrastively pretrained multimodal models to automatically identify label…
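The abstract contrasts two signals: the distance between an image embedding and its own caption embedding (the existing similarity-based filter), and information from the pair's neighborhood in the latent space. The sketch below is a simplified illustration of that idea, not the paper's exact scoring function. The function name, the neighbor count `k`, the weight `beta`, and the toy 2-D embeddings are all assumptions made for this example; in practice the embeddings would come from a contrastively pretrained multimodal model.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def neighbor_label_error_scores(img_emb, txt_emb, k=2, beta=1.0):
    """Hypothetical score: higher means the caption is more likely wrong.

    Combines (a) the cosine distance between each image and its own caption
    with (b) the mean cosine distance between that caption and the captions
    of the image's k nearest neighbors in image-embedding space.
    """
    img = l2_normalize(np.asarray(img_emb, dtype=float))
    txt = l2_normalize(np.asarray(txt_emb, dtype=float))
    n = img.shape[0]

    # (a) Baseline signal: image-to-own-caption cosine distance.
    pair_dist = 1.0 - np.sum(img * txt, axis=1)

    # Nearest neighbors among images, excluding each image itself.
    img_dist = 1.0 - img @ img.T
    np.fill_diagonal(img_dist, np.inf)
    nbrs = np.argsort(img_dist, axis=1)[:, :k]  # shape (n, k)

    # (b) Neighborhood signal: how far is this caption from the captions
    # attached to visually similar images?
    txt_dist = 1.0 - txt @ txt.T
    nbr_term = txt_dist[np.arange(n)[:, None], nbrs].mean(axis=1)

    return pair_dist + beta * nbr_term

# Toy data: two visual clusters; caption 0 is deliberately mismatched.
images = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, -0.1],
                   [0.0, 1.0], [0.1, 1.0], [-0.1, 1.0]])
captions = images.copy()
captions[0] = [0.0, 1.0]  # wrong caption for image 0

scores = neighbor_label_error_scores(images, captions, k=2, beta=1.0)
flagged = int(np.argmax(scores))
```

In this toy setup the mismatched pair at index 0 receives the highest score, because its caption is far both from its own image and from the captions of neighboring images. Thresholding or ranking such scores is one way a dataset could then be filtered.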
Taxonomy
Topics: Software Testing and Debugging Techniques · Risk and Safety Analysis · Anomaly Detection Techniques and Applications
