NLIP: Noise-robust Language-Image Pre-training
Runhui Huang, Yanxin Long, Jianhua Han, Hang Xu, Xiwen Liang, Chunjing Xu, Xiaodan Liang

TL;DR
NLIP introduces a noise-robust pre-training framework for image-text models that effectively handles noisy web data through noise-harmonization and noise-completion, leading to improved performance on various downstream tasks.
Contribution
The paper proposes a novel noise mitigation approach in cross-modal pre-training that jointly addresses incorrect and incomplete data issues without manual data cleaning.
Findings
Significant performance gains on zero-shot classification, captioning, and retrieval tasks.
Effective noise handling using only 26M pre-training samples, outperforming existing models.
Enhanced robustness and stability in large-scale image-text pre-training.
Abstract
Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e.g., zero-shot classification, retrieval and image captioning. However, their successes highly rely on the scale and quality of web-crawled data that naturally contain incomplete and noisy information (e.g., wrong or irrelevant content). Existing works either design manual rules to clean data or generate pseudo-targets as auxiliary signals for reducing noise impact, which do not explicitly tackle both the incorrect and incomplete challenges simultaneously. In this paper, to automatically mitigate the impact of noise by solely mining over existing data, we propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion. First, in noise-harmonization scheme, NLIP…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
