MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization
Yu Zhang, Qi Zhang, Zixuan Gong, Yiwei Shi, Yepeng Liu, Duoqian Miao, Yang Liu, Ke Liu, Kun Yi, Wei Fan, Liang Hu, Changwei Wang

TL;DR
MLIP enhances language-image pretraining by utilizing multi-perspective supervision through frequency and spatial domain analysis, improving data efficiency and reducing computational costs compared to traditional CLIP models.
Contribution
The paper introduces MLIP, a novel pretraining approach that leverages frequency transforms and token merging to provide richer supervision and efficiency improvements over CLIP.
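To make the idea of a frequency-domain view concrete, the following is a minimal sketch that splits a grayscale image into low- and high-frequency parts with a 2D FFT. It is an illustration only: this summary does not say which frequency transform MLIP uses, so the FFT, the function name `split_frequency_bands`, and the `cutoff` parameter are all assumptions rather than the authors' method.

```python
import numpy as np

def split_frequency_bands(image, cutoff=0.1):
    """Split a 2D grayscale image into low- and high-frequency components.

    Hypothetical helper for illustration; `cutoff` is a radius in
    cycles per pixel (the FFT frequency range is [-0.5, 0.5)).
    """
    h, w = image.shape
    # Centered 2D spectrum (zero frequency in the middle).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Matching centered frequency coordinates for each spectrum entry.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    low_mask = radius <= cutoff
    # Invert each masked spectrum back to the spatial domain.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high
```

Because the two masks partition the spectrum, `low + high` reconstructs the input, so the two parts can be inspected separately: smooth structure ends up in `low`, edges and fine texture in `high`.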
Findings
MLIP achieves better performance on multimodal tasks.
It reduces computational cost through token merging.
Extensive experiments validate the effectiveness of MLIP.
Abstract
Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, leading to rapid advancements in multimodal studies. However, CLIP faces a notable challenge in terms of inefficient data utilization. It relies on a single contrastive supervision for each image-text pair during representation learning, disregarding a substantial amount of valuable information that could offer richer supervision. Additionally, the retention of non-informative tokens leads to increased computational demands and time costs, particularly in CLIP's ViT image encoder. To address these issues, we propose Multi-Perspective Language-Image Pretraining (MLIP). In MLIP, we leverage the frequency transform's sensitivity to both high and low-frequency variations, which complements the spatial domain's sensitivity limited to low-frequency variations only. By incorporating frequency transforms and…
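The "single contrastive supervision" that the abstract attributes to CLIP can be sketched as the standard symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. This is a minimal NumPy sketch of that baseline objective, not MLIP's multi-perspective loss; the function names and the `temperature` default of 0.07 are assumptions chosen for illustration.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of (image, text) pairs.

    Row i of `image_emb` is assumed to be paired with row i of `text_emb`.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, shape (batch, batch).
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should pick its own column.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should pick its own row.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Each image-text pair contributes exactly one positive signal here (the diagonal entry), which is the sense in which the supervision is "single".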
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training
