MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization

Yu Zhang; Qi Zhang; Zixuan Gong; Yiwei Shi; Yepeng Liu; Duoqian Miao; Yang Liu; Ke Liu; Kun Yi; Wei Fan; Liang Hu; Changwei Wang

arXiv:2406.01460·cs.CV·June 5, 2024

Open Access

TL;DR

MLIP improves language-image pretraining by supervising each image-text pair from multiple perspectives, combining frequency-domain and spatial-domain analysis. This improves data efficiency and reduces computational cost relative to standard CLIP models.

Contribution

The paper introduces MLIP, a novel pretraining approach that leverages frequency transforms and token merging to provide richer supervision and efficiency improvements over CLIP.
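To make the idea of a frequency-domain view concrete, the following is a minimal sketch that splits a grayscale image into low- and high-frequency components with a 2D FFT. It is an illustration only: this summary does not say which frequency transform MLIP uses, so the choice of FFT, the function name `split_frequency_bands`, and the `radius` cutoff are all assumptions made for the example.

```python
import numpy as np

def split_frequency_bands(image, radius):
    """Split a 2D grayscale image into low- and high-frequency parts.

    Hypothetical helper for illustration; not the transform used in MLIP.
    """
    h, w = image.shape
    # Move the zero-frequency (DC) term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Distance of every frequency bin from the centre of the spectrum.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius
    # Invert each band separately; the two parts sum back to the image.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high
```

The low band carries smooth structure such as overall brightness and coarse shapes, while the high band carries edges and fine texture. Separating the two shows the kind of information a frequency-domain perspective can expose alongside the spatial one.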

Findings

01

MLIP achieves better performance on multimodal tasks.

02

It reduces computational cost through token merging.

03

Extensive experiments validate the effectiveness of MLIP.

Abstract

Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, leading to rapid advancements in multimodal studies. However, CLIP faces a notable challenge in terms of inefficient data utilization. It relies on a single contrastive supervision for each image-text pair during representation learning, disregarding a substantial amount of valuable information that could offer richer supervision. Additionally, the retention of non-informative tokens leads to increased computational demands and time costs, particularly in CLIP's ViT image encoder. To address these issues, we propose Multi-Perspective Language-Image Pretraining (MLIP). In MLIP, we leverage the frequency transform's sensitivity to both high- and low-frequency variations, which complements the spatial domain's sensitivity to low-frequency variations only. By incorporating frequency transforms and…
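The "single contrastive supervision for each image-text pair" mentioned in the abstract can be illustrated with the symmetric contrastive loss commonly associated with CLIP. The sketch below is a plain NumPy version under stated assumptions: the function name `clip_contrastive_loss`, the default `temperature` of 0.07, and the random embeddings in the usage lines are illustrative, and the code shows the CLIP baseline rather than MLIP's additional objectives, which this summary does not specify.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_emb is assumed to match row i of text_emb.
    """
    # L2-normalise so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct pairing for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

# Usage with made-up embeddings: matched pairs should score a lower loss.
rng = np.random.default_rng(0)
images = rng.normal(size=(8, 16))
texts = images + 0.01 * rng.normal(size=(8, 16))
matched_loss = clip_contrastive_loss(images, texts)
shuffled_loss = clip_contrastive_loss(images, np.roll(texts, 1, axis=0))
```

Each image-text pair contributes exactly one positive term per direction here, which is the limited supervision signal the abstract argues leaves useful information unused.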

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Contrastive Language-Image Pre-training