CLIP-like Model as a Foundational Density Ratio Estimator

Fumiya Uchiyama; Rintaro Yanagi; Shohei Taniguchi; Shota Takashiro; Masahiro Suzuki; Hirokatsu Kataoka; Yusuke Iwasawa; Yutaka Matsuo

arXiv:2506.22881 · cs.CV · December 1, 2025

TL;DR

This paper reinterprets CLIP-like models as density ratio estimators, enabling new applications such as importance weighting and divergence estimation, which enhance multimodal data analysis and curation.

Contribution

It provides a systematic reinterpretation of contrastive vision-language models as density ratio estimators and introduces practical algorithms for importance weighting and divergence estimation.

Findings

01. Importance Weight Learning improves F1 scores by up to 7 points.

02. CLIP-based density ratios effectively estimate KL divergences in multimodal data.

03. KL-guided data curation achieves performance comparable to large-scale filtering methods.

Abstract

Density ratio estimation is a core concept in statistical machine learning because it provides a unified mechanism for tasks such as importance weighting, divergence estimation, and likelihood-free inference, but its potential in vision and language models has not been fully explored. Modern vision-language encoders such as CLIP and SigLIP are trained with contrastive objectives that implicitly optimize log density ratios between joint and marginal image-text distributions, so their learned similarity scores are proportional to those log density ratios. However, prior work has largely focused on their embedding utility, and the density-ratio structure induced by contrastive learning has not been systematically examined or exploited in multimodal applications. To address this gap, we reinterpret CLIP-style models as pretrained and general-purpose density ratio estimators and show that this…
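
To make the density-ratio reading of similarity scores concrete, here is a minimal sketch, not the paper's algorithm. It assumes a square matrix `scores` of image-text similarity logits in which the diagonal holds matched pairs (samples from the joint) and the off-diagonal entries stand in for samples from the product of marginals, and it assumes each score equals the log density ratio plus one unknown constant. The function names are hypothetical and no CLIP library is called.

```python
import numpy as np


def normalize_log_ratio(scores):
    """Shift scores so the implied ratio averages to 1 over mismatched pairs.

    Assumption: scores[i, j] ~= log r(x_i, y_j) + c for an unknown constant c,
    where r = p(x, y) / (p(x) p(y)). Since the expectation of r under
    p(x) p(y) is 1, c can be estimated as the log-mean-exp of the
    off-diagonal (mismatched) scores.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    off_diag = scores[~np.eye(n, dtype=bool)]
    m = off_diag.max()  # subtract the max for numerical stability
    log_c = m + np.log(np.mean(np.exp(off_diag - m)))
    return scores - log_c


def kl_joint_vs_marginals(scores):
    """Plug-in estimate of KL(p(x, y) || p(x) p(y)): mean log ratio on matched pairs."""
    log_r = normalize_log_ratio(scores)
    return float(np.mean(np.diag(log_r)))


def importance_weights(scores):
    """Weights that reweight mismatched pairs toward the joint distribution."""
    log_r = normalize_log_ratio(scores)
    n = log_r.shape[0]
    return np.exp(log_r[~np.eye(n, dtype=bool)])


# Toy input (made up for illustration): matched pairs score 3.0, mismatched 1.0.
toy_scores = np.full((4, 4), 1.0) + 2.0 * np.eye(4)
toy_kl = kl_joint_vs_marginals(toy_scores)
toy_weights = importance_weights(toy_scores)
```

The KL divergence between the joint and the product of marginals is the mutual information between images and texts, which is why averaging the normalized log ratio over matched pairs gives a divergence estimate. With a real encoder, `scores` would be the temperature-scaled similarity logits for a batch, and how closely they track a log density ratio depends on the training objective and how well it was optimized.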

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis