Contrastive Predictive Coding Done Right for Mutual Information Estimation

J. Jon Ryu; Pavan Yeddanapudi; Xiangxiang Xu; Gregory W. Wornell

arXiv:2510.25983·cs.LG·October 31, 2025

3 Reviews

TL;DR

This paper critically examines the use of InfoNCE for mutual information estimation, introduces a corrected estimator called InfoNCE-anchor, and unifies various contrastive objectives under a single framework, showing that contrastive learning helps downstream tasks through structured density ratios rather than through accurate MI estimates.

Contribution

It presents a new, bias-reduced MI estimator called InfoNCE-anchor and a unified theoretical framework for contrastive objectives using proper scoring rules.

Findings

01·InfoNCE is not a valid MI estimator.

02·InfoNCE-anchor achieves more accurate MI estimates.

03·Contrastive learning improves downstream tasks through structured density ratios, not MI accuracy.

Abstract

The InfoNCE objective, originally introduced for contrastive representation learning, has become a popular choice for mutual information (MI) estimation, despite its indirect connection to MI. In this paper, we demonstrate why InfoNCE should not be regarded as a valid MI estimator, and we introduce a simple modification, which we refer to as InfoNCE-anchor, for accurate MI estimation. Our modification introduces an auxiliary anchor class, enabling consistent density ratio estimation and yielding a plug-in MI estimator with significantly reduced bias. Beyond this, we generalize our framework using proper scoring rules, which recover InfoNCE-anchor as a special case when the log score is employed. This formulation unifies a broad spectrum of contrastive objectives, including NCE, InfoNCE, and $f$-divergence variants, under a single principled framework. Empirically, we find that…
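To make the abstract's first claim concrete, here is a minimal NumPy sketch of the well-known $\log K$ ceiling on the InfoNCE value. It is not the paper's InfoNCE-anchor estimator. The setup is an assumption chosen for illustration: a standard bivariate Gaussian with correlation `rho`, for which the log density ratio is known in closed form, so the critic is an oracle rather than a trained network. The function names `log_ratio`, `infonce`, and `plugin_mi` are hypothetical.

```python
import numpy as np

def log_ratio(x, y, rho):
    """Exact log p(x,y) / (p(x) p(y)) for a standard bivariate Gaussian
    with correlation rho (closed form; stands in for a learned critic)."""
    s2 = 1.0 - rho**2
    quad = rho**2 * x**2 - 2.0 * rho * x * y + rho**2 * y**2
    return -0.5 * np.log(s2) - quad / (2.0 * s2)

def sample_pairs(rho, n_batches, K, seed=0):
    """Draw n_batches groups of K correlated (x, y) pairs."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_batches, K))
    y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal((n_batches, K))
    return x, y

def infonce(x, y, rho):
    """InfoNCE value with the oracle critic, averaged over batches of size K."""
    K = x.shape[1]
    # scores[b, i, j] = log-ratio of x_i paired with y_j inside batch b
    scores = log_ratio(x[:, :, None], y[:, None, :], rho)
    pos = np.diagonal(scores, axis1=1, axis2=2)        # matched pairs
    m = scores.max(axis=2, keepdims=True)              # stable log-sum-exp
    lse = m[..., 0] + np.log(np.exp(scores - m).sum(axis=2))
    # pos - lse <= 0 for every row, so the result can never exceed log K
    return float(np.mean(pos - lse) + np.log(K))

def plugin_mi(x, y, rho):
    """Plug-in MI estimate: average the (here, oracle) log density ratio
    over samples from the joint distribution."""
    return float(np.mean(log_ratio(x, y, rho)))

rho, K = 0.999, 4
true_mi = -0.5 * np.log(1.0 - rho**2)   # analytic MI in nats
x, y = sample_pairs(rho, n_batches=2000, K=K)
est_infonce = infonce(x, y, rho)
est_plugin = plugin_mi(x, y, rho)
print(f"true MI      = {true_mi:.3f} nats")
print(f"log K        = {np.log(K):.3f} nats")
print(f"InfoNCE      = {est_infonce:.3f} nats")
print(f"plug-in      = {est_plugin:.3f} nats")
```

With the true MI above $\log K$, the InfoNCE value stays below the ceiling even with a perfect critic, while the plug-in average of the log density ratio is not capped in this way. This is the gap that a consistent density ratio estimate is meant to close; how the anchor class achieves that consistency is described in the paper, not in this sketch.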

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

1. The paper is clearly written with precise formulations.
2. The theoretical analysis provides a sharp upper bound on InfoNCE via $K$-way JS divergence, clarifying its high bias.
3. The unified framework through proper scoring rules elegantly connects NCE, InfoNCE, and $f$-divergence variants.

Weaknesses

1. Theorem 2 assumes known distributions $q_1$ and $q_0$ for its equality conditions, but in practice, with neural critics of finite capacity, the proportionality $r_\theta \propto \frac{q_1}{q_0}$ may not hold, leaving open how approximation errors affect the tightness of the bound.
2. The critique of α-InfoNCE in Section 3.3 claims a proof flaw without supplying a counterexample or an alternative derivation.
3. The extension to proper scoring rules claims consistency for class probability estimation, but this paper

Reviewer 02·Rating 6·Confidence 3

Strengths

1. The analysis of InfoNCE's limitations is sharp and well-motivated, with a tight bound on its divergence (Theorem 2) that clarifies why it underestimates MI even for large $K$. The anchor modification is elegant and directly addresses the identifiability issue in density ratio estimation (Theorem 3). The generalization to proper scoring rules is a nice unification, recovering existing methods as special cases while providing a principled decision-theoretic foundation.
2. Strong results in MI est

Weaknesses

1. While the SSL experiments are thorough, they are restricted to CIFAR-100 with a ResNet-18 backbone. It would be valuable to test on larger datasets or architectures (e.g., ViTs) to confirm whether the lack of improvement holds more generally.
2. The default ν=1 is chosen without extensive tuning; a sensitivity analysis (e.g., ν vs. performance) could reveal trade-offs, especially since the asymptotic behavior links ν/K to bounds like DV/NWJ.
3. The anchor introduces an extra term, potentially incre

Reviewer 03·Rating 6·Confidence 4

Strengths

- This study presents a theoretically grounded method to enhance existing mutual information (MI) estimation techniques and provides empirical evidence demonstrating its effectiveness.
- The proposed “anchor” modification is straightforward yet addresses a subtle theoretical issue in density ratio identifiability.
- The MI estimation experiments are comprehensive and show consistent advantages of the proposed method across different domains.

Weaknesses

- Although theoretically neat, the proposed modification brings no tangible improvement to representation learning, arguably the main motivation for contrastive objectives.
- Only a few relatively simple contrastive methods are considered; comparison with modern frameworks would strengthen the practical side.
