Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization
Mikhail Persiianov, Arip Asadulaev, Nikita Andreev, Nikita Starodubcev, Dmitry Baranchuk, Anastasis Kratsios, Evgeny Burnaev, Alexander Korotin

TL;DR
This paper introduces a semi-supervised learning method that maximizes data likelihood over both paired and unpaired data, connects this objective to inverse entropic optimal transport, and provides both theoretical recovery guarantees and empirical evidence of effectiveness.
Contribution
It develops a new semi-supervised learning framework connecting likelihood maximization with inverse entropic OT, enabling end-to-end learning of conditional distributions from mixed data.
Findings
The method can recover true conditional distributions with arbitrarily small error.
Empirical results show effective learning of conditional distributions using combined data.
The approach establishes a theoretical link between likelihood maximization and inverse entropic OT.
Abstract
Learning conditional distributions is a central problem in machine learning, which is typically approached via supervised methods with paired data. However, acquiring paired data samples is often challenging, especially in problems such as domain translation. This necessitates the development of models that utilize both limited paired data and additional unpaired i.i.d. samples from the marginal distributions. The usage of such combined data is complex and often relies on heuristic approaches. To tackle this issue, we propose a new learning paradigm that integrates both paired and unpaired data using the data likelihood maximization techniques. We demonstrate that our approach also connects intriguingly with inverse entropic optimal transport (OT). This finding…
Peer Reviews
Decision·Submitted to ICLR 2026
This paper showed that the entropy-regularized inverse OT problem can be formulated as a likelihood maximization problem of an energy-based model. This is an interesting result. The universal approximation property is derived, showing the soundness of the method.
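The link between an energy-based model and likelihood maximization mentioned in this review can be made concrete with a small sketch. It is a toy illustration on a finite state space, not the paper's algorithm: the energy table, the sizes, and the function names `conditional_log_probs` and `paired_nll` are all hypothetical. The conditional model is the Gibbs-Boltzmann form $\pi(y|x) \propto \exp(-E(y|x))$, and the quantity computed is the average negative log-likelihood of paired samples.

```python
import numpy as np

def conditional_log_probs(energy):
    """Row-wise log pi(y|x) = -E(x, y) - log Z(x) for a table of energies."""
    neg = -energy
    # log-sum-exp over y, stabilised by subtracting the row maximum
    m = neg.max(axis=1, keepdims=True)
    log_z = m + np.log(np.exp(neg - m).sum(axis=1, keepdims=True))
    return neg - log_z

def paired_nll(energy, pairs):
    """Average negative log-likelihood of observed (x, y) index pairs."""
    logp = conditional_log_probs(energy)
    xs, ys = pairs[:, 0], pairs[:, 1]
    return -logp[xs, ys].mean()

# Hypothetical toy problem: 3 possible values of x, 4 possible values of y.
rng = np.random.default_rng(0)
energy = rng.normal(size=(3, 4))             # assumed energy table E(x, y)
pairs = np.array([[0, 1], [2, 3], [1, 0]])   # assumed paired observations
nll = paired_nll(energy, pairs)
```

Minimizing `paired_nll` over the entries of `energy` is maximum-likelihood fitting of the conditional model. In the continuous setting the normalizer $Z(x)$ is an integral rather than a finite sum, which is where a tractable parametrization or sampling becomes necessary.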
1. One limitation is that this formulation requires the marginals of the paired data to also follow $\pi_x$ and $\pi_y$. For example, if the paired data are artificially selected, i.e., they do not follow $\pi_x$ and $\pi_y$, then the method no longer works: the first term in Eqn (18) is no longer an approximation of the first term in Eqn (13). I suggest making this clearer in the paper.
2. Clarity: There are too many bold, italic, and underlined words throughout the paper, even in the abstract.
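The reviewer's first point is an instance of a general fact about Monte Carlo estimates: a sample average approximates an expectation only when the samples are drawn from the distribution the expectation is taken over. The sketch below illustrates this with a generic one-dimensional example. It does not reproduce Eqn (13) or Eqn (18) of the paper; the standard normal marginal, the test function `f`, and the selection rule are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed marginal: x ~ N(0, 1). Target quantity: E[f(x)] with f(x) = x, which is 0.
x = rng.normal(size=100_000)

def f(values):
    return values

# Samples drawn from the marginal itself give a consistent estimate.
random_est = f(x).mean()

# "Artificially selected" samples (here: only x > 0) follow a different
# distribution, so the same sample average estimates a different quantity.
selected = x[x > 0]
biased_est = f(selected).mean()
```

Here `random_est` is close to 0, while `biased_est` is close to $\sqrt{2/\pi} \approx 0.8$, the mean of a half-normal distribution. The same mechanism would affect any loss term estimated from paired data whose marginals differ from the assumed ones.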
Overall, the theoretical part of the paper is well-developed and nicely written. The problem is well explained and motivated. The related literature and algorithms are comprehensively reviewed, embedding the paper and its approach in the broader fields of machine learning and optimal transport. The employed model of the unknown conditional distribution and the relation to inverse entropic optimal transport, which is one of the main contributions, is well presented. Besides the b
- Without Appendix C.3 and D.1, the experimental illustrations in §5 are extremely hard to follow. Since the information in these appendices is essential, they should be briefly included in the main text to make §5 self-contained.
- The first example (§5.1) deals with the approximation of a synthetic conditional distribution. At first glance, it seems that the goal is to estimate the optimal transport plan, which in fact is not entirely true. The construction of the *ground truth
__S1.__ I think the authors do a good job in linking the inverse EOT problem with the semi-supervised domain translation objective.
__S2.__ The use of energy-based models is also insightful, and nicely decouples the learning terms involving paired and unpaired data.
__S3.__ I also think the authors do a nice job in devising a practical algorithm for optimizing equation 13.
__Weakness 1 (Incremental Novelty).__ While the paper is well motivated, its main contribution seems incremental over __(Mokrov et al., 2024)__. For instance, looking at Algorithm 1 in the main paper, the only difference with respect to Algorithm 1 of __(Mokrov et al., 2024)__ is the loss function. Other aspects of this submission, such as (1) the usage of the Gibbs-Boltzmann parametrization and (2) the energy function $E(\cdot|x)$, are the same as in the aforementioned paper. __As a consequence,
Taxonomy
Topics: Machine Learning and ELM
