Learning Distributions over Permutations and Rankings with Factorized Representations
Daniel Severo, Brian Karrer, Niklas Nolte

TL;DR
This paper introduces a novel permutation learning method using bijective representations that enables unconstrained deep learning, outperforming existing models on benchmarks and allowing flexible trade-offs between expressivity and computational cost.
Contribution
The authors propose a new approach leveraging permutation representations like Lehmer codes, enabling unconstrained deep learning over permutations and unifying various probabilistic models.
Findings
Outperforms current approaches on the jigsaw puzzle benchmark.
Can learn non-trivial permutation distributions even in low expressivity modes.
Traditional models fail to generate valid permutations in minimal expressivity settings.
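The last finding can be illustrated with a minimal sketch. It assumes a fully factorized sampler with uniform marginals, which is an illustrative stand-in and not the paper's trained models. It also assumes one common convention for Fisher-Yates draws: draw i picks an index in [i, n-1] and swaps it into position i. Independent sampling of inline entries almost never yields a valid permutation, while independent Fisher-Yates draws always decode to one. The function names are hypothetical.

```python
import random

def sample_inline_independent(n, rng):
    """Sample every position of the inline notation independently (uniform marginals)."""
    return [rng.randrange(n) for _ in range(n)]

def sample_fisher_yates_independent(n, rng):
    """Sample Fisher-Yates draws independently and decode them by swapping."""
    perm = list(range(n))
    for i in range(n - 1):
        j = rng.randrange(i, n)  # draw i has n - i admissible values
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def is_permutation(seq):
    """True if seq contains each of 0..len(seq)-1 exactly once."""
    return sorted(seq) == list(range(len(seq)))

rng = random.Random(0)
n, trials = 10, 2000
inline_valid = sum(
    is_permutation(sample_inline_independent(n, rng)) for _ in range(trials)
)
fy_valid = sum(
    is_permutation(sample_fisher_yates_independent(n, rng)) for _ in range(trials)
)
print(f"inline notation: {inline_valid}/{trials} valid")
print(f"Fisher-Yates draws: {fy_valid}/{trials} valid")
```

With n = 10, a uniform independent inline sample is valid with probability 10!/10^10, roughly 0.04%, so nearly every sample contains a repeated item. Every sequence of draws in the admissible ranges decodes to a valid permutation by construction.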
Abstract
Learning distributions over permutations is a fundamental problem in machine learning, with applications in ranking, combinatorial optimization, structured prediction, and data association. Existing methods rely on mixtures of parametric families or neural networks with expensive variational inference procedures. In this work, we propose a novel approach that leverages alternative representations for permutations, including Lehmer codes, Fisher-Yates draws, and Insertion-Vectors. These representations form a bijection with the symmetric group, allowing for unconstrained learning using conventional deep learning techniques, and can represent any probability distribution over permutations. Our approach enables a trade-off between expressivity of the model family and computational requirements. In the least expressive and most computationally efficient case, our method subsumes previous…
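The bijection mentioned in the abstract can be checked directly for small n. The sketch below assumes the standard textbook definition of the Lehmer code, where entry i counts the later entries smaller than entry i; the paper's exact convention may differ. The function names are hypothetical.

```python
from itertools import permutations, product

def lehmer_encode(perm):
    """Entry i counts the positions j > i with perm[j] < perm[i]."""
    n = len(perm)
    return [sum(perm[j] < perm[i] for j in range(i + 1, n)) for i in range(n)]

def lehmer_decode(code):
    """Entry i selects the code[i]-th smallest item not yet used."""
    remaining = list(range(len(code)))
    return [remaining.pop(c) for c in code]

n = 4
all_perms = [list(p) for p in permutations(range(n))]
codes = [lehmer_encode(p) for p in all_perms]

# Decoding inverts encoding on every permutation of size n.
round_trip_ok = all(lehmer_decode(c) == p for p, c in zip(all_perms, codes))

# The codes are exactly the product set where entry i ranges over {0, ..., n-1-i}.
all_codes = [list(c) for c in product(*(range(n - i) for i in range(n)))]
covers_all = sorted(codes) == sorted(all_codes)

print(round_trip_ok, covers_all)
```

Because each entry ranges over its own fixed set regardless of the other entries, a model can output one categorical distribution per entry without masking or constraints, which is what makes unconstrained learning possible in this representation.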
Peer Reviews
Decision: ICLR 2026 Poster
The paper invokes a number of interesting, classical formulations of permutations that I was not familiar with. I learned a lot from reading the paper. The paper is a good example of how to combine classic algorithmic formulations with neural networks to get novel/flexible distributions over structured objects.
I found the number of function evaluations (NFE) formalism confusing. This is used to provide a tradeoff between compute cost and expressivity of the resulting distribution. I understand the NFE=1 and NFE=k regimes, but I don't know how to interpret a value in between and I don't understand how the partitioning was chosen. I also don't know why this variable was so central in the experiments. I was most interested in the different representations, not in sweeping over values of NFE. I have some
1) **Clear, well-motivated idea.** Replacing inline notation with factorized bijections is simple but powerful: it directly addresses invalid outputs in fully-factorized masked models and provides a tradeoff knob between expressivity and compute. 2) **Theoretical insight.** The paper formally characterizes why inline representations collapse to degenerate distributions under 1 NFE and proves a useful identity (Theorem 4.3) that enables efficient batched insertion-vector decoding. 3) **Rema
1. **Scalability to large n.** The results of training/serving for very large rankings (e.g., thousands of items) remain unknown. Jigsaw and cyclic (n=10) are illustrative but small; MovieLens experiments are more realistic but limited to subrankings (n=50). Behavior at larger scale is untested. 2. **Representation selection requires prior knowledge.** Choosing the best factorized representation (e.g., Fisher–Yates for cyclic structure) is sensible but introduces an extra design choice and po
The core idea seems novel and natural, as there are seemingly many ways to represent permutations with no intuitive reason to prefer the inline notation. On the synthetic experiments the results are also quite good.
For this last task in particular, there are some issues. The measurement is NDCG@k which as I understand it doesn’t require a distribution? Even if one explicitly wanted a distribution over rankings for the sake of, say, uncertainty quantification, I also feel a somewhat straightforward baseline is missing from this task. Namely, one could learn a distribution over each individual ranking of a film conditioned on a user, and then implicitly induce a distribution over rankings by sampling scor
Taxonomy
Topics: Machine Learning and Algorithms · Bayesian Modeling and Causal Inference · Imbalanced Data Classification Techniques
Methods: Jigsaw · Variational Inference
