TL;DR
This paper introduces Cluster-PFN, a Transformer-based model that efficiently performs Bayesian clustering, accurately estimating the number of clusters and handling missing data better than traditional methods, with high scalability.
Contribution
The paper presents Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. The model is trained on synthetic data, can handle missing data, and is reported to outperform traditional methods.
Findings
Cluster-PFN estimates the number of clusters more accurately than AIC, BIC, and VI.
It achieves clustering quality comparable to VI but with much faster computation.
It performs well on real-world genomic datasets with high missingness.
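For readers less familiar with the AIC and BIC baselines named in these findings, the following is a minimal sketch of how the two criteria are computed, using the simplest case of a one-component (k=1) Gaussian fit to 1-D data. The function names, the toy data, and the 1-D parameter count are illustrative assumptions, not code or settings from the paper.

```python
import math

def gaussian_loglik_k1(xs):
    """Maximised log-likelihood of a single 1-D Gaussian (a k=1 mixture)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # MLE variance (divides by n)
    # At the MLE the squared residuals sum to n * var, so the
    # log-likelihood simplifies to -n/2 * (log(2*pi*var) + 1).
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def n_params_gmm_1d(k):
    """Free parameters of a 1-D GMM: k means, k variances, k-1 weights."""
    return 3 * k - 1

def aic(loglik, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_points):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_points) - 2 * loglik

# Toy data, made up for illustration only.
xs = [0.1, -0.4, 0.3, 1.2, -0.9, 0.5, -0.2, 0.8]
ll = gaussian_loglik_k1(xs)
print("AIC(k=1):", aic(ll, n_params_gmm_1d(1)))
print("BIC(k=1):", bic(ll, n_params_gmm_1d(1), len(xs)))
```

Model selection with either criterion fits a mixture for each candidate k, evaluates the score, and keeps the k with the lowest value. For k greater than 1 the log-likelihood has no closed form and would come from an iterative fit such as EM.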
Abstract
Bayesian clustering accounts for uncertainty but is computationally demanding at scale. Furthermore, real-world datasets often contain missing values, and simple imputation ignores the associated uncertainty, resulting in suboptimal results. We present Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model (GMM) prior, Cluster-PFN learns to estimate the posterior distribution over both the number of clusters and the cluster assignments. Our method estimates the number of clusters more accurately than handcrafted model selection procedures such as AIC, BIC and Variational Inference (VI), and achieves clustering quality competitive with VI while being orders of magnitude faster. Cluster-PFN can be trained on complex priors that include…
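To make the idea of training on synthetic datasets drawn from a finite GMM prior more concrete, here is a rough sketch of sampling one such dataset. Every specific choice below is an assumption made for illustration: the uniform prior over the number of components, the flat Dirichlet weights, the mean scale of 3.0, the unit component variance, and the function name `sample_gmm_dataset`. The paper's actual prior may differ.

```python
import random

def sample_gmm_dataset(n_points, dim, k_max, rng):
    """Draw one synthetic dataset from a hypothetical finite GMM prior.

    Returns (points, labels, k), where labels are the latent component
    assignments that a supervised-style training target could use.
    """
    # Assumed prior: number of components uniform on {1, ..., k_max}.
    k = rng.randint(1, k_max)  # randint is inclusive on both ends

    # Assumed prior: flat Dirichlet weights, drawn by normalising
    # independent Gamma(1, 1) variables.
    raw = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(raw)
    weights = [r / total for r in raw]

    # Assumed prior: component means ~ N(0, 3^2) per dimension.
    means = [[rng.gauss(0.0, 3.0) for _ in range(dim)] for _ in range(k)]

    # Latent assignment for each point, then a unit-variance draw
    # around the assigned component mean.
    labels = rng.choices(range(k), weights=weights, k=n_points)
    points = [[rng.gauss(means[z][d], 1.0) for d in range(dim)] for z in labels]
    return points, labels, k

rng = random.Random(0)  # fixed seed so the sketch is repeatable
points, labels, k = sample_gmm_dataset(n_points=200, dim=2, k_max=5, rng=rng)
print(k, len(points), len(set(labels)))
```

Note that `k` here is the number of mixture components drawn from the prior; with few points, some components may receive no assignments, so the number of distinct labels can be smaller than `k`.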
Peer Reviews
Decision·Submitted to ICLR 2026
Demonstrates that a single Transformer forward pass, trained purely on synthetic prior samples, can approximate *both* $P(k \mid X)$ and responsibilities, a compelling extension of PFNs into unsupervised Bayesian modeling. The special $\rho$ token for $k$ prediction (Fig. 2) and conditioning mechanism are simple but elegant. Clear runtime wins vs VI, even when VI uses multiple inits (Table 2), and scaling tests up to 20k points show consistent advantages (times reported on p.8).
For AIC/BIC/silhouette, the search over $k$ *excludes $k=1$* “since the silhouette score is undefined for a single cluster” (p.5). But AIC and BIC are perfectly well-defined at $k=1$. Excluding $k=1$ likely *penalizes* AIC/BIC whenever the truth is one cluster, inflating Cluster-PFN’s relative accuracy in Table 1. A fair protocol would allow $k \in \{1,\ldots,K\}$ for AIC/BIC and handle silhouette separately. The paper argues Cluster-PFN “approximates the true Bayesian posterior over the number of
- The paper reads well overall, making it easy for the reader to follow the core ideas.
- To the best of my knowledge, this is the first work to apply Prior-Data Fitted Networks (PFNs) to clustering, opening a promising new direction for amortized Bayesian inference in unsupervised learning.
- **Limited methodological novelty**: the approach primarily adapts existing PFNs by treating known cluster assignments as supervised labels, with only minor modifications to the training procedure. This incremental extension reduces the overall contribution, and in my opinion, could only be mitigated if highly significant results were provided, which is not the case.
- **Lack of motivation**: while the abstract emphasizes *"missingness"* as a central motivation, this aspect is not meaningfully
1) The idea of adapting PFNs to perform Bayesian clustering is interesting and original.
2) The code for reproducibility is available.
3) The authors also discuss the limitations of the proposed approach.
See *Questions*.
