Transformers can do Bayesian Clustering

Prajit Bhaskaran; Tom Viering

arXiv:2510.24318·cs.LG·March 18, 2026

TL;DR

This paper introduces Cluster-PFN, a Transformer-based model that efficiently performs Bayesian clustering, accurately estimating the number of clusters and handling missing data better than traditional methods, with high scalability.

Contribution

The paper presents Cluster-PFN, a Transformer-based approach that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. The model is trained on synthetic data, can handle missing data, and outperforms traditional methods.
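As a rough illustration of what training data drawn from a prior can look like, the sketch below samples one labeled dataset from a toy finite Gaussian Mixture Model prior, the family named in the abstract. The function name `sample_gmm_dataset`, the uniform prior over the number of clusters, the Dirichlet(1, ..., 1) mixture weights, the diagonal covariances, and every numeric range are assumptions made for illustration; they are not taken from the paper.

```python
import random


def sample_gmm_dataset(n_points=100, dim=2, max_k=5, seed=None):
    """Draw one synthetic dataset from a toy finite-GMM prior.

    All hyperparameters here are illustrative assumptions.
    Returns (points, labels, k).
    """
    rng = random.Random(seed)

    # Number of clusters: uniform on {1, ..., max_k} (assumed prior).
    k = rng.randint(1, max_k)

    # Mixture weights ~ Dirichlet(1, ..., 1), via normalized Gamma(1, 1) draws.
    raw = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(raw)
    weights = [r / total for r in raw]

    # Component means and per-dimension standard deviations (diagonal covariance).
    means = [[rng.gauss(0.0, 3.0) for _ in range(dim)] for _ in range(k)]
    stds = [[rng.uniform(0.3, 1.0) for _ in range(dim)] for _ in range(k)]

    points, labels = [], []
    for _ in range(n_points):
        # Pick a component, then draw the point from that component.
        z = rng.choices(range(k), weights=weights)[0]
        points.append([rng.gauss(means[z][j], stds[z][j]) for j in range(dim)])
        labels.append(z)
    return points, labels, k
```

In a PFN-style setup, many such datasets would be generated and the known `labels` and `k` would serve as training targets; how the paper encodes them is not shown here.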

Findings

01

Cluster-PFN estimates the number of clusters more accurately than AIC, BIC, and VI.

02

It achieves clustering quality comparable to VI but with much faster computation.

03

It performs well on real-world genomic datasets with high missingness.
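For context on the AIC and BIC baselines named in the first finding, the sketch below shows how these criteria are commonly used to pick the number of clusters from fitted log-likelihoods. The helper names (`gmm_num_params`, `aic`, `bic`, `select_k`) and the full-covariance parameter count are assumptions for illustration; the paper's exact baseline protocol is not reproduced here.

```python
import math


def gmm_num_params(k, dim):
    """Free parameters of a k-component, full-covariance GMM in `dim` dimensions.

    (k - 1) mixture weights + k * dim means + k * dim * (dim + 1) / 2 covariance entries.
    """
    return (k - 1) + k * dim + k * dim * (dim + 1) // 2


def aic(log_lik, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_points):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_points) - 2 * log_lik


def select_k(log_liks, dim, n_points, criterion="bic"):
    """Return the k that minimizes the chosen criterion.

    `log_liks` maps each candidate k to the maximized log-likelihood of a
    GMM fitted with that k (fitting itself is out of scope for this sketch).
    """
    def score(k):
        p = gmm_num_params(k, dim)
        if criterion == "bic":
            return bic(log_liks[k], p, n_points)
        return aic(log_liks[k], p)

    return min(log_liks, key=score)
```

Both criteria are defined for a single component, so the candidate set passed to `select_k` can include k = 1.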

Abstract

Bayesian clustering accounts for uncertainty but is computationally demanding at scale. Furthermore, real-world datasets often contain missing values, and simple imputation ignores the associated uncertainty, which leads to suboptimal results. We present Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model (GMM) prior, Cluster-PFN learns to estimate the posterior distribution over both the number of clusters and the cluster assignments. Our method estimates the number of clusters more accurately than handcrafted model selection procedures such as AIC, BIC and Variational Inference (VI), and achieves clustering quality competitive with VI while being orders of magnitude faster. Cluster-PFN can be trained on complex priors that include…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

Demonstrates that a single Transformer forward pass, trained purely on synthetic prior samples, can approximate *both* \(P(k \mid X)\) and responsibilities, a compelling extension of PFNs into unsupervised Bayesian modeling. The special \(\rho\) token for \(k\) prediction (Fig. 2) and conditioning mechanism are simple but elegant. Clear runtime wins vs VI, even when VI uses multiple inits (Table 2), and scaling tests up to 20k points show consistent advantages (times reported on p.8).

Weaknesses

For AIC/BIC/silhouette, the search over \(k\) *excludes \(k=1\)* “since the silhouette score is undefined for a single cluster” (p.5). But AIC and BIC are perfectly well-defined at \(k=1\). Excluding \(k=1\) likely *penalizes* AIC/BIC whenever the truth is one cluster, inflating Cluster-PFN’s relative accuracy in Table 1. A fair protocol would allow \(k \in \{1,\ldots,K\}\) for AIC/BIC and handle silhouette separately.

The paper argues Cluster-PFN “approximates the true Bayesian posterior over the number of

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper reads well overall, making it easy for the reader to follow the core ideas.
- To the best of my knowledge, this is the first work to apply Prior-Data Fitted Networks (PFNs) to clustering, opening a promising new direction for amortized Bayesian inference in unsupervised learning.

Weaknesses

- **Limited methodological novelty**: the approach primarily adapts existing PFNs by treating known cluster assignments as supervised labels, with only minor modifications to the training procedure. This incremental extension reduces the overall contribution, and in my opinion, could only be mitigated if highly significant results were provided, which is not the case.
- **Lack of motivation**: while the abstract emphasizes *"missingness"* as a central motivation, this aspect is not meaningfully

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1) The idea of adapting PFNs to perform Bayesian clustering is interesting and original.
2) The code for reproducibility is available.
3) The authors also discuss the limitations of the proposed approach.

Weaknesses

See *Questions*.
