Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images

Jamie S. J. Stirling; Noura Al-Moubayed; Hubert P. H. Shum

arXiv:2604.01843 · cs.CV · April 3, 2026

TL;DR

This paper introduces a permutation-invariant vector-quantized autoencoder that captures global semantic features of images without positional information, enabling efficient image interpolation and synthesis.

Contribution

The paper proposes the permutation-invariant vector-quantized autoencoder (PI-VQ) together with matching quantization, a novel vector quantization algorithm that improves capacity while keeping image representations position-free and interpretable.

Findings

01 PI-VQ captures global semantic features without positional information.

02 Matching quantization increases effective capacity by 3.5×.

03 The approach achieves competitive metrics on CelebA datasets.

Abstract

Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on…
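
To make the distinction between position-dependent and permutation-invariant codes concrete, the following is a minimal sketch in NumPy. It uses ordinary nearest-neighbour vector quantization, not the paper's matching quantization or the PI-VQ architecture, neither of which is specified in the text above. The codebook size, latent dimension, and the helper names `quantize` and `as_multiset` are assumptions chosen for illustration only.

```python
import numpy as np


def quantize(latents, codebook):
    """Plain nearest-neighbour VQ: map each latent vector to its closest code index.

    This is the standard lookup used in VQ-VAE-style models, shown here only
    for illustration; it is not the matching quantization from the paper.
    """
    # Squared Euclidean distances, shape (num_latents, num_codes).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def as_multiset(indices, num_codes):
    """Discard position: keep only how many times each code occurs."""
    return np.bincount(indices, minlength=num_codes)


rng = np.random.default_rng(0)
num_codes, dim = 8, 4                        # hypothetical sizes
codebook = rng.normal(size=(num_codes, dim))
latents = rng.normal(size=(6, dim))          # e.g. a flattened grid of 6 latents

indices = quantize(latents, codebook)

# Shuffle the latent positions and quantize again.
perm = rng.permutation(len(latents))
shuffled_indices = quantize(latents[perm], codebook)

# The ordered code sequence follows the shuffle (position-dependent) ...
sequence_follows_perm = np.array_equal(shuffled_indices, indices[perm])
# ... while the multiset of codes is unchanged (permutation-invariant).
multiset_unchanged = np.array_equal(
    as_multiset(indices, num_codes), as_multiset(shuffled_indices, num_codes)
)
print(sequence_follows_perm, multiset_unchanged)
```

In this sketch the ordered index sequence changes with the arrangement of the latents, whereas the code counts do not. A representation built only from the latter carries no positional information, which is the kind of constraint the abstract describes.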
