Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images
Jamie S. J. Stirling, Noura Al-Moubayed, Hubert P. H. Shum

TL;DR
This paper introduces a permutation-invariant vector-quantized autoencoder that captures global semantic features of images without positional information, enabling efficient image interpolation and synthesis.
Contribution
The paper proposes the PI-VQ model together with a novel matching quantization algorithm, which improves capacity and yields position-free, interpretable image representations.
Findings
PI-VQ captures global semantic features without positional information.
Matching quantization increases effective capacity by 3.5×.
The approach achieves competitive metrics on CelebA datasets.
Abstract
Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on…
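To make the idea of position-free codes concrete, here is a minimal NumPy sketch, not the authors' implementation. It uses plain nearest-neighbour quantization (not the paper's matching quantization) and treats the resulting code indices as an unordered multiset, so shuffling the latents leaves the representation unchanged. The function names (`quantize`, `as_multiset`, `interpolate`) and the mixing rule used for interpolation are assumptions made for illustration only.

```python
import numpy as np


def quantize(latents, codebook):
    """Nearest-neighbour VQ: map each latent vector to its closest codebook index."""
    # Squared Euclidean distance between every latent and every codebook entry.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def as_multiset(indices, codebook_size):
    """Discard ordering: count how many times each code is used."""
    return np.bincount(indices, minlength=codebook_size)


def interpolate(indices_a, indices_b, t, rng):
    """Hypothetical mixing rule: keep about (1 - t) of the codes from a, take t from b.

    Assumes both inputs contain the same number of codes.
    """
    n = len(indices_a)
    k = int(round(t * n))
    from_b = rng.permutation(indices_b)[:k]
    from_a = rng.permutation(indices_a)[: n - k]
    return np.concatenate([from_a, from_b])


rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))    # 8 codes of dimension 4 (arbitrary sizes)
latents_a = rng.normal(size=(16, 4))  # 16 latent vectors for image A
latents_b = rng.normal(size=(16, 4))  # 16 latent vectors for image B

codes_a = quantize(latents_a, codebook)
codes_b = quantize(latents_b, codebook)

# Shuffling the latents changes the order of the codes but not the multiset.
shuffled = latents_a[rng.permutation(len(latents_a))]
same = np.array_equal(
    as_multiset(codes_a, len(codebook)),
    as_multiset(quantize(shuffled, codebook), len(codebook)),
)
print("multiset unchanged under permutation:", same)

# A half-way mix of the two code sets, with no learned prior involved.
mixed = interpolate(codes_a, codes_b, 0.5, rng)
```

Because the representation is a bag of codes, a decoder consuming it would need to be order-insensitive itself (for example, by pooling over code embeddings); that part is outside the scope of this sketch.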
