Exploration into Translation-Equivariant Image Quantization
Woncheol Shin, Gyubok Lee, Jiyoung Lee, Eunyi Lyou, Joonseok Lee, Edward Choi

TL;DR
This paper introduces a translation-equivariant image quantization method that enforces orthogonality among codebook embeddings, improving sample efficiency and accuracy in image and text generation tasks.
Contribution
The paper proposes an orthogonality-based approach to achieving translation equivariance in image quantization, addressing the aliasing that breaks equivariance in current methods.
Findings
Improves sample efficiency in image and text generation tasks.
Improves text-to-image generation accuracy by up to 11.9%.
Improves image-to-text generation accuracy by 3.9%.
Abstract
This exploratory study finds that current image quantization (vector quantization) does not satisfy translation equivariance in the quantized space because of aliasing. Instead of focusing on anti-aliasing, we propose a simple yet effective way to achieve translation-equivariant image quantization by enforcing orthogonality among the codebook embeddings. To explore the advantages of translation-equivariant image quantization, we conduct three proof-of-concept experiments with a carefully controlled dataset: (1) text-to-image generation, where the quantized image indices are the target to predict; (2) image-to-text generation, where the quantized image indices are given as a condition; and (3) training on a smaller training set to analyze sample efficiency. From the strictly controlled experiments, we empirically verify that the translation-equivariant image quantizer improves not only…
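One way to picture "enforcing orthogonality among the codebook embeddings" is as a penalty on the codebook's pairwise cosine similarities. The sketch below is an illustration only, not the authors' implementation: the function name `orthogonality_penalty`, the L2 normalization, and the division by K squared are all assumptions made for this example.

```python
import numpy as np

def orthogonality_penalty(codebook: np.ndarray) -> float:
    """Mean squared deviation of the cosine-similarity Gram matrix from identity.

    codebook has shape (K, D): K embeddings of dimension D.
    This is a hypothetical regularizer, not the paper's exact loss.
    """
    # Scale every embedding (row) to unit length so the Gram matrix holds cosines.
    norms = np.linalg.norm(codebook, axis=1, keepdims=True)
    unit = codebook / np.clip(norms, 1e-12, None)
    gram = unit @ unit.T
    k = codebook.shape[0]
    # Zero exactly when all distinct embeddings are mutually orthogonal.
    return float(np.sum((gram - np.eye(k)) ** 2) / k**2)

orthogonal = np.eye(4, 8)    # 4 mutually orthogonal codes in 8 dimensions
collapsed = np.ones((4, 8))  # 4 identical codes
penalty_orthogonal = orthogonality_penalty(orthogonal)
penalty_collapsed = orthogonality_penalty(collapsed)
```

In a training setup, a term like this would be added to the quantizer's loss with some weight. Note that K nonzero vectors can be mutually orthogonal only if K is at most D, so the penalty cannot reach zero for a codebook with more entries than dimensions.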
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques
