Vector-Quantized Vision Foundation Models for Object-Centric Learning
Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen

TL;DR
This paper introduces a unified vector-quantized vision foundation model architecture for object-centric learning, improving object discovery, recognition, and downstream visual tasks by leveraging shared quantized representations.
Contribution
The paper proposes VQ-VFM-OCL (VVO), a framework that uses the same quantized VFM representations in both object aggregation and decoding. It reports improved OCL performance and gives a mathematical analysis of why the shared quantization helps.
Findings
Consistent outperformance over baselines in object discovery and recognition
Improved downstream visual prediction and reasoning tasks
Mathematical analysis explaining the benefits of shared quantization
Abstract
Object-Centric Learning (OCL) aggregates image or video feature maps into object-level feature vectors, termed slots. Its self-supervision of reconstructing the input from slots struggles with complex object textures, so Vision Foundation Model (VFM) representations are used as the aggregation input and reconstruction target. Existing methods leverage VFM representations in diverse ways yet fail to fully exploit their potential. In response, we propose a unified architecture, Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO). The key to our unification is simply quantizing VFM representations in a shared way across OCL aggregation and decoding. Experiments show that across different VFMs, aggregators and decoders, our VVO consistently outperforms baselines in object discovery and recognition, as well as downstream visual prediction and reasoning. We also mathematically analyze why VFM…
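The quantization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only shows generic nearest-neighbor vector quantization, and the names `squared_distance`, `quantize`, `codebook`, and `features` are hypothetical, as are the toy values. A real system would take `features` from a VFM and learn `codebook` during training. How VVO wires the quantized features into the aggregator and decoder is not specified in this summary, so the last two lines only illustrate the idea of one quantized representation serving both roles.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    features: list of d-dimensional vectors (stand-in for a flattened VFM feature map)
    codebook: list of d-dimensional code vectors (hand-written here; learned in practice)
    Returns (quantized, indices): the chosen code vectors and their codebook indices.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]
    quantized = [codebook[i] for i in indices]
    return quantized, indices


# Toy data (assumption): three 2-D codes and four 2-D "patch features".
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.9], [0.6, 0.7]]

quantized, indices = quantize(features, codebook)

# Illustration of "shared": the same quantized representation is used
# both as the aggregation input and as the reconstruction target.
aggregator_input = quantized
reconstruction_target = quantized
```

With this toy data, each feature is replaced by whichever of the three codes is closest, so several nearby features can collapse onto the same code vector.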
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms · Explainable Artificial Intelligence (XAI)
