Vector-Quantized Vision Foundation Models for Object-Centric Learning

Rongzhen Zhao; Vivienne Wang; Juho Kannala; Joni Pajarinen

arXiv:2502.20263 · cs.CV · November 11, 2025

TL;DR

This paper introduces a unified vector-quantized vision foundation model architecture for object-centric learning, improving object discovery, recognition, and downstream visual tasks by leveraging shared quantized representations.

Contribution

It proposes VQ-VFM-OCL, a framework that unifies how VFM representations are used in object aggregation and decoding, improving OCL performance and offering theoretical insights.
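The contribution centers on vector-quantizing VFM representations. As a minimal sketch, assuming a plain nearest-neighbour codebook lookup in the style of VQ-VAE (the repository's actual quantizer, codebook size and training losses are not described on this page), each patch feature of a flattened VFM feature map is replaced by its closest code vector. `vector_quantize` is a hypothetical helper, and the fixed toy codebook stands in for a learned one.

```python
import numpy as np


def vector_quantize(features: np.ndarray, codebook: np.ndarray):
    """Replace each feature vector with its nearest codebook entry (squared L2).

    features: (N, D) array, e.g. an H*W x D VFM feature map flattened over patches.
    codebook: (K, D) array of code vectors (learned in practice; fixed here).
    Returns (indices, quantized) with shapes (N,) and (N, D).
    """
    # ||f - c||^2 = ||f||^2 - 2 f.c + ||c||^2, computed for all pairs -> (N, K)
    dists = (
        np.sum(features**2, axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + np.sum(codebook**2, axis=1)[None, :]
    )
    indices = np.argmin(dists, axis=1)  # index of the closest code per feature
    return indices, codebook[indices]


# Toy example: three 2-D "patch features" and a three-entry codebook.
feats = np.array([[0.1, 0.0], [0.9, 1.1], [-1.0, -0.8]])
codes = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
idx, quantized = vector_quantize(feats, codes)
print(idx.tolist())  # [0, 1, 2]
```

When trained end to end, VQ layers typically pass gradients through the lookup with a straight-through estimator and add codebook and commitment losses; NumPy has no autograd, so those parts are left out of this sketch.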

Findings

01 Consistent outperformance of baselines in object discovery and recognition
02 Improved performance on downstream visual prediction and reasoning tasks
03 Mathematical analysis explaining the benefits of shared quantization

Abstract

Object-Centric Learning (OCL) aggregates image or video feature maps into object-level feature vectors, termed slots. Its self-supervision, reconstructing the input from slots, struggles with complex object textures, so Vision Foundation Model (VFM) representations are used as the aggregation input and the reconstruction target. Existing methods leverage VFM representations in diverse ways yet fail to fully exploit their potential. In response, we propose a unified architecture, Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO). The key to our unification is simply shared quantization of VFM representations in OCL aggregation and decoding. Experiments show that across different VFMs, aggregators and decoders, VVO consistently outperforms baselines in object discovery and recognition, as well as in downstream visual prediction and reasoning. We also mathematically analyze why VFM…
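The abstract's first sentence describes aggregating a feature map into slots. The sketch below is a heavily simplified, weight-free variant of Slot Attention (Locatello et al., 2020), one common OCL aggregator; it is not VVO's code, and the paper evaluates several aggregators. Learned query/key/value projections, LayerNorm, the GRU update and the residual MLP are omitted, and `aggregate_slots` with its parameters are hypothetical names.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def aggregate_slots(features: np.ndarray, num_slots: int, iters: int = 3, seed: int = 0):
    """Simplified Slot-Attention-style aggregation (no learned weights).

    features: (N, D) flattened feature map.
    Returns slots (num_slots, D) and the final attention map (N, num_slots).
    """
    rng = np.random.default_rng(seed)
    _, d = features.shape
    slots = rng.normal(size=(num_slots, d))  # random slot initialization
    for _ in range(iters):
        logits = features @ slots.T / np.sqrt(d)  # (N, S) scaled dot products
        attn = softmax(logits, axis=1)  # normalize over slots: slots compete for each feature
        # Normalize over inputs so each slot becomes a weighted mean of the features.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ features  # (S, D) updated slots
    return slots, attn


# Toy example: 50 random 8-D "patch features" grouped into 4 slots.
feats = np.random.default_rng(1).normal(size=(50, 8))
slots, attn = aggregate_slots(feats, num_slots=4)
print(slots.shape, attn.shape)  # (4, 8) (50, 4)
```

Normalizing the softmax over slots, rather than over input positions, is what makes slots compete for patches; in an OCL model the resulting slots are then decoded to reconstruct the chosen target, which for VVO is built from VFM representations.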

Code & Models

Repositories

Genera1Z/VQ-VFM-OCL (PyTorch, official)

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms · Explainable Artificial Intelligence (XAI)