Multi-Scale Fusion for Object Representation
Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen

TL;DR
This paper introduces Multi-Scale Fusion, a novel method that improves object-centric learning by leveraging multi-scale image representations and fusion techniques to better handle objects of varying sizes.
Contribution
It proposes a multi-scale fusion approach that enhances VAE guidance in object-centric learning, addressing scale variability in object representations.
Findings
Improves performance of existing OCL methods on standard benchmarks.
Enhances scale-invariance and variance in object super-pixels.
Outperforms state-of-the-art diffusion-based methods.
Abstract
Representing images or videos as object-level feature vectors, rather than pixel-level feature maps, facilitates advanced visual tasks. Object-Centric Learning (OCL) primarily achieves this by reconstructing the input under the guidance of Variational Autoencoder (VAE) intermediate representation to drive so-called slots to aggregate as much object information as possible. However, existing VAE guidance does not explicitly address that objects can vary in pixel sizes while models typically excel at specific pattern scales. We propose Multi-Scale Fusion (MSF) to enhance VAE guidance for OCL training. To ensure objects of all sizes fall within VAE's comfort zone, we adopt the image pyramid, which produces intermediate representations at multiple scales; to foster scale-invariance/variance in object super-pixels, we devise inter/intra-scale…
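The image pyramid mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `downsample_2x` and `image_pyramid`, the choice of 2x2 average pooling, and the default of three levels are all assumptions made for illustration. It only shows how one input yields copies at several resolutions, each of which could then be passed to an encoder.

```python
import numpy as np


def downsample_2x(image: np.ndarray) -> np.ndarray:
    """Halve height and width by averaging each non-overlapping 2x2 block.

    Hypothetical helper: average pooling is one common way to build a pyramid,
    chosen here for simplicity, not taken from the paper.
    """
    h, w, c = image.shape
    h2, w2 = h // 2, w // 2
    # Crop to even dimensions so the image splits cleanly into 2x2 blocks.
    cropped = image[: h2 * 2, : w2 * 2]
    blocks = cropped.reshape(h2, 2, w2, 2, c)
    return blocks.mean(axis=(1, 3))


def image_pyramid(image: np.ndarray, num_scales: int = 3) -> list:
    """Return [full-res, 1/2-res, 1/4-res, ...] with num_scales levels.

    num_scales=3 is an illustrative default, not a value from the paper.
    """
    levels = [image.astype(np.float64)]
    for _ in range(num_scales - 1):
        levels.append(downsample_2x(levels[-1]))
    return levels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((64, 64, 3))  # stand-in for an H x W x C image
    for level in image_pyramid(img, num_scales=3):
        print(level.shape)
```

In a pyramid like this, an object that is large at full resolution appears smaller at the coarser levels, which is the intuition behind keeping objects of different sizes within the scale range a model handles well.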
Peer Reviews
Decision: ICLR 2025 Poster
[S1] The method seems to consistently outperform competitors for the OCL benchmarks, and is even somewhat competitive with the foundation model baseline (DINOSAUR). [S2] The approach is well-motivated. [S3] Figure 3 effectively demonstrates, qualitatively, how the MSF accomplishes better OCL, that is, features that are better connected to the objects themselves.
[W1] The impact of the work seems limited. The ideas about multi-scale representation and fusion themselves are not novel (see for example FPN in object detection literature), but the implementation and application to this task are. However, the implementation neglects potentially more impactful redesigns of the primary encoder-decoder pipeline or the slot attention mechanism itself. [W2] Additionally, there are no results for any downstream tasks. Thus, while interesting, it is hard to proj
1) The overall intuition and direction of the paper are good and deal with important problems of Object-Centric Learning. 2) The multi-scale training makes a lot of sense since dealing with objects is an important problem and generally has not been solved fully in the field. 3) The paper shows good analysis and shows object separability of VAE guidance in Fig. 3. Fig. 2 quantitative results also look good.
1) Results on real-world datasets, especially COCO in Table 1, are really incremental. It’s not really clear how much advantage is gained by adding MSF. 2) Analysis of varying object sizes can show results on scale understanding better than overall IoU improvement. 3) Is there any intuition as to why the value of n is 3? Can we do a similar experiment on OpenImages and see if this holds true across datasets? For OpenImages, if it’s easier you can try using the smaller subset of OpenImages curated in thi
1. The paper is clearly written and easy to follow.
1. I think the paper's novelty is very limited. Although I am not familiar with object-centric learning, I believe that constructing multi-scale VAE features has already been applied in many generative tasks [1, 2]. Therefore, the core innovation of this paper, in my opinion, does not warrant a standalone publication. Can the authors explain the difference between MSF and other methods? 2. Since I am not familiar with this field, the Area Chair (AC) and the authors may disregard my opinion
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Image Processing and 3D Reconstruction
