From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning
Hyun Seok Seong, WonJun Moon, Jae-Pil Heo

TL;DR
This paper introduces Synergistic Representation Learning (SRL), a novel method that enables mutual refinement between encoder and decoder in unsupervised video object-centric models, overcoming the reconstruction conflict and achieving state-of-the-art results.
Contribution
SRL establishes a virtuous cycle between encoder and decoder, improving scene decomposition by mutual refinement and stabilizing training with a warm-up phase.
Findings
Achieves state-of-the-art results on video object-centric benchmarks.
Effectively deblurs semantic boundaries in decoder outputs.
Reduces noise in encoder features through decoder supervision.
Abstract
Unsupervised object-centric learning models, particularly slot-based architectures, have shown great promise in decomposing complex scenes. However, their reliance on reconstruction-based training creates a fundamental conflict between the sharp, high-frequency attention maps of the encoder and the spatially consistent but blurry reconstruction maps of the decoder. We identify that this discrepancy gives rise to a vicious cycle: the noisy feature map from the encoder forces the decoder to average over possibilities and produce even blurrier outputs, while the gradient computed from blurry reconstruction maps lacks high-frequency details necessary to supervise encoder features. To break this cycle, we introduce Synergistic Representation Learning (SRL) that establishes a virtuous cycle where the encoder and decoder mutually refine one another. SRL leverages the encoder's sharpness to…
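One half of the vicious cycle described above, a decoder that "averages over possibilities" and becomes blurry, can be illustrated with a toy example. The sketch below is not the authors' method or code. It only demonstrates a general property of squared-error reconstruction under the assumption that the decoder is uncertain about where a boundary lies: the single prediction that minimizes expected MSE is the mean of the sharp candidates, which is a soft ramp rather than a sharp edge. All names (`step_edge`, `mse_optimal_prediction`, `expected_mse`, `num_intermediate_values`) and the 1-D setup are hypothetical and chosen for illustration.

```python
import numpy as np


def step_edge(width, position):
    """1-D sharp boundary: 0 left of `position`, 1 from `position` onward."""
    return (np.arange(width) >= position).astype(float)


def mse_optimal_prediction(candidates):
    """For equally likely targets, the MSE-minimizing single prediction is their mean."""
    return np.mean(candidates, axis=0)


def expected_mse(prediction, candidates):
    """Mean squared error of one prediction, averaged over all candidate targets."""
    return float(np.mean((candidates - prediction) ** 2))


def num_intermediate_values(signal, eps=1e-9):
    """Count entries strictly between 0 and 1, i.e. the width of the blurred transition."""
    return int(np.sum((signal > eps) & (signal < 1 - eps)))


# Assumption: noisy features leave the boundary location ambiguous over five positions.
width = 16
positions = [6, 7, 8, 9, 10]
candidates = np.stack([step_edge(width, p) for p in positions])

blurry = mse_optimal_prediction(candidates)

print("sharp candidate :", candidates[2])
print("MSE-optimal mean:", blurry)
print("blurred transition width:", num_intermediate_values(blurry))
print("expected MSE (mean) :", expected_mse(blurry, candidates))
print("expected MSE (sharp):", expected_mse(candidates[2], candidates))
```

Every sharp candidate has a transition width of zero, while the averaged prediction spreads the boundary over several positions yet still attains the lowest expected error. This is why a reconstruction loss alone gives the decoder no incentive to commit to a sharp boundary when the encoder features are ambiguous.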
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper presents a novel perspective: a vicious cycle between the encoder and decoder in video object-centric learning. 2. The paper is well-organized, and the experimental design is clear.
1. The authors propose a vicious cycle in unsupervised video object-centric learning, where noisy encoder inputs lead to blurry, low-frequency decoder outputs, which in turn fail to refine the encoder's features. However, it remains unclear whether this phenomenon truly exists during training, and whether it worsens as training progresses. Qualitative or quantitative experiments are necessary to justify this claim. 2. In the comparison experiments presented in Table 1, SRL does not show a clear
1 - The problem discussed in the paper is an interesting observation that has not been explored in previous work. 2 - The paper shows better empirical results than the state of the art. 3 - The paper is clean, and the method is elaborated in detail for its different stages.
1 - Although I agree with the identified problem, the proposed solution appears overly complex. It consists of three stages defined by the proportion of training iterations completed. Since different backbones or even datasets may require varying numbers of iterations, the approach seems unlikely to generalize well. Could you please report results with other foundation models such as Dino3 [1] or Franca [2] to verify generalizability? 2 - Recently, several post-training methods have been introd
Intuitive formulation of a widely observed phenomenon, with feasible solutions.
(Writing issues first, though not the most important) W1 --- Line 017 "We identify that this discrepancy gives rise to a vicious cycle; the noisy ...": ";" should be ":". W2 --- Line 248 Equation (4) vs (6): Compared with (6), Equation (4) seems to be missing the $\Sigma$ and $\frac{1}{|P|}$ on the first term. W3 --- Lines 276, 287 Equations (6) and (7): Should be written as one equation, not two. W4 --- Lines 144-145 "representational conflict between the slot attention maps and **reconstruction maps**": "r
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Adversarial Robustness in Machine Learning
