Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

Zhiyuan Li; Rongzhen Zhao; Wenyan Yang; Wenshuai Zhao; Pekka Marttinen; Joni Pajarinen

arXiv:2605.03650·cs.CV·May 12, 2026

TL;DR

This paper proposes a framework for video object-centric learning that replaces learned temporal prediction modules with deterministic bipartite matching, reusing the instance-discriminative features of a frozen self-supervised backbone instead of training a dynamics model.

Contribution

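It introduces Grounded Correspondence, a framework that requires no learnable parameters for temporal modeling: slots are initialized from salient regions in frozen backbone features, and object identity is maintained from frame to frame by Hungarian (bipartite) matching on slot representations.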

Findings

01. Achieves competitive results on the MOVi-D, MOVi-E, and YouTube-VIS datasets.

02. Eliminates the need for learnable temporal dynamics modules.

03. Uses existing backbone features for object correspondence.

Abstract

The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction. We introduce Grounded Correspondence, a framework that replaces learned transition functions with deterministic bipartite matching. Slots initialize from salient regions in frozen backbone features. Frame-to-frame identity is maintained through Hungarian matching on slot representations. The approach requires zero learnable parameters for temporal modeling yet achieves competitive performance on MOVi-D,…
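The frame-to-frame matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`hungarian`, `cosine_distance`, `match_slots`) are hypothetical, the cosine-distance cost is an assumption (the abstract does not state which cost is used), and slots are plain Python lists rather than backbone features. The solver is a textbook Hungarian (Kuhn-Munkres) routine written in pure Python so the example is self-contained.

```python
import math

def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Textbook O(n^2 * m) Hungarian algorithm with row/column potentials.
    Returns a list where entry i is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)      # row potentials (1-indexed)
    v = [0.0] * (m + 1)      # column potentials (1-indexed)
    p = [0] * (m + 1)        # p[j] = row currently matched to column j
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free column: path is complete
                break
        while j0 != 0:       # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment

def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity; the choice of cost is an assumption."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb + eps)

def match_slots(prev_slots, next_slots):
    """For each slot at frame t, return the index of its match at frame t+1."""
    cost = [[cosine_distance(a, b) for b in next_slots] for a in prev_slots]
    return hungarian(cost)

# Toy example: frame t+1 holds the same three objects in a different order.
prev_slots = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
next_slots = [[0.1, 0.9], [0.9, 1.1], [1.0, 0.05]]
assignment = match_slots(prev_slots, next_slots)
# Reorder frame t+1 so that slot i keeps the identity of slot i at frame t.
aligned = [next_slots[j] for j in assignment]
```

In practice a library solver such as `scipy.optimize.linear_sum_assignment` solves the same assignment problem. Because the matching is a deterministic function of the slot features, this step has no parameters to train.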

Code & Models

Repositories

https://magenta-sherbet-85b101.netlify.app
github
