Reasoning-Enhanced Object-Centric Learning for Videos
Jian Li, Pu Ren, Yang Liu, Hao Sun

TL;DR
This paper introduces a novel reasoning module called STATM that enhances object-centric video models by improving their perception, prediction, and reasoning abilities, especially in complex scenes, and demonstrates its effectiveness across multiple tasks.
Contribution
The paper proposes the STATM module, a new reasoning component that significantly boosts the performance of object-centric video models in perception and reasoning tasks.
Findings
STATM improves object segmentation and tracking accuracy.
STATM enhances performance in downstream prediction tasks.
STATM is effective in Visual Question Answering (VQA) applications.
Abstract
Object-centric learning aims to break down complex visual scenes into more manageable object representations, enhancing the understanding and reasoning abilities of machine learning systems toward the physical world. Recently, slot-based video models have demonstrated remarkable proficiency in segmenting and tracking objects, but they overlook the importance of an effective reasoning module. In the real world, reasoning and predictive abilities play a crucial role in human perception and object tracking; in particular, these abilities are closely related to human intuitive physics. Inspired by this, we designed a novel reasoning module called the Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes. The memory buffer primarily serves as storage for slot information from upstream modules, the Slot-based Time-Space…
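The abstract is cut off before it describes the transformer itself, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a fixed-length FIFO buffer of per-frame slots and a factorized update in which each slot first attends over its own history (time) and then over the other slots in the current frame (space). The names `SlotMemoryBuffer`, `attend`, and `time_space_step`, the buffer length, and the order of the two attention steps are all hypothetical choices made for illustration.

```python
import math
from collections import deque


class SlotMemoryBuffer:
    """Hypothetical FIFO store holding the slots of the most recent frames."""

    def __init__(self, max_frames):
        # deque with maxlen drops the oldest frame automatically when full
        self.frames = deque(maxlen=max_frames)

    def push(self, slots):
        # Store a copy so later in-place changes do not alter the history
        self.frames.append([list(s) for s in slots])

    def history(self):
        return list(self.frames)


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, values):
    """Scaled dot-product attention of one query over a list of vectors.

    Keys and values are the same vectors here (no learned projections),
    which is a simplification assumed for this sketch.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * v for q, v in zip(query, vec)) / scale for vec in values]
    weights = softmax(scores)
    return [
        sum(w * vec[d] for w, vec in zip(weights, values))
        for d in range(len(query))
    ]


def time_space_step(slots, buffer):
    """Assumed factorized update: time attention, then space attention."""
    history = buffer.history()
    # Time: slot i attends over its own past states plus its current state
    temporal = [
        attend(slot, [frame[i] for frame in history] + [slot])
        for i, slot in enumerate(slots)
    ]
    # Space: each temporally updated slot attends over all slots in the frame
    spatial = [attend(s, temporal) for s in temporal]
    # Residual connection back to the incoming slots
    return [[a + b for a, b in zip(s, m)] for s, m in zip(slots, spatial)]


# Toy usage: two 2-D slots per frame, buffer keeps the last three frames
buffer = SlotMemoryBuffer(max_frames=3)
for t in range(5):
    slots = [[float(t), 1.0], [0.0, float(t)]]
    predicted = time_space_step(slots, buffer)
    buffer.push(slots)  # the buffer stores the upstream slots, not predictions
```

In this sketch the buffer holds the slots produced upstream rather than the module's own outputs, matching the abstract's description of the buffer as storage for slot information from upstream modules. A real model would add learned query, key, and value projections and multiple heads.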
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
The paper is tackling an interesting problem (decomposing dynamic scenes / videos into objects), using a technique inspired by human cognition (people have working memory, and use motion cues to decompose scenes into objects). There is a large community of researchers interested in slot-based models, and this paper provides a general mechanism for how to improve slot-based scene representations by incorporating a history of previous time-steps. The paper is reasonably easy to follow. The method
Overall, while the paper is tackling an established problem with a method inspired by human cognition, there are several issues that would need to be addressed before it could be impactful for the broader community. These can be generally grouped into:
* Adding comparisons to prior work that uses multiple history steps to infer segmentation masks
* Removing false statements about the SAVi/SAVi++ results
* Expanding explanation / background on SAVi for unfamiliar readers
* Toning down / removing
The authors proposed an intuitive model for better modeling the spatial-temporal consistency between object-centric representations during video object-centric learning. Compared with previous models which mainly used simple models like self-attention for linking information between frames, this design considers longer history and more complex attention between slots at different time steps. By only adding this module, we can observe consistent improvement over prior architecture and achieving n
[-] Despite the performance improvement from adding the STATM module, the quantitative results of the backbone models show a rather big gap on several datasets (e.g. SAVi and SAVi++ on MOVi-E compared with results reported in the SAVi++ paper [here](https://browse.arxiv.org/pdf/2206.07764.pdf)). This hinders the evaluation of this paper's contributions; the authors might want to clarify the experimental settings to make these results more convincing.
[-] Following the previous
This paper tackles an important problem of improving object-centric learning models for videos. While spatiotemporal transformers have been applied to Slot Attention-based video models before [1], they have not been trained in an end-to-end fashion, as far as I know. The experimental results show improvements over SAVi and SAVi++, especially on the more complex MOVi-D and MOVi-E datasets, although I have concerns about these results that I state below.
[1] SlotFormer: Unsupervised Visual Dynami
There are a few instances in the paper where the authors make statements that are not well-supported by their experiments. For example,
- The title and abstract emphasize reasoning, but this is not supported by any experiments. I think it is fine to use reasoning as a motivation for their model, but if the title includes “Reasoning-Enhanced”, I would have expected some experiments showing this ability.
- Similarly, I feel several statements connecting their model to human behavior are too strong an
- The Predictor is indeed an under-explored component in SAVi-like models. This paper serves as a preliminary attempt in this direction.
- The ablation studies are thorough.
My biggest concern is regarding the experimental settings:
- The reported performances of the baselines are not obtained with their best training configs. The stated reason, "computation constraint", is not acceptable, as I am not sure whether the performance gain would disappear with longer training and a larger batch size.
- As a result, the performance of STATM is not SOTA. SAVi++ reports an mIoU of 47.1 on MOVi-E, which is much higher than the results in this paper.
- Only comparing with SAVi and SAVi++ is also not enough. There have been sev
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Softmax · Dropout
