Solving Reasoning Tasks with a Slot Transformer
Ryan Faulkner, Daniel Zoran

TL;DR
The paper introduces the Slot Transformer, a novel architecture combining slot attention and transformers to learn concise, composable abstractions for reasoning about complex scenes and behaviors in video data.
Contribution
It presents the Slot Transformer architecture that effectively models and reasons about temporal and spatial information in videos, outperforming existing baselines.
Findings
Achieves strong performance on CLEVRER, Kinetics-600, and CATER datasets.
Demonstrates robustness in modeling complex behaviors.
Shows effectiveness in predicting from incomplete inputs.
Abstract
The ability to carve the world into useful abstractions in order to reason about time and space is a crucial component of intelligence. In order to successfully perceive and act effectively using senses we must parse and compress large amounts of information for further downstream reasoning to take place, allowing increasingly complex concepts to emerge. If there is any hope to scale representation learning methods to work with real world scenes and temporal dynamics then there must be a way to learn accurate, concise, and composable abstractions across time. We present the Slot Transformer, an architecture that leverages slot attention, transformers and iterative variational inference on video scene data to infer such representations. We evaluate the Slot Transformer on CLEVRER, Kinetics-600 and CATER datasets and demonstrate that the approach allows us to develop robust modeling and…
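The abstract names slot attention as one ingredient of the architecture. As a rough illustration of that mechanism only, the sketch below shows one simplified slot attention iteration in NumPy. It follows the general slot attention idea (slots compete for each input through a softmax over the slot axis) and is not the Slot Transformer's implementation, which this summary does not describe. All names, shapes, and the random projection matrices are assumptions for illustration, and the learned recurrent update and layer normalization used in full slot attention are replaced here by a plain overwrite of the slots.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention_step(slots, inputs, w_q, w_k, w_v, eps=1e-8):
    """One simplified slot attention iteration (illustrative only).

    slots:  (num_slots, d) current slot vectors
    inputs: (num_inputs, d) input features, e.g. flattened image patches
    Returns the updated slots and the (num_inputs, num_slots) attention map.
    """
    d = slots.shape[-1]
    q = slots @ w_q                      # queries come from the slots
    k = inputs @ w_k                     # keys come from the inputs
    v = inputs @ w_v                     # values come from the inputs

    logits = k @ q.T / np.sqrt(d)        # (num_inputs, num_slots)
    # Softmax over the slot axis: slots compete to explain each input.
    attn = softmax(logits, axis=1)
    # Normalize per slot over inputs so each update is a weighted mean.
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
    updates = weights.T @ v              # (num_slots, d)

    # Assumption: overwrite the slots directly instead of a learned update.
    return updates, attn


# Hypothetical sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
num_slots, num_inputs, d = 4, 16, 8
inputs = rng.normal(size=(num_inputs, d))
slots = rng.normal(size=(num_slots, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

for _ in range(3):                       # a few refinement iterations
    slots, attn = slot_attention_step(slots, inputs, w_q, w_k, w_v)
```

Each row of `attn` sums to one, so every input distributes its weight across the slots; that competition is what encourages different slots to specialize on different parts of a scene.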
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings · Layer Normalization
