Depth-Wise Representation Development Under Blockwise Self-Supervised Learning for Video Vision Transformers
Jonas Römer, Timo Dickscheid

TL;DR
This paper explores blockwise self-supervised learning for video vision transformers and shows that it can produce representations comparable to those of end-to-end training, while offering insight into depth-wise representation development and learning dynamics.
Contribution
The paper introduces a blockwise training method for masked video transformers, analyzes the resulting depth-wise representations, and compares the method with conventional end-to-end training.
Findings
Blockwise training converges and yields representations close to end-to-end baselines.
Higher-level structure emerges earlier in blockwise training.
Later blocks saturate and maintain geometry, with token-level shifts indicating early mixing.
Abstract
End-to-end backpropagation couples all layers through a global error signal, enabling coordinated learning but requiring long-range credit assignment. Motivated by recent progress in blockwise self-supervised learning (BWSSL), we ask whether masked video transformers can be trained without end-to-end backpropagation. Applying BWSSL to masked video modeling remains relatively underexplored and must handle spatiotemporal context and long-range temporal structure. More broadly, analyses that compare BWSSL and end-to-end training in terms of learning dynamics and depth-wise representation development remain sparse. We apply blockwise learning to a masked autoencoding video vision transformer by partitioning the encoder into blocks, each of which is optimized with a local masked reconstruction loss. Across model sizes and partition granularities, training converges and yields representations…
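The training scheme described in the abstract (an encoder split into blocks, each updated only by its own local masked reconstruction loss) can be illustrated with a toy sketch. This is not the authors' implementation: small linear maps stand in for transformer blocks, a 4-value vector stands in for a video clip, and every name here (`Block`, `local_step`, `train_blockwise`, the `pattern` data generator) is hypothetical. The one property it aims to show is that each block receives the previous block's output as a constant, so no gradient crosses a block boundary.

```python
import random

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def masked_mse(pred, target, mask):
    # Mean squared error over masked positions only (mask value 1.0 = masked).
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask)) / sum(mask)

def sample(rng, pattern, n_masked):
    # Toy "clip": a random scaling of a fixed pattern, so masked values
    # are predictable from visible ones. Purely illustrative data.
    s = rng.uniform(-1.0, 1.0)
    x = [s * p for p in pattern]
    masked = set(rng.sample(range(len(x)), n_masked))
    return x, [1.0 if i in masked else 0.0 for i in range(len(x))]

class Block:
    """One encoder block (linear map W) with its own local decoder V."""
    def __init__(self, dim):
        self.dim = dim
        self.W = [[1.0 if i == k else 0.0 for k in range(dim)] for i in range(dim)]
        self.V = [[0.0] * dim for _ in range(dim)]

    def forward(self, h_in):
        h = matvec(self.W, h_in)
        return h, matvec(self.V, h)

    def local_step(self, h_in, target, mask, lr):
        # h_in is treated as a constant: the gradient stops at the block input.
        h, pred = self.forward(h_in)
        n = sum(mask)
        err = [2.0 * m * (p - t) / n for p, t, m in zip(pred, target, mask)]
        grad_h = [sum(self.V[i][k] * err[i] for i in range(self.dim))
                  for k in range(self.dim)]
        for i in range(self.dim):
            for k in range(self.dim):
                self.V[i][k] -= lr * err[i] * h[k]        # local decoder update
                self.W[i][k] -= lr * grad_h[i] * h_in[k]  # block update
        return h  # handed to the next block without any gradient path

def train_blockwise(num_blocks=2, steps=3000, lr=0.05, seed=0):
    rng = random.Random(seed)
    pattern = [1.0, 0.5, -0.5, 1.0]
    blocks = [Block(len(pattern)) for _ in range(num_blocks)]
    eval_set = [sample(rng, pattern, 2) for _ in range(200)]

    def evaluate():
        totals = [0.0] * num_blocks
        for x, mask in eval_set:
            h = [xi * (1.0 - m) for xi, m in zip(x, mask)]  # hide masked values
            for b, block in enumerate(blocks):
                h, pred = block.forward(h)
                totals[b] += masked_mse(pred, x, mask)
        return [t / len(eval_set) for t in totals]

    before = evaluate()
    for _ in range(steps):
        x, mask = sample(rng, pattern, 2)
        h = [xi * (1.0 - m) for xi, m in zip(x, mask)]
        for block in blocks:
            h = block.local_step(h, x, mask, lr)
    return before, evaluate()

if __name__ == "__main__":
    before, after = train_blockwise()
    print("per-block masked loss before:", before)
    print("per-block masked loss after: ", after)
```

An end-to-end baseline would instead compute a single reconstruction loss at the final block and propagate its gradient through every `W`. Changing `num_blocks` in this sketch loosely mirrors the partition granularities mentioned in the abstract.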
Taxonomy
Topics: Advanced Memory and Neural Computing · Domain Adaptation and Few-Shot Learning · Neural Networks and Reservoir Computing
