Depth-Wise Representation Development Under Blockwise Self-Supervised Learning for Video Vision Transformers

Jonas Römer; Timo Dickscheid

arXiv:2601.09040·cs.CV·January 15, 2026

Open Access

TL;DR

This paper explores blockwise self-supervised learning for video vision transformers. It shows that blockwise training can produce representations comparable to those from end-to-end training, and it offers insight into how representations develop across depth and how learning dynamics differ.

Contribution

The paper applies blockwise training to masked video transformers, analyzes the resulting depth-wise representations, and compares them with those from conventional end-to-end training.

Findings

01 Blockwise training converges and yields representations close to end-to-end baselines.

02 Higher-level structure emerges earlier in blockwise training.

03 Later blocks saturate and maintain geometry, with token-level shifts indicating early mixing.

Abstract

End-to-end backpropagation couples all layers through a global error signal, enabling coordinated learning but requiring long-range credit assignment. Motivated by recent progress in blockwise self-supervised learning (BWSSL), we ask whether masked video transformers can be trained without end-to-end backpropagation. Applying BWSSL to masked video modeling remains relatively underexplored and must handle spatiotemporal context and long-range temporal structure. More broadly, analyses that compare BWSSL and end-to-end training in terms of learning dynamics and depth-wise representation development remain sparse. We apply blockwise learning to a masked autoencoding video vision transformer by partitioning the encoder into blocks, each of which is optimized with a local masked reconstruction loss. Across model sizes and partition granularities, training converges and yields representations…
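
The partitioning step described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names `partition_encoder` and `trainable_layers`, the choice of contiguous near-equal blocks, and the 12-layer depth are all hypothetical, since this summary does not say how the encoder is split. The sketch only shows which layers a given local masked reconstruction loss would update; the loss itself is omitted.

```python
from typing import List


def partition_encoder(depth: int, num_blocks: int) -> List[range]:
    """Split layer indices 0..depth-1 into contiguous, near-equal blocks.

    Hypothetical helper: equal-sized contiguous blocks are an assumption.
    """
    if not 1 <= num_blocks <= depth:
        raise ValueError("num_blocks must be between 1 and depth")
    base, extra = divmod(depth, num_blocks)
    blocks: List[range] = []
    start = 0
    for b in range(num_blocks):
        # Spread any remainder over the first `extra` blocks.
        size = base + (1 if b < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def trainable_layers(blocks: List[range], block_index: int) -> List[int]:
    """Layers that the local loss of one block would update.

    Under blockwise training, no gradient crosses a block boundary, so the
    layers of earlier blocks only supply inputs and are not listed here.
    """
    return list(blocks[block_index])


# Assumed example: a 12-layer encoder split into 4 blocks.
blocks = partition_encoder(depth=12, num_blocks=4)
print([list(b) for b in blocks])
print(trainable_layers(blocks, 1))
```

With `num_blocks=1` the single block spans the whole encoder, which corresponds to the end-to-end case; larger values give finer partition granularities.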

Taxonomy

Topics: Advanced Memory and Neural Computing · Domain Adaptation and Few-Shot Learning · Neural Networks and Reservoir Computing