Dynamic Mode Decomposition along Depth in Vision Transformers
Nishant Suresh Aswani, Saif Eddin Jabari

TL;DR
This paper asks whether vision transformer depth can be modeled as autonomous linear dynamics. It uses Dynamic Mode Decomposition (DMD) to fit a single operator that is applied recurrently across a contiguous span of blocks, and examines how well this linearization holds at different depths.
Contribution
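The paper applies Dynamic Mode Decomposition to four pretrained DINO ViTs. It studies the regularization, rank, and calibration budget needed for a stable fit, how closely the fitted operators approximate the blocks they stand in for, and how these properties change with depth.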
Findings
K^p closely tracks endpoint maps for short spans
Operators compress to low rank with minimal data at early layers
Linearization properties decay monotonically with depth
Abstract
Recent work has shown that contiguous vision transformer (ViT) blocks (a) can be replaced by a linear map and (b) organize into recurrent phases of computation. We ask whether these observations coincide: does ViT depth implement approximately \textit{autonomous linear} dynamics, admitting a single operator applied recurrently across a contiguous span? We test this using Dynamic Mode Decomposition (DMD), which fits an operator K from selected, consecutive hidden-state pairs and predicts p steps ahead via K^p. On four pretrained DINO ViTs, we study the regularization, rank, and calibration budget required for stable fitting. For short spans (), K^p tracks an unconstrained endpoint map to within cosine similarity on DINOv3-H/16+, while also recovering intermediate activations at each skipped block. At early cut starts, the fitted operators compress to rank with…
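As a rough illustration of the DMD setup described in the abstract, the sketch below fits one operator K from consecutive hidden-state pairs and predicts p steps ahead with K^p. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`fit_dmd_operator`, `predict_ahead`), the ridge-regularized least-squares fit, the row-vector convention h_{l+1} ≈ h_l K, and the synthetic data are all illustrative choices, and no real ViT activations are used.

```python
import numpy as np


def fit_dmd_operator(states, ridge=1e-6):
    """Fit a single operator K with h_{l+1} ~= h_l @ K (row-vector convention).

    states: list of arrays of shape (n_tokens, d), one per consecutive block.
    ridge:  assumed Tikhonov penalty; the paper's regularizer may differ.
    """
    X = np.concatenate(states[:-1], axis=0)  # inputs  h_l
    Y = np.concatenate(states[1:], axis=0)   # targets h_{l+1}
    d = X.shape[1]
    # Ridge normal equations: (X^T X + ridge * I) K = X^T Y
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ Y)


def predict_ahead(h0, K, p):
    """Predict the state p blocks ahead by applying K recurrently: h0 @ K^p."""
    return h0 @ np.linalg.matrix_power(K, p)


def mean_cosine(a, b):
    """Mean per-token cosine similarity between two (n_tokens, d) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))


# Synthetic stand-in for hidden states (hypothetical, not ViT activations):
# states evolve under a fixed contractive operator plus small noise.
rng = np.random.default_rng(0)
d, n_tokens, n_steps = 8, 64, 4
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
K_true = 0.95 * Q

states = [rng.standard_normal((n_tokens, d))]
for _ in range(n_steps):
    noise = 1e-3 * rng.standard_normal((n_tokens, d))
    states.append(states[-1] @ K_true + noise)

K = fit_dmd_operator(states)
p = 3
pred = predict_ahead(states[0], K, p)
mean_cos = mean_cosine(pred, states[p])
print(f"mean cosine similarity at p={p}: {mean_cos:.4f}")
```

On real activations, `states` would hold the hidden states collected at consecutive blocks for a calibration batch, and the prediction would be compared against both the true activation p blocks later and an unconstrained linear map fitted directly between the two endpoints.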
