Dynamic Mode Decomposition along Depth in Vision Transformers

Nishant Suresh Aswani; Saif Eddin Jabari

arXiv:2605.07556·cs.CV·May 11, 2026

TL;DR

This paper tests whether depth in vision transformers can be modeled as autonomous linear dynamics. It uses Dynamic Mode Decomposition to fit a single operator applied recurrently across a contiguous span of blocks, and examines how well that linearization holds at different depths.

Contribution

It applies Dynamic Mode Decomposition to four pretrained DINO ViTs, fitting a single operator $K$ from consecutive hidden-state pairs, and shows how well powers of $K$ approximate multi-block transformations and how this changes with depth.

Findings

1. $K^{p}$ closely tracks endpoint maps for short spans
2. Operators compress to low rank with minimal data at early layers
3. Linearization properties decay monotonically with depth

Abstract

Recent work has shown that contiguous vision transformer (ViT) blocks (a) can be replaced by a linear map and (b) organize into recurrent phases of computation. We ask whether these observations coincide: does ViT depth implement approximately autonomous linear dynamics, admitting a single operator $K$ applied recurrently across a contiguous span? We test this using Dynamic Mode Decomposition (DMD), which fits $K$ from selected, consecutive hidden-state pairs and predicts $p$ steps ahead via $K^{p}$. On four pretrained DINO ViTs, we study the regularization, rank, and calibration budget required for stable fitting. For short spans ($p \leq 4$), $K^{p}$ tracks an unconstrained endpoint map to within $0.02$ cosine similarity on DINOv3-H/16+, while also recovering intermediate activations at each skipped block. At early cut starts, the fitted operators compress to rank $\ll d$ with…
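
The DMD procedure described in the abstract (fit one operator $K$ from consecutive hidden-state pairs, then predict $p$ steps ahead with $K^{p}$) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a ridge-regularized least-squares fit, a row-vector convention (`y ≈ x @ K`), and synthetic states generated by a known linear map standing in for real ViT hidden states. The names `fit_dmd_operator`, `predict_ahead`, and `mean_cosine` are hypothetical.

```python
import numpy as np

def fit_dmd_operator(X, Y, ridge=1e-6):
    """Fit K minimizing ||Y - X K||_F^2 + ridge * ||K||_F^2.

    X, Y: arrays of shape (n_samples, d); each row of Y is the state
    one step after the matching row of X. The ridge penalty is an
    assumption of this sketch, not taken from the paper.
    """
    d = X.shape[1]
    gram = X.T @ X + ridge * np.eye(d)
    return np.linalg.solve(gram, X.T @ Y)  # shape (d, d)

def predict_ahead(K, x0, p):
    """Apply the same operator p times: x0 @ K^p."""
    return x0 @ np.linalg.matrix_power(K, p)

def mean_cosine(a, b):
    """Mean row-wise cosine similarity between two (n, d) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

# Synthetic stand-in for hidden states: a fixed linear map plus small noise.
rng = np.random.default_rng(0)
d, n_tokens, p = 16, 256, 4
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
A_true = 0.95 * Q  # hypothetical ground-truth dynamics

states = [rng.normal(size=(n_tokens, d))]
for _ in range(p):
    noise = 0.01 * rng.normal(size=(n_tokens, d))
    states.append(states[-1] @ A_true + noise)

# Stack consecutive pairs (layer l, layer l+1) across the whole span.
X = np.concatenate(states[:-1], axis=0)
Y = np.concatenate(states[1:], axis=0)

K = fit_dmd_operator(X, Y)
pred = predict_ahead(K, states[0], p)
cos = mean_cosine(pred, states[p])
print(f"cosine similarity of K^{p} prediction vs. true endpoint: {cos:.4f}")
```

Because a single `K` is shared across every step of the span, the fit tests the autonomous assumption directly: if the per-step maps differed substantially, `K^p` would drift away from the true endpoint as `p` grows.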
