ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
Ali Hojjat, Janek Haberer, Soren Pirk, Olaf Landsiedel

TL;DR
ThinkingViT introduces a dynamic, nested Vision Transformer architecture that adaptively adjusts inference computation based on input difficulty, improving efficiency and accuracy over fixed-budget models.
Contribution
The paper proposes a nested ViT with progressive thinking stages and token recycling, which makes inference compute depend on input difficulty.
Findings
Outperforms nested baselines by up to 2.0 p.p. in accuracy at the same throughput
Achieves up to 2.9 p.p. accuracy gain at equal GMACs on ImageNet-1K
Serves as a plug-in upgrade for ViTs and transfers well to other architectures
Abstract
Vision Transformers (ViTs) deliver state-of-the-art performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent Matryoshka-style Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT first activates a small subset of the most important attention heads to produce an initial prediction. If the prediction confidence exceeds a predefined threshold, inference terminates early. Otherwise, within the same backbone, it activates a larger subset of attention heads and conducts a…
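The confidence-gated control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `elastic_predict`, `small_stage`, `large_stage`, the weight matrix `W`, and the threshold value of 0.9 are all hypothetical, and the two stages are toy linear classifiers standing in for the smaller and larger subsets of attention heads. The sketch also assumes that confidence means the maximum softmax probability, and it does not model head nesting or token recycling.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def elastic_predict(x, small_stage, large_stage, threshold=0.9):
    """Run the cheap stage first; escalate only if it is not confident.

    Returns (predicted_class, confidence, stages_used).
    Confidence is assumed here to be the maximum softmax probability.
    """
    probs = softmax(small_stage(x))
    confidence = float(probs.max())
    if confidence > threshold:
        # Early exit: the first prediction is confident enough.
        return int(probs.argmax()), confidence, 1

    # Otherwise spend more compute on this input.
    probs = softmax(large_stage(x))
    return int(probs.argmax()), float(probs.max()), 2


# Hypothetical stand-ins for the two nested subnetworks:
# a 3-class linear classifier and a sharper version of it.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])


def small_stage(x):
    return W @ x


def large_stage(x):
    return 3.0 * (W @ x)


easy_input = np.array([5.0, 0.0])  # clearly class 0
hard_input = np.array([0.6, 0.5])  # ambiguous between classes 0 and 1

easy_result = elastic_predict(easy_input, small_stage, large_stage)
hard_result = elastic_predict(hard_input, small_stage, large_stage)
print("easy:", easy_result)
print("hard:", hard_result)
```

In this toy setup the easy input exits after the first stage, while the ambiguous input triggers the second stage. Raising the threshold sends more inputs to the larger stage, trading throughput for accuracy.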
Taxonomy
Topics: Sensor Technology and Measurement Systems · Neural Networks and Applications
