ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference

Ali Hojjat; Janek Haberer; Soren Pirk; Olaf Landsiedel

arXiv:2507.10800 · cs.CV · March 27, 2026

TL;DR

ThinkingViT introduces a dynamic, nested Vision Transformer architecture that adaptively adjusts inference computation based on input difficulty, improving efficiency and accuracy over fixed-budget models.

Contribution

The paper proposes a nested ViT with progressive thinking stages and token recycling, which enables input-dependent inference and higher accuracy than nested baselines at matched compute.

Findings

1. Outperforms nested baselines by up to 2.0 p.p. in accuracy at the same throughput.
2. Achieves an accuracy gain of up to 2.9 p.p. at equal GMACs on ImageNet-1K.
3. Serves as a plug-in upgrade for ViTs and transfers well to other architectures.

Abstract

Vision Transformers (ViTs) deliver state-of-the-art performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent Matryoshka-style Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to every input, regardless of its complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT first activates a small subset of the most important attention heads to produce an initial prediction. If the prediction confidence exceeds a predefined threshold, inference terminates early. Otherwise, within the same backbone, it activates a larger subset of attention heads and conducts a…
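The early-exit control flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stage functions `small_stage` and `large_stage` are hypothetical stand-ins for running the backbone with fewer or more attention heads, the confidence measure (maximum softmax probability) is an assumption, and token recycling between stages is not modeled.

```python
import math
from typing import Callable, List, Sequence, Tuple

Stage = Callable[[Sequence[float]], List[float]]  # maps an input to class logits


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def progressive_predict(
    x: Sequence[float], stages: Sequence[Stage], threshold: float
) -> Tuple[int, float, int]:
    """Run stages from cheapest to most expensive and stop once confident.

    Returns (predicted_class, confidence, index_of_stage_used).
    The last stage always answers, even if it stays below the threshold.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    label, confidence, used = -1, 0.0, -1
    for used, stage in enumerate(stages):
        probs = softmax(stage(x))
        confidence = max(probs)  # assumed confidence measure
        label = probs.index(confidence)
        if confidence > threshold:  # "exceeds a predefined threshold"
            break
    return label, confidence, used


# Hypothetical toy stages: in the real model these would be one backbone
# evaluated with a small and then a larger subset of attention heads.
def small_stage(x: Sequence[float]) -> List[float]:
    return [x[0], 0.0]


def large_stage(x: Sequence[float]) -> List[float]:
    return [0.0, 6.0 * x[0]]


STAGES = [small_stage, large_stage]

if __name__ == "__main__":
    for sample in ([4.0], [0.5]):
        label, conf, used = progressive_predict(sample, STAGES, threshold=0.9)
        print(f"input={sample} class={label} confidence={conf:.3f} stage={used}")
```

With a threshold of 0.9, the first toy input exits after the small stage, while the second falls below the threshold and is escalated to the larger stage. Raising the threshold sends more inputs to the expensive stage, which is the accuracy-versus-compute trade-off that an elastic model of this kind exposes.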

Taxonomy

Topics: Sensor Technology and Measurement Systems · Neural Networks and Applications