Subjective Depth and Timescale Transformers: Learning Where and When to Compute
Frederico Wieser, Martin Benfeghoul, Haitham Bou Ammar, Jun Wang, and Zafeirios Fountas

TL;DR
This paper introduces Subjective Depth and Timescale Transformers that dynamically allocate computation based on Bayesian surprise signals, significantly improving efficiency while maintaining performance.
Contribution
The paper proposes novel Transformer architectures that learn where and when to compute using Bayesian surprise, reducing computation and memory usage.
Findings
Reduced self-attention computation by 75%
Cut KV-cache requirements by 50%
Showed a shift from novelty-driven to prediction-driven gating during training
Abstract
The rigid, uniform allocation of computation in standard Transformer (TF) architectures can limit their efficiency and scalability, particularly for large-scale models and long sequences. Addressing this, we introduce Subjective Depth Transformers (SDT) and Subjective Timescale Transformers (STT), two distinct architectures that leverage Bayesian surprise signals to dynamically route computation, learning where and when to compute within decoder-only TFs. SDT augments a decoder-only stack with alternating Decision and Dynamic layers: a Decision layer computes a full block 'posterior' and a lightweight 'prior,' while a Dynamic layer employs fixed-capacity Top-K routing based on Bayesian surprise (Expected and Unexpected Change), maintaining a static compute graph. STT extends this conditional computation to the temporal domain: a transition network predicts residual updates, forming a…
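As a rough illustration of the fixed-capacity Top-K routing the abstract describes, the sketch below scores each token by a surprise proxy and lets only a fixed fraction of tokens through a block. It is not the authors' implementation. The function names (`surprise_scores`, `topk_route`, `dynamic_layer`), the use of a plain mean squared difference between "posterior" and "prior" as the score, and the `capacity` parameter are all assumptions made for illustration.

```python
import numpy as np

def surprise_scores(posterior, prior):
    """Per-token surprise proxy (assumed): mean squared posterior-prior gap.

    posterior, prior: arrays of shape (T, d) holding per-token residual updates.
    """
    return np.mean((posterior - prior) ** 2, axis=-1)

def topk_route(scores, capacity):
    """Boolean mask selecting the k = ceil(capacity * T) highest-scoring tokens.

    k depends only on T and capacity, never on the scores themselves,
    so the amount of work per layer is fixed in advance.
    """
    T = scores.shape[-1]
    k = max(1, int(np.ceil(capacity * T)))
    top_idx = np.argsort(-scores, kind="stable")[:k]
    mask = np.zeros(T, dtype=bool)
    mask[top_idx] = True
    return mask

def dynamic_layer(x, block, scores, capacity=0.5):
    """Apply `block` as a residual update to routed tokens only (hypothetical helper).

    Tokens that are not selected pass through unchanged.
    """
    mask = topk_route(scores, capacity)
    out = x.copy()
    out[mask] = x[mask] + block(x[mask])
    return out, mask
```

Because `k` is computed from the sequence length and the capacity alone, the shapes seen by `block` are known ahead of time, which is the property the abstract refers to as a static compute graph.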
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. **Well-motivated signal**: Instead of training an extra router that learns arbitrary scores, the method derives the routing score from an interpretable discrepancy (posterior vs. prior / input), which is more principled and ties to predictive coding.
2. **Hardware-friendly design**: SDT keeps a static graph and fixed Top-K; STT also gives a fixed-capacity variant before introducing the dynamic one. This is practical.
3. **Training-dynamics insight**: they show CE and CU do not dominate at t
1. **Main result is weak**: every conditional model is below the dense baseline, sometimes by a noticeable margin, so the narrative has to lean on “we saved compute.” But the paper reports mostly theoretical savings (e.g. 62.5% self-attention under γ=0.5) rather than real wall-clock / memory numbers on commodity GPUs.
2. **Single scale / single γ**: nearly all conclusions are drawn from Qwen2.5-0.5B with γ=0.5. To claim the method is robust, we need γ∈{0.25,0.5,0.75} or at least “every layer vs
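To make the notion of "theoretical savings" in the review above concrete, here is a back-of-envelope sketch in Python. It assumes that self-attention cost grows quadratically with the number of tokens that attend to each other, and that routed layers alternate with full layers (the `dynamic_fraction` parameter). Both are assumptions, as are the function names. Whether the paper's quoted figures are derived this way is not confirmed by the text here.

```python
def routed_attention_cost(gamma):
    """Relative self-attention cost of one routed layer.

    Assumes cost is quadratic in the number of participating tokens,
    so keeping a fraction gamma of tokens costs gamma**2 of a full layer.
    """
    return gamma ** 2

def stack_attention_cost(gamma, dynamic_fraction=0.5):
    """Average relative attention cost over a stack of layers.

    dynamic_fraction is the (assumed) share of layers that route tokens;
    the remaining layers are counted at full cost 1.0.
    """
    full_share = 1.0 - dynamic_fraction
    return full_share * 1.0 + dynamic_fraction * routed_attention_cost(gamma)

if __name__ == "__main__":
    for gamma in (0.25, 0.5, 0.75):
        print(gamma, routed_attention_cost(gamma), stack_attention_cost(gamma))
```

Under these assumptions, γ=0.5 gives a relative cost of 0.25 for a routed layer and 0.625 averaged over an alternating stack. Such counts say nothing about wall-clock time or memory on real hardware, which is the gap the reviewer points to.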
1. The surprise proxy grounded in a KL/MSE derivation gives a coherent and unified differentiable gate, aligning novelty (CU) and prediction (CE) criteria with predictive-coding literature and enabling direct optimization under LM loss via residual scaling.
2. Fixed-capacity Top-K routing preserves a static compute graph, paralleling MoD while contributing a theoretically motivated score function and a causal router design suitable for autoregressive inference.
3. SDT separates prior formation
1. The core routing signal assumes hidden states are draws from isotropic Gaussians with shared covariance, so $D_{\mathrm{KL}}\big(N(\mu_p,kI)\,\|\,N(\mu_q,kI)\big)=\frac{1}{2k}\|\mu_p-\mu_q\|_2^2$, which the method operationalizes as an MSE over residuals to define the surprise terms $D_{\mathrm{st}}$ and $D_{\mathrm{ch}}$ after a $1/d$ rescaling. However, the paper neither tests the isotropy assumption nor reports any calibration of this proxy against uncertainty or next-token error, leaving validity o
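The identity quoted in this review can be checked numerically. The Python sketch below compares the general closed-form KL divergence between two Gaussians with the isotropic shortcut, and shows how a per-dimension MSE relates to it by the factor 2k/d. The function names and the test values are illustrative assumptions. The check confirms only the algebra, not whether real hidden states satisfy the isotropy assumption the reviewer questions.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """Closed-form KL( N(mu_p, cov_p) || N(mu_q, cov_q) ) for full covariances."""
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (
        np.trace(cov_q_inv @ cov_p)
        + diff @ cov_q_inv @ diff
        - d
        + logdet_q
        - logdet_p
    )

def isotropic_kl(mu_p, mu_q, k):
    """Shortcut when both covariances equal k * I: ||mu_p - mu_q||^2 / (2k)."""
    return np.sum((mu_p - mu_q) ** 2) / (2.0 * k)

def mse_proxy(mu_p, mu_q):
    """Mean squared difference over dimensions, i.e. the 1/d-rescaled squared norm."""
    return np.mean((mu_p - mu_q) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 8, 0.7  # arbitrary illustrative values
    mu_p, mu_q = rng.normal(size=d), rng.normal(size=d)
    cov = k * np.eye(d)
    print(np.isclose(gaussian_kl(mu_p, cov, mu_q, cov), isotropic_kl(mu_p, mu_q, k)))
    print(np.isclose(mse_proxy(mu_p, mu_q), (2 * k / d) * isotropic_kl(mu_p, mu_q, k)))
```

With a shared covariance kI, the trace term equals d and the log-determinant terms cancel, which leaves only the squared distance between the means.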
The paper's writing is mostly clear. The authors do a good job of explaining related work and establishing the relevance of this work in the context of earlier work. The general experimental setting is also sound (see below regarding caveats).
One of my major concerns is that the current method requires the block to be executed as the gating depends on the output of the block. This means that at inference time you still need to compute the block for all the tokens and then use the gating to decide whether or not to apply it. Therefore this only increases the amount of compute. In contrast, MoD leads to actual savings in compute at inference time as you can skip a block simply based on the router's decision. Additionally, in light of
A> The idea of implementing Bayesian surprise is novel.
B> The two networks focus on improving two different aspects of efficiency.
C> The hardware aware design (modifying based on available memory) makes the pipeline versatile.
A> The optimal value of top-k is missing.
B> The experiments are performed on smaller LLM models. Comparison with existing Mixture of models that implement gating (example [1]) is missing.
C> The paper does not discuss any metrics that indicate computational load - for instance, MACs, FLOPs or number of trainable parameters.
[1] Huang, Haiyang, et al. "Toward efficient inference for mixture of experts." Advances in Neural Information Processing Systems 37 (2024): 84033-84059.
Taxonomy
Topics: Software-Defined Networks and 5G · Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization
