Efficient Reasoning at Fixed Test-Time Cost via Length-Aware Attention Priors and Gain-Aware Training

Rian Atri

arXiv:2603.09253·cs.LG·March 11, 2026

Efficient Reasoning at Fixed Test-Time Cost via Length-Aware Attention Priors and Gain-Aware Training

Rian Atri

PDF

Open Access

TL;DR

This paper introduces length-aware attention priors and gain-aware training techniques to improve the efficiency and accuracy of small and medium Transformers without increasing test-time computational costs.

Contribution

It proposes novel training components that transfer to broader models, enabling structured attention guidance and dynamic sharpness control without inference overhead.

Findings

01

Reduces validation cross entropy under compute constraints

02

Maintains baseline latency and memory usage at inference

03

Enhances long-span, noisy logit regime performance

Abstract

We study efficient reasoning under tight compute. We ask how to make structured, correct decisions without increasing test time cost. We add two training only components to small and medium Transformers that also transfer to broader differentiable optimizers. First, a length aware attention prior built via fuzzy regime position alignment, RPA, yields a normalized pre softmax bias that guides attention like a structured regularizer while adding no new inference parameters. Second, a minimal gain aware controller, Guardian, nudges attention sharpness only when validation improvements warrant it, following a two timescale policy gradient view of nonconvex optimization. It is disabled at inference. A KL perspective shows softmax of z plus log pi as MAP with KL regularization, grounding the prior in a principled objective. Under strict compute parity on WikiText 2, we reduce validation cross…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdvanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning