ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training

Feijiang Han; Xiaodong Yu; Jianheng Tang; Delip Rao; Weihua Du; Lyle Ungar

arXiv:2505.11739·cs.CL·February 12, 2026


Open Access · 3 Reviews

TL;DR

ZeroTuning is a simple, training-free method that enhances large language models by tuning only the initial token's attention, leading to significant performance improvements across multiple tasks without additional training or decoding modifications.

Contribution

The paper introduces ZeroTuning, a novel approach that improves LLMs by applying head-specific attention adjustments to the initial token, eliminating the need for parameter updates.

Findings

1. ZeroTuning improves performance across 15 datasets.
2. It outperforms prior complex methods in accuracy.
3. It maintains gains with quantized inference and longer contexts.

Abstract

Token-level attention tuning, a class of training-free methods including Post-hoc Attention Steering (PASTA) and Attention Calibration (ACT), has emerged as a promising approach for improving frozen LLMs via interpretable interventions. However, these methods rely on auxiliary heuristics to identify important task-specific tokens, which can introduce bias and limit applicability when token importance is ambiguous or when optimized kernels make attention maps inaccessible. We propose a simpler alternative: intervening only on the initial token (e.g., BOS in LLaMA). We theoretically show that adding lightweight biases to this token's attention logits systematically shifts and reshapes downstream attention patterns, an effect amplified by its natural role as an attention sink. Empirically, we find that this tuning can improve LLM performance and better elicit pretrained knowledge, with…
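The core intervention described in the abstract can be illustrated with a toy, single-head example. This is a minimal sketch, not the authors' implementation: it assumes the bias takes the form log(γ) added to the initial token's logit, which is equivalent to multiplying that token's attention weight by γ and renormalizing. The function names and the example logits are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def scale_initial_token(logits, gamma):
    """Hypothetical helper: add log(gamma) to the first token's logit.

    Assumption: this mirrors 'adding a lightweight bias' to the initial
    token; it equals scaling that token's attention by gamma and
    renormalizing the whole row.
    """
    biased = np.array(logits, dtype=float)
    biased[0] += np.log(gamma)
    return softmax(biased)

# Toy attention logits for one query; index 0 plays the role of BOS (a sink).
logits = np.array([3.0, 1.0, 0.5, 0.0])
base = softmax(logits)

for gamma in (0.5, 1.0, 2.0):
    tuned = scale_initial_token(logits, gamma)
    # Gap between two non-initial tokens: shrinks for gamma > 1, grows for gamma < 1.
    gap = tuned[1] - tuned[3]
    print(f"gamma={gamma}: initial={tuned[0]:.3f}, gap(1,3)={gap:.3f}")
```

Under this assumption, the ratios between non-initial tokens are unchanged while their absolute differences are divided by (γ − 1)·a₀ + 1, where a₀ is the original attention on the initial token. A large a₀ (the attention-sink case) therefore makes the same γ have a stronger effect.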

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- **Simple and broadly applicable idea**. The notion that adjusting only the first token's attention can improve diverse tasks is conceptually elegant and easy to integrate, requiring no retraining or architectural modification.
- **Comprehensive empirical coverage**. The paper evaluates across multiple datasets, models, and settings (few-shot, quantized, SDPA/FlashAttention), providing reasonable evidence of generality.
- **Clarity and presentation**. The paper is clearly written and well-str

Weaknesses

- **Theoretical over-reach**. The claim that BOS scaling "monotonically controls attention entropy" lacks formal proof; the derivation only handles pairwise attention differences, not entropy. This gap weakens the conceptual basis of the unsupervised variant.
- **Transductive unsupervised tuning**. The unsupervised version minimizes entropy on test inputs, while baselines are not given equivalent unsupervised access, overstating generalization gains.
- **No statistical robustness**. All result

Reviewer 02 · Rating 6 · Confidence 3

Strengths

In short: the paper identifies a control lever in large language models (LLMs), the initial token (such as `<BOS>`), and shows how modulating its attention yields performance gains. The method is practically appealing since it is lightweight (just a few lines of code to scale attention) and kernel-agnostic.
- Provides experimental analysis of how scaling the attention weight of the initial token affects the downstream distribution of attention among other tokens, experiments showing that

Weaknesses

The following are some limitations I see in the paper:
- Limited model generalization: most of the analysis and findings in Section 3 rely on just one model, Llama-3.1-8B-Instruct, raising the concern that some of those effects may be model-specific. I would suggest adding findings for the other models tried as well, such as Qwen or DeepSeek, to show generality.
- Huge hyperparameter tuning overhead: this method introduces a huge number of hyperparameters to tune: task-specific tuning, layer-wise t

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. This work theoretically and empirically demonstrates that the initial tokens can function as a reliable controller for the attention dynamics, which is also strongly related to the next-token prediction entropy. Moreover, the systematic head-wise and layer-wise initial token scaling analysis provides more insights and reliable motivation for the proposed ZeroTuning method.
2. The proposed plug-and-play attention adjustment, ZeroTuning, is simple yet effective and well-motivated both empiric

Weaknesses

1. While Table 5 shows that ZeroTuning improves even with fixed γ and scales with more search, the paper does not quantify the time/energy required for Level-0/1/2 nor its trade-off with accuracy.
2. γ is calibrated per dataset, its robustness to distribution shifts and mis-specified γ is unclear, and the cost of head classification is not reported either.
3. Because ZeroTuning controls attention by scaling the initial sink token, its effect can fade or fluctuate over very long contexts


Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Softmax · Attention Is All You Need