Adaptive Task Vectors for Large Language Models
Joonseong Kang, Soojeong Lee, Subeen Park, Sumin Park, Taero Kim, Jihee Kim, Ryunyi Lee, Kyungwoo Song

TL;DR
This paper introduces Adaptive Task Vectors (ATV), a dynamic method for conditioning large language models on specific tasks by generating task vectors tailored to each input, improving adaptability and generalization over fixed-vector methods.
Contribution
The paper proposes ATV, a novel framework that dynamically generates task vectors conditioned on each input, enhancing LLM adaptation and generalization beyond fixed demonstration-based approaches.
Findings
ATV outperforms fixed-vector methods on unseen tasks.
Theoretical analysis argues that ATV's expressiveness exceeds that of Prefix-Tuning, under a linearized view of attention.
ATV is equivalent to LoRA in its next-token distribution when rank and placements are matched.
Abstract
In-Context Learning (ICL) enables Large Language Models (LLMs) to perform tasks without parameter updates by conditioning on a few demonstrations provided in the prompt. Despite its success, ICL suffers from several limitations, including sensitivity to demonstration order, context length constraints, and computational inefficiency. To address these challenges, task vector-based approaches compress task information into a single vector. However, these methods typically construct task vectors from fixed sets of demonstrations and reuse them across input queries, without conditioning on the specific input. This limitation can lead models to struggle with effective adaptation when the input query is not well aligned with the underlying demonstrations, consequently degrading their generalization performance on unseen tasks. To overcome this limitation, we propose Adaptive Task Vectors…
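The difference between a fixed task vector and an input-conditioned one can be illustrated with a minimal sketch. It is not the authors' implementation: the sizes, the random `W_expand` matrix, and the function names `adaptive_task_vectors` and `inject` are all hypothetical, the query embedding is a random stand-in for the output of a small generator model, and the per-layer hidden states are treated as a precomputed array rather than produced by a real forward pass of a frozen LLM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: generator embedding width, target model width, layer count.
D_SMALL, D_MODEL, N_LAYERS = 8, 16, 4

# Hypothetical linear expansion: maps a query embedding to one steering
# vector per layer of the (frozen) target model.
W_expand = rng.normal(scale=0.1, size=(D_SMALL, N_LAYERS * D_MODEL))


def adaptive_task_vectors(query_embedding):
    """Return an (N_LAYERS, D_MODEL) array of per-layer vectors for one query."""
    return (query_embedding @ W_expand).reshape(N_LAYERS, D_MODEL)


def inject(hidden_states, task_vectors, scale=1.0):
    """Add each layer's vector to that layer's hidden states.

    hidden_states has shape (N_LAYERS, seq_len, D_MODEL); each vector is
    broadcast over the token dimension.
    """
    return hidden_states + scale * task_vectors[:, None, :]


# Stand-ins for the embeddings a small generator would produce for two queries.
q1 = rng.normal(size=D_SMALL)
q2 = rng.normal(size=D_SMALL)

# Unlike a fixed task vector, the vectors differ from query to query.
v1 = adaptive_task_vectors(q1)
v2 = adaptive_task_vectors(q2)

# Stand-in hidden states for a 5-token input, steered with the first query's vectors.
hidden = rng.normal(size=(N_LAYERS, 5, D_MODEL))
steered = inject(hidden, v1)
```

A fixed-vector method would correspond to calling `inject` with the same `task_vectors` for every query; here only the generator and the expansion would be trained, while the target model's weights stay untouched.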
Peer Reviews
Decision: Submitted to ICLR 2026
- Clear articulation of a simple, modular pipeline that keeps the target LLM frozen while learning a lightweight generator and linear expansion, with a consistent injection interface across layers.
- Theoretical framing that precisely scope-limits claims to next-token distributions and gives a clean equivalence-to-LoRA result under matched rank and placements, plus a principled argument for subsuming Prefix-Tuning under a linearized attention view.
- Broad empirical sweep over a standardized ELI…

- The LoRA equivalence and superiority-over-Prefix claims rest on constrained scopes and approximations: the LoRA equivalence focuses only on next-token distribution with matched placements and static ATV, and the Prefix result relies on a linear attention approximation that may diverge from real softmax attention in practice.
- The empirical fairness of baselines is questionable in places. For example, the LoRA setup departs from the original learning rate due to poor performance (adjusted to 4…

- Simple and computationally light steering mechanism for frozen LLMs.
- Clear motivation to reduce inference-time token overhead compared to ICL.
- Interesting empirical finding that early-layer injection performs best.
- Theoretical framing situates ATV among PEFT methods (LoRA, prefix, prompt tuning).

- I find the framing of the method somewhat misaligned. Despite its name, the proposed vector is generated per query via supervised training and is not shared across examples or tasks. This makes it a query-specific steering signal rather than a reusable task-level representation. A more appropriate baseline would embed fixed demonstrations once into a shared vector that can be reused across queries, achieving comparable token efficiency while preserving task conditioning. Moreover, the notion o…

- Interesting idea: Using a query vector to allow TVs to support different tasks.
- There is some theoretical analysis demonstrating the equivalence between ATV and LoRA.
- Extensive and promising experiments: results show that ATVs beat both baseline TV methods and LoRA, with slightly slower inference speed compared to I2CL.

- The paper lacks some discussion on the training cost of the extra modules.
- According to the paper, the small model is supposed to generate a vector representation of the query. It seems more intuitive to use a language encoder for this purpose. Why does the paper use the decoder-only GPT family models? How does a decoder-only model generate a vector representation? Is it using its last layer hidden state? I would love to see some experiments using encoder-decoder models like the BERT family
Taxonomy
Topics: Machine Learning in Healthcare · Domain Adaptation and Few-Shot Learning · Topic Modeling
