TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching
Runjia Zeng, Qifan Wang, Qiang Guan, Ruixiang Tang, Lifu Huang, Zhenting Wang, Xueling Zhang, Cheng Han, Dongfang Liu

TL;DR
TokenSeek introduces an instance-aware token ditching method that significantly reduces memory usage during fine-tuning of large language models while maintaining or improving performance, and provides interpretability insights.
Contribution
It proposes a universal plugin for transformer models that employs instance-aware token seeking and ditching, achieving substantial memory savings with stable fine-tuning.
Findings
Requires only 14.8% memory on Llama3.2 1B
Maintains or improves fine-tuning performance
Provides interpretability of token efficiency
Abstract
Fine tuning has been regarded as a de facto approach for adapting large language models (LLMs) to downstream tasks, but the high training memory consumption inherited from LLMs makes this process inefficient. Among existing memory efficient approaches, activation-related optimization has proven particularly effective, as activations consistently dominate overall memory consumption. Although prior arts offer various activation optimization strategies, their data-agnostic nature ultimately results in ineffective and unstable fine tuning. In this paper, we propose TokenSeek, a universal plugin solution for various transformer-based models through instance-aware token seeking and ditching, achieving significant fine-tuning memory savings (e.g., requiring only 14.8% of the memory on Llama3.2 1B) with on-par or even better performance. Furthermore, our interpretable token seeking process…
Peer Reviews
Decision·ICLR 2026 Poster
- The paper is clearly written and generally easy to follow, with only minor typos.
- The proposed TokenSeek method is conceptually simple and practically implementable.
- Experimental results demonstrate competitive performance on multiple downstream tasks (e.g., QA, reasoning) across different LLM architectures such as LLaMA and Qwen.
- The experiments are comprehensive and include ablations and cross-task evaluations, which strengthen empirical credibility.
- Limited Novelty Clarification: The main contribution, token importance-based selection, extends the previous TokenTune framework rather than introducing a fully new paradigm. While TokenSeek improves token importance estimation compared to random selection, the core ideas of memory reduction and generalizability are largely inherited from TokenTune. The paper should better articulate what is fundamentally novel about TokenSeek beyond methodological refinements.
- Comparison to Low-Rank and Part
Solid problem formulation: fine-tuning LLMs is highly memory-intensive, with activations contributing a major share of the cost. An innovative approach that integrates gradient information with context scores to capture a more holistic measure of token importance, addressing the limitation that context-based evaluation alone reflects only intra-sequence relevance, not fine-tuning contribution. Some specific strengths:
- Architecture-agnostic design: The proposed plugin can be ap
- Scalability concerns: The token regrouping step, where tokens are sorted by importance and selectively included for backpropagation, may pose significant implementation and communication challenges in large-scale distributed fine-tuning setups. Synchronizing token importance scores and managing uneven token partitions across devices could offset some of the claimed memory savings.
- Complexity of integration: Although conceptually modular, integrating the method into existing large-scale MEFT p
Combines attention-derived context and gradient saliency to rank tokens per example; empirically more effective and more stable than random selection. Reports 2.8 GiB peak in one setting and ~15% of full-token QLoRA peak on Llama-3.2-1B while maintaining accuracy; the savings are cumulative with PEFT (LoHa/QLoRA). Visualizations show complementary early-token bias from attention and late-token focus from gradients, which helps explain the chosen subset.
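As a rough illustration of the per-example ranking described above, the sketch below combines a context score and a gradient score for each token and keeps the top fraction. This is not the paper's implementation: the min-max normalization, the weighted sum with `alpha`, the `keep_ratio` parameter, and the function names are all assumptions made for illustration, and the input scores are made-up numbers.

```python
from typing import List


def min_max(xs: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def select_tokens(
    context: List[float],
    grad: List[float],
    keep_ratio: float = 0.5,  # hypothetical knob: fraction of tokens kept
    alpha: float = 0.5,       # hypothetical knob: weight on the context score
) -> List[int]:
    """Return the positions of the tokens kept for one example.

    Assumption: importance = alpha * context + (1 - alpha) * gradient,
    with both scores min-max normalized within the sequence.
    """
    if len(context) != len(grad):
        raise ValueError("context and grad must have the same length")
    c, g = min_max(context), min_max(grad)
    scores = [alpha * ci + (1 - alpha) * gi for ci, gi in zip(c, g)]
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore sequence order for the kept positions


# Toy example: attention favors early tokens, gradients favor late tokens.
context_scores = [0.9, 0.6, 0.1, 0.1, 0.2, 0.1]
grad_scores = [0.0, 0.1, 0.1, 0.2, 0.7, 1.0]
kept = select_tokens(context_scores, grad_scores, keep_ratio=0.5)
```

In this toy input the kept positions include both the first token (high context score) and the last tokens (high gradient score), which mirrors the complementary bias the review mentions.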
Need a controlled knob table for each baseline (checkpointing, offloading, micro-batching, seq length, optimizer sharding) and the resulting peak+average memory to ensure apples-to-apples comparisons. Gradient-based scoring requires a partial backward pass; quantify this overhead per step and analyze action oscillations/instability of the selected set across training. Provide seed variance tables.
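One way the requested stability analysis could be approached is to log the selected token positions at each training step and measure the overlap between consecutive selections. The sketch below uses Jaccard overlap for this; the metric choice, the function names, and the toy data are assumptions for illustration, not something the paper or the review specifies.

```python
from typing import List, Sequence


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard overlap of two selections; two empty selections count as identical."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def selection_stability(selections: List[Sequence[int]]) -> List[float]:
    """Overlap between each pair of consecutive selections for one example."""
    return [jaccard(p, q) for p, q in zip(selections, selections[1:])]


# Toy log: positions kept for the same example at three successive steps.
logged = [[0, 4, 5], [0, 4, 5], [0, 1, 5]]
stability = selection_stability(logged)
```

Values near 1.0 would indicate a stable selected set, while values that drop repeatedly would indicate the oscillation the reviewer is concerned about.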
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
