QTALE: Quantization-Robust Token-Adaptive Layer Execution for LLMs

Kanghyun Noh; Jinheon Choi; Yulhwa Kim

arXiv:2602.10431 · cs.LG · February 26, 2026

TL;DR

QTALE is a framework that combines token-adaptive layer execution with quantization in large language models, maintaining accuracy while reducing computational and memory costs for efficient deployment.

Contribution

QTALE introduces a novel training and post-training mechanism that enables seamless integration of token-adaptive execution with quantization, preserving model accuracy.

Findings

1. Achieves less than 0.5% accuracy gap on CommonsenseQA benchmarks.
2. Enables flexible adjustment of execution ratio at inference.
3. Reduces FLOPs and memory footprint simultaneously.

Abstract

Large language models (LLMs) demand substantial computational and memory resources, posing challenges for efficient deployment. Two complementary approaches have emerged to address these issues: token-adaptive layer execution, which reduces floating-point operations (FLOPs) by selectively bypassing layers, and quantization, which lowers memory footprint by reducing weight precision. However, naively integrating these techniques leads to additional accuracy degradation due to reduced redundancy in token-adaptive models. We propose QTALE (Quantization-Robust Token-Adaptive Layer Execution for LLMs), a novel framework that enables seamless integration of token-adaptive execution with quantization while preserving accuracy. Conventional token-adaptive methods reduce redundancy in two ways: (1) by limiting the diversity of training paths explored during fine-tuning, and (2) by lowering the…
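The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the QTALE implementation: the scalar router, the sigmoid gate, the `threshold` parameter, and the helper names `run_token` and `fake_quantize` are all hypothetical, and the quantizer is plain symmetric round-to-nearest rather than any method the paper uses.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def run_token(hidden, layers, routers, threshold):
    """Pass one token's hidden state through a stack of residual layers.

    A layer runs only if its (hypothetical) router assigns the token an
    execution probability of at least `threshold`; otherwise the token
    bypasses the layer through the residual path.
    Returns the final hidden state and the fraction of layers executed.
    """
    executed = 0
    for layer, router in zip(layers, routers):
        p_execute = sigmoid(router(hidden))
        if p_execute >= threshold:
            update = layer(hidden)
            hidden = [h + u for h, u in zip(hidden, update)]  # residual add
            executed += 1
    return hidden, executed / len(layers)

def fake_quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization followed by dequantization.

    Assumption: a generic uniform quantizer, shown only to illustrate how
    lowering weight precision perturbs the weights.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized], scale

# Toy stand-ins: each "layer" scales the hidden state, each "router"
# returns a fixed logit regardless of the input.
layers = [lambda h, s=s: [s * x for x in h] for s in (0.1, 0.2, 0.3, 0.4)]
routers = [lambda h, b=b: b for b in (2.0, -2.0, 0.5, -0.5)]

_, ratio_low = run_token([1.0, -1.0], layers, routers, threshold=0.3)
_, ratio_high = run_token([1.0, -1.0], layers, routers, threshold=0.7)
print(ratio_low, ratio_high)  # a higher threshold executes fewer layers

weights = [0.7, -0.35, 0.12, 0.0]
dequantized, scale = fake_quantize(weights, bits=4)
print(dequantized)
```

In this toy setup, raising the threshold lowers the fraction of layers executed per token, which is the kind of inference-time execution-ratio adjustment listed in the findings. Skipping reduces compute, while quantization reduces the memory needed for weights.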

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The paper targets a timely issue: reducing LLM inference costs in both computation and memory. It correctly observes that token-adaptive skipping and low-bit quantization are complementary, but the naive combination fails. Addressing this gap is practically important. The proposed methods are clearly described and sound. Introducing an entropy regularizer on router outputs to maintain path diversity is a principled idea (inspired by dropout/stochastic-depth). The inference-time threshold mechan
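The entropy regularizer on router outputs mentioned in this review could look roughly like the sketch below. The exact form used in the paper is not given in this excerpt, so the binary-entropy term, the subtraction from the task loss, and the names `binary_entropy`, `regularized_loss`, and `entropy_weight` are assumptions for illustration only.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli execute/skip decision with P(execute) = p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def regularized_loss(task_loss, router_probs, entropy_weight):
    """Task loss minus a weighted mean router entropy (assumed form).

    Subtracting the entropy rewards routers whose decisions stay uncertain,
    so more execute/skip paths are explored during fine-tuning.
    """
    mean_entropy = sum(binary_entropy(p) for p in router_probs) / len(router_probs)
    return task_loss - entropy_weight * mean_entropy

# Near-deterministic routers earn almost no entropy bonus ...
confident = regularized_loss(1.0, [0.99, 0.01, 0.98], entropy_weight=0.1)
# ... while routers close to 0.5 lower the loss noticeably.
diverse = regularized_loss(1.0, [0.5, 0.6, 0.4], entropy_weight=0.1)
print(confident, diverse)
```

Under this assumed form, `entropy_weight` trades task accuracy against path diversity, which is why it appears as an extra hyperparameter to tune.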

Weaknesses

The ideas, while useful, are relatively incremental. Entropy or diversity regularization is a known trick (e.g., in stochastic-depth, dropout), and thresholding probabilities is conceptually simple. It would strengthen the paper if the authors discussed related techniques in conditional computation or prior gating papers more explicitly. As it stands, the novelty claim rests on applying these ideas to the quantization setting. QTALE introduces extra hyperparameters (entropy weight λ₂ and thresh

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper is well-structured and clearly written. The authors effectively illustrate the key challenges through intuitive figures and provide corresponding solutions that are logically motivated and easy to follow.
2. The paper offers a thorough background discussion, particularly in Section 2.2, which helps readers who may not be familiar with token-adaptive layer execution to understand the motivation and technical context of the work.
3. The authors conduct detailed ablation experiments to

Weaknesses

1. The motivation for QTALE could be further elaborated. In Section 3.1, the authors identify reduced training-path redundancy as a property of token-adaptive models, but it remains unclear why this phenomenon specifically leads to accuracy degradation when integrating token-adaptive layer execution with quantization. Providing either a more detailed theoretical explanation or supporting empirical evidence would make the motivation more convincing.
2. The connection between the proposed QTALE fr

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* The motivation is clear, addressing a practical but overlooked incompatibility between quantization and token-adaptive execution.
* The proposed solution is conceptually simple yet grounded in a solid understanding of redundancy loss in adaptive models.
* Strong empirical coverage across multiple datasets and model scales; results consistently show improved robustness under quantization compared to D-LLM.
* Inference-time controllability via a single threshold provides flexibility for deplo

Weaknesses

* The method’s generality beyond AWQ quantization and LLaMA-based architectures remains untested.
* The claim of being “quantization-robust” would be stronger if supported by additional analysis under more aggressive or diverse quantization regimes (e.g., 2-bit or mixed-precision settings), even if such configurations are mainly diagnostic rather than practical for deployment.

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Machine Learning in Materials Science · Topic Modeling