Compute Where it Counts: Self Optimizing Language Models
Yash Akhauri, Mohamed S. Abdelfattah

TL;DR
This paper introduces Self-Optimizing Language Models (SOL), which learn to select an efficiency action at each decoding step, spending more computation on hard tokens and less on easy ones to improve quality under a fixed compute budget.
Contribution
The paper pairs a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step, leaving the base model weights unchanged.
Findings
SOL outperforms static and random schedules in quality at the same budget.
SOL achieves up to 7.3% accuracy improvement on MMLU over uniform strategies.
The approach discovers a better quality-efficiency Pareto front across experiments.
Abstract
Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning how much computation to spend per token from within a single model. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train…
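The per-step selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the action grid values, the `LinearPolicy` class, the `relative_cost` proxy, and the random stand-in for the hidden state are all assumptions made for illustration, and the sketch covers action selection only, not training.

```python
import itertools
import math
import random

# Hypothetical action grid: none of these values come from the paper.
ATTN_KEEP = (0.25, 0.5, 1.0)  # fraction of past tokens attended to
MLP_KEEP = (0.5, 1.0)         # fraction of MLP activations kept
ACT_BITS = (4, 8, 16)         # activation quantization bit-width

# Each discrete action jointly fixes all three knobs.
ACTIONS = list(itertools.product(ATTN_KEEP, MLP_KEEP, ACT_BITS))


def relative_cost(action):
    """Toy cost proxy in (0, 1]; 1.0 means full, uncompressed compute."""
    attn, mlp, bits = action
    return (attn + mlp + bits / max(ACT_BITS)) / 3.0


class LinearPolicy:
    """Hypothetical lightweight policy head: hidden state -> logits over ACTIONS."""

    def __init__(self, hidden_dim, seed=0):
        rng = random.Random(seed)
        # One weight row per action; randomly initialised, i.e. untrained.
        self.weights = [
            [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
            for _ in ACTIONS
        ]

    def logits(self, hidden):
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.weights]

    def probs(self, hidden):
        """Numerically stable softmax over the action logits."""
        z = self.logits(hidden)
        m = max(z)
        exps = [math.exp(v - m) for v in z]
        total = sum(exps)
        return [e / total for e in exps]

    def select(self, hidden):
        """Greedy choice of one efficiency action for a single decode step."""
        z = self.logits(hidden)
        return ACTIONS[max(range(len(z)), key=z.__getitem__)]


if __name__ == "__main__":
    policy = LinearPolicy(hidden_dim=8)
    rng = random.Random(1)
    costs = []
    for step in range(5):
        # Stand-in for the frozen LLM's hidden state at this decode step.
        hidden = [rng.gauss(0.0, 1.0) for _ in range(8)]
        action = policy.select(hidden)
        costs.append(relative_cost(action))
        print(step, action, round(costs[-1], 3))
    print("mean relative cost:", round(sum(costs) / len(costs), 3))
```

Because the frozen LLM is never modified, a controller of this shape could in principle be swapped or retrained without touching the base weights; averaging `relative_cost` over a sequence gives one simple way to compare a dynamic schedule against a uniform one at the same budget.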
