Compute Where it Counts: Self Optimizing Language Models

Yash Akhauri; Mohamed S. Abdelfattah

arXiv:2605.10875 · cs.LG · May 12, 2026

TL;DR

This paper introduces Self-Optimizing Language Models (SOL), which learn to choose an efficiency action for each token during decoding, so that compute is allocated dynamically and quality improves at a fixed budget.

Contribution

The paper proposes pairing a frozen LLM with a lightweight policy network that selects a discrete efficiency action at each decode step, adapting inference cost without changing the base model weights.

Findings

01. SOL outperforms static and random schedules in quality at the same budget.

02. SOL achieves up to 7.3% accuracy improvement on MMLU over uniform strategies.

03. The approach discovers a better quality-efficiency Pareto front across experiments.

Abstract

Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning how much computation to spend per token from within a single model. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train…
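
The per-step selection described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the option grids (`ATTN_KEEP`, `MLP_KEEP`, `ACT_BITS`), the `relative_cost` proxy, and the single-layer `LinearPolicy` with random, untrained weights are hypothetical stand-ins, not the paper's action space, cost model, or policy architecture. The sketch only shows the control flow: a policy reads a hidden state and picks one discrete action that jointly sets all three knobs.

```python
import itertools
import random

# Hypothetical option grids; the abstract names the three knobs but not their values.
ATTN_KEEP = (0.25, 0.5, 1.0)  # fraction of past tokens attended to
MLP_KEEP = (0.5, 1.0)         # fraction of MLP activations kept
ACT_BITS = (4, 8, 16)         # activation quantization bit-width

# Each discrete action jointly fixes all three knobs.
ACTIONS = list(itertools.product(ATTN_KEEP, MLP_KEEP, ACT_BITS))


def relative_cost(action):
    """Toy cost proxy in (0, 1]; 1.0 means full compute (an assumption)."""
    attn, mlp, bits = action
    return (attn + mlp + bits / max(ACT_BITS)) / 3.0


class LinearPolicy:
    """Stand-in for the lightweight policy: one linear layer, untrained."""

    def __init__(self, hidden_dim, num_actions, seed=0):
        rng = random.Random(seed)
        self.weights = [
            [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
            for _ in range(num_actions)
        ]

    def logits(self, hidden_state):
        # One score per action: dot product of its weight row with the state.
        return [sum(w * h for w, h in zip(row, hidden_state))
                for row in self.weights]

    def select(self, hidden_state):
        # Greedy choice: index of the highest-scoring action.
        scores = self.logits(hidden_state)
        return max(range(len(scores)), key=scores.__getitem__)


def decode_with_policy(hidden_states, policy):
    """Pick one action per decode step and return (actions, mean cost)."""
    chosen = [ACTIONS[policy.select(h)] for h in hidden_states]
    if not chosen:
        return [], 0.0
    mean_cost = sum(relative_cost(a) for a in chosen) / len(chosen)
    return chosen, mean_cost


# Usage: random vectors stand in for the frozen LLM's hidden states.
rng = random.Random(1)
hidden_states = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
policy = LinearPolicy(hidden_dim=8, num_actions=len(ACTIONS))
chosen, mean_cost = decode_with_policy(hidden_states, policy)
```

In this toy setup a uniform schedule corresponds to returning the same entry of `ACTIONS` at every step, whereas the policy may return a different entry per token; comparing the two at equal `mean_cost` mirrors the fixed-budget comparison the page describes.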
