In-Context Algorithm Emulation in Fixed-Weight Transformers

Jerry Yao-Chieh Hu; Hude Liu; Jennifer Yuntong Zhang; Han Liu

arXiv:2508.17550·cs.LG·September 29, 2025

3 Reviews

TL;DR

This paper proves that fixed-weight Transformers can emulate a wide range of algorithms through in-context prompting, establishing a link between in-context learning and algorithmic universality in large language models.

Contribution

It introduces a formal framework showing fixed-weight Transformers can emulate various algorithms via prompts, without parameter updates, using only attention mechanisms.

Findings

01

Transformers can emulate algorithms like gradient descent and linear regression.

02

Prompt encoding enables universal algorithm emulation in fixed-weight models.

03

Numerical results support the theoretical claims of algorithmic emulation.

Abstract

We prove that a minimal Transformer with frozen weights emulates a broad class of algorithms by in-context prompting. We formalize two modes of in-context algorithm emulation. In the task-specific mode, for any continuous function $f : \mathbb{R} \to \mathbb{R}$, we show the existence of a single-head softmax attention layer whose forward pass reproduces functions of the form $f(w^\top x - y)$ to arbitrary precision. This general template subsumes many popular machine learning algorithms (e.g., gradient descent, linear regression, ridge regression). In the prompt-programmable mode, we prove universality: a single fixed-weight two-layer softmax attention module emulates all algorithms from the task-specific class (i.e., each implementable by a single softmax attention) via only prompting. Our key idea is to construct prompts that encode an algorithm's parameters into token…
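To give some intuition for the task-specific claim, the sketch below shows one way a single softmax attention head can approximate $f(w^\top x - y)$: it treats attention as a soft lookup over a grid of candidate values of $t = w^\top x - y$. This is an illustration under stated assumptions, not the construction used in the paper. The function name `attention_lookup`, the grid range, the number of keys, and the sharpness `beta` are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attention_lookup(t, f, lo=-3.0, hi=3.0, n_keys=601, beta=2000.0):
    """Approximate f(t) with one softmax attention head (illustrative only).

    Keys encode grid points g, values store f(g). The query [t, 1] gives
    scores 2*beta*g*t - beta*g**2, which equals -beta*(t - g)**2 up to a
    term that is constant across keys and cancels inside the softmax.
    """
    grid = np.linspace(lo, hi, n_keys)
    keys = np.stack([2.0 * beta * grid, -beta * grid**2], axis=1)  # (n_keys, 2)
    values = f(grid)                                               # (n_keys,)
    query = np.array([t, 1.0])
    weights = softmax(keys @ query)  # concentrates on grid points near t
    return float(weights @ values)

# Hypothetical parameters and data for the template f(w^T x - y).
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, -0.2])
y = 0.4
t = float(w @ x - y)

approx = attention_lookup(t, np.tanh)
exact = float(np.tanh(t))
```

In this sketch, a larger `beta` together with a finer grid sharpens the attention weights around the key closest to `t`, so the approximation error shrinks for continuous `f` on the chosen interval.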

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The distinction between task-specific and prompt-programmable emulation is a useful conceptual framing.
2. The coverage of residual update forms $f(w^\top x - y)x$ is broad and subsumes many standard learning procedures.
3. The extension from emulating a single-layer attention mechanism to emulating full (linear) networks is conceptually compelling.
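As a concrete reading of the residual update form $f(w^\top x - y)x$, the sketch below implements a generic update step and instantiates it with the identity function, which gives plain gradient descent on the mean squared error of a linear model. It is a minimal illustration, not code from the paper. The helper name `residual_update_step`, the learning rate, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def residual_update_step(w, X, y, f, lr):
    """One step of w <- w - lr * mean_i f(w^T x_i - y_i) * x_i.

    With f(r) = r this is gradient descent on (1 / 2n) * sum_i (w^T x_i - y_i)^2.
    """
    residuals = X @ w - y                                  # shape (n,)
    direction = (f(residuals)[:, None] * X).mean(axis=0)   # shape (d,)
    return w - lr * direction

# Synthetic, noiseless linear regression data (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
for _ in range(500):
    w = residual_update_step(w, X, y, lambda r: r, lr=0.1)
```

Swapping the `lambda r: r` argument for another continuous function of the residual changes the procedure while leaving the update template itself unchanged.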

Weaknesses

1. Most (if not all) of the results presented in this paper have already been proved in previous works, possibly using deeper—but still fixed—architectures. Results on linear regression have appeared in [1], [2], [3], and those on ridge regression in [1], [4] (using softmax attention instead of linear attention introduces only a small approximation error). In-context algorithm selection was also demonstrated in [1] and [4]—not explicitly, but implicitly through the design of pointer mechanisms

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Foundational Theoretical Value: The formalization of two emulation modes and universal approximation results establish a rigorous basis for viewing ICL as "in-context algorithm emulation," addressing open questions about fixed-weight Transformer flexibility.
- Interpretability: Unlike black-box ICL studies, the paper provides a clear mechanism (prompt-encoded parameters + softmax routing) for how frozen models execute algorithms, enabling future work on principled prompt engineering.
- Broad

Weaknesses

- Prompt Scalability: The prompt length grows linearly with the weight dimension of the target algorithm (Section 6, Limitations). This limits practicality for high-dimensional algorithms (e.g., deep neural network training), as prompts could become prohibitively long.
- Lack of Comparison to Prompt-Tuning: The paper focuses on hand-crafted prompts but does not compare to learned prompt-tuning methods (e.g., Lester et al., 2021). It is unclear how hand-crafted prompts perform relative to learne

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The paper is well written, and the intuition is explained in detail. The theoretical results are strong and extend several prior papers on the ICL capabilities of Transformer architectures.

Weaknesses

(1) The innovation over Hu et al. (2025) should be discussed more comprehensively. For the proof of Theorem 3.1, the ideas appear to follow Hu et al. (2025), albeit with substantial technical development. It would be beneficial to clarify which ideas/techniques are inherited from prior work and what are new here.
(2) The dimension of the linear layer is not stated explicitly in the theorems. Providing an explicit bound on this dimension (e.g., in Theorem 3.1, in terms of regularity conditions o
