On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery

Renpu Liu; Ruida Zhou; Cong Shen; Jing Yang

arXiv:2410.13981·cs.LG·November 18, 2025

Open Access 3 Reviews

TL;DR

This paper demonstrates that Transformers can perform learning-to-optimize algorithms for sparse recovery tasks, providing theoretical guarantees and showing advantages over traditional methods in generalization, convergence, and flexibility.

Contribution

It proves that Transformers can implement L2O algorithms with linear convergence and generalize across different measurement matrices and demonstration lengths.

Findings

01 · Transformers can perform L2O algorithms with provable linear convergence.

02 · Trained Transformers generalize across different measurement matrices.

03 · Transformers leverage structural information to accelerate convergence.

Abstract

An intriguing property of the Transformer is its ability to perform in-context learning (ICL), where the Transformer can solve different inference tasks without parameter updating based on the contextual information provided by the corresponding input-output demonstration pairs. It has been theoretically proved that ICL is enabled by the capability of Transformers to perform gradient-descent algorithms (Von Oswald et al., 2023a; Bai et al., 2024). This work takes a step further and shows that Transformers can perform learning-to-optimize (L2O) algorithms. Specifically, for the ICL sparse recovery (formulated as LASSO) tasks, we show that a K-layer Transformer can perform an L2O algorithm with a provable convergence rate linear in K. This provides a new perspective explaining the superior ICL capability of Transformers, even with only a few layers, which cannot be achieved by the…
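To make the "K layers as K iterations" idea concrete, the sketch below runs classical ISTA on a LASSO problem. This is the hand-designed baseline whose update rule LISTA-type L2O methods unroll and learn; it is not the paper's Transformer construction. The objective scaling, the step-size rule, the toy data, and every function name here are assumptions made for illustration.

```python
import numpy as np


def soft_threshold(v, tau):
    """Elementwise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def lasso_objective(X, y, beta, lam):
    """LASSO objective 0.5 * ||y - X beta||^2 + lam * ||beta||_1 (assumed scaling)."""
    return 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))


def ista(X, y, lam, num_iters):
    """Run num_iters ISTA steps starting from beta = 0."""
    d = X.shape[1]
    # Step size 1/L, with L the squared spectral norm of X (Lipschitz constant
    # of the gradient of the least-squares term).
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(d)
    for _ in range(num_iters):
        grad = X.T @ (X @ beta - y)  # gradient of the least-squares term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta


# Hypothetical toy instance: n noiseless measurements of an s-sparse vector.
rng = np.random.default_rng(0)
n, d, s = 50, 100, 5
X = rng.standard_normal((n, d)) / np.sqrt(n)
beta_true = np.zeros(d)
beta_true[rng.choice(d, size=s, replace=False)] = 1.0
y = X @ beta_true

beta_hat = ista(X, y, lam=0.01, num_iters=500)
print("objective:", lasso_objective(X, y, beta_hat, 0.01))
print("recovery error:", np.linalg.norm(beta_hat - beta_true))
```

Each loop iteration plays the role that one layer plays in an unrolled network. In an L2O variant, the fixed step size and threshold would be replaced by learned parameters, which is where faster convergence than plain ISTA can come from.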

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. **Strong Theoretical Contributions:** The paper presents a compelling theoretical claim addressing a crucial aspect of machine learning with significant potential for diverse applications. Moreover, the proof techniques introduced are versatile and can be extended to other domains beyond sparse vector recovery, such as compressed sensing and various inverse problems.
2. **Clarity and Rigor of Proofs:** The proofs are well-written and easy to follow, both in the main

Weaknesses

The hypothesis outlined in lines 62-65 appears somewhat disconnected from the application to sparse recovery. The connection between in-context learning (ICL) and sparse recovery is not clearly established, making it challenging to understand why this particular relationship is being explored.

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. The theoretical insights on the versatility of the transformer architecture are interesting and useful for a broader class of problems. For example, the fact that LISTA-type algorithms can be implemented using a transformer opens up the possibility of using transformers for building more powerful learned reconstruction operators for more general inverse problems.
2. The convergence rate results (for sparse estimation and prediction) place the proposed approach on a concrete theoretical foot

Weaknesses

1. The notion of ICL in the context of sparse recovery is not very well explained. I would suggest rewriting Sec. 4.1 with a better explanation of the setup (i.e., exactly what the model is trained on and what exactly is given as input to the model during inference).
2. The convergence result (Theorem 5.1) merely ascertains the existence of a set of parameters such that the recovery is accurate with high probability, and does not state anything about the convergence of a pre-trained transforme

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The manuscript is articulated with clarity and is accessible for comprehension. The proposed hypothesis presents a valuable avenue for in-depth analysis. Preliminary concepts are provided with comprehensive equations and definitions. Elaborate proofs are accessible in the appendix. The quality of English is commendable and articulated naturally.

Weaknesses

I recommended rejection at this stage because of the following shortcomings; I am willing to change my score after the rebuttal phase in light of the authors' responses. 1. The work is divided. The primary concept, proving that the Transformer has the L2O approximation capability in its forward process, is isolated from the main task, the sparse recovery problem. In this version, the paper tends to mash up the L2O approximation on Transformers and the sparse recovery problem. Essentially, each of them deserves to publi

Taxonomy

Topics: Industrial Vision Systems and Defect Detection

Methods: Attention Is All You Need · Dense Connections · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Linear Layer · Softmax · Multi-Head Attention · Dropout