LLM Program Optimization via Retrieval Augmented Search

Sagnik Anupam; Alexander Shypula; Osbert Bastani

arXiv:2501.18916·cs.LG·February 3, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces Retrieval Augmented Search (RAS), a blackbox method that adapts large language models to program optimization by running beam search over candidate optimizations and retrieving relevant slow-fast program pairs as in-context examples. It also proposes AEGIS, which improves interpretability by decomposing training examples into atomic edits. Both improve on prior blackbox approaches.

Contribution

The paper presents RAS, a retrieval-based blackbox adaptation method for LLM program optimization, and AEGIS, a technique that improves interpretability via atomic edits; both outperform prior approaches.

Findings

01

RAS outperforms previous blackbox strategies by 1.8×.

02

AEGIS improves interpretability via atomic edits and performs 1.37× better than prior approaches.

03

Retrieval based on LLM-generated natural language descriptions significantly outperforms retrieval based on source code.

Abstract

With the advent of large language models (LLMs), there has been a great deal of interest in applying them to solve difficult programming tasks. Recent work has demonstrated their potential at program optimization, a key challenge in programming languages research. We propose a blackbox adaptation method called Retrieval Augmented Search (RAS) that performs beam search over candidate optimizations; at each step, it retrieves in-context examples from a given training dataset of slow-fast program pairs to guide the LLM. Critically, we find that performing contextual retrieval based on an LLM-generated natural language description significantly outperforms retrieval based on the source code. In addition, we propose a method called AEGIS for improving interpretability by decomposing training examples into "atomic edits" that are significantly more incremental in nature. We show that RAS…
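The loop the abstract describes (beam search over candidate optimizations, with in-context examples retrieved by natural language description at each step) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`ras`, `retrieve`, `bow_cosine`), the bag-of-words similarity standing in for an embedding model, the `describe`, `propose`, `runtime`, and `is_correct` callables standing in for LLM calls and benchmarking, and all default parameters are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List, Tuple

# One training example: (description of the slow program, slow code, fast code).
Example = Tuple[str, str, str]


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over word counts (assumed stand-in for an embedding model)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def retrieve(description: str, train: List[Example], k: int) -> List[Example]:
    """Return the k training pairs whose descriptions best match `description`."""
    return sorted(train, key=lambda ex: bow_cosine(description, ex[0]), reverse=True)[:k]


def ras(
    program: str,
    train: List[Example],
    describe: Callable[[str], str],                        # hypothetical LLM call
    propose: Callable[[str, List[Example], int], List[str]],  # hypothetical LLM call
    runtime: Callable[[str], float],                       # hypothetical benchmark
    is_correct: Callable[[str], bool],                     # hypothetical test harness
    beam_width: int = 2,
    steps: int = 3,
    k: int = 2,
    samples: int = 2,
) -> str:
    """Beam search over candidate programs, retrieving examples by description."""
    beam = [program]
    for _ in range(steps):
        candidates = list(beam)  # keep current programs so the beam never regresses
        for p in beam:
            examples = retrieve(describe(p), train, k)
            candidates.extend(c for c in propose(p, examples, samples) if is_correct(c))
        candidates = list(dict.fromkeys(candidates))  # drop duplicates, keep order
        beam = sorted(candidates, key=runtime)[:beam_width]
    return beam[0]


# Toy stand-ins so the sketch runs without an LLM: programs are token strings.
train = [
    ("sums a list with an explicit loop", "loop_sum", "builtin_sum"),
    ("sorts a list with bubble sort", "bubble_sort", "builtin_sort"),
]


def toy_describe(p: str) -> str:
    parts = [desc for desc, slow, _ in train if slow in p]
    return " ".join(parts)


def toy_propose(p: str, examples: List[Example], samples: int) -> List[str]:
    # Apply each retrieved slow-to-fast rewrite to the program.
    return [p.replace(slow, fast) for _, slow, fast in examples][:samples]


def toy_runtime(p: str) -> float:
    return 1 + 10 * p.count("loop_sum") + 20 * p.count("bubble_sort")


best = ras("loop_sum bubble_sort", train, toy_describe, toy_propose,
           toy_runtime, is_correct=lambda p: True)
print(best)
```

In this sketch each search step issues one description call plus up to `samples` proposal calls per beam entry, so the number of LLM calls grows with both the beam width and the number of steps.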

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- **Clear Baselines and Ablations**: Comparisons against state-of-the-art dynamic retrieval, "Instruct Only," and "No Contextual" ablations (using source code retrieval instead of contextual retrieval) effectively isolate the impact of core innovations. For example, RAS achieves an 8.61× speedup on PIE (2.04× better than dynamic retrieval), while the "No Contextual" variant's 3.63× speedup confirms that contextual retrieval is essential.
- **Cross-Language Generalizability**: Evaluations on PIE (C+

Weaknesses

- **Effectiveness of Atomic Operation Decomposition in AEGIS:** The paper proposes AEGIS, which decomposes atomic operations in the training dataset. However, experimental results show that AEGIS underperforms RAS. In the "No Contextual" setting, AEGIS's optimization effect falls below Dynamic Retrieval. This raises questions about whether atomic operation decomposition effectively improves optimization performance. While atomic operations enhance interpretability, the paper lacks statistical da

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- The core idea of combining iterative search with contextual retrieval is intuitive and well-motivated.
- The paper demonstrates substantial improvements over baseline methods, achieving 2× better speedup compared to dynamic retrieval on the PIE benchmark, with consistent gains on the Mercury benchmark as well.
- AEGIS's attempt to improve interpretability through atomic edits is an interesting method, even though the performance trade-off is not ideal.

Weaknesses

**1. Writing and presentation issues**

The paper's organization and clarity need improvement. For example:

- In the "Problem formulation" section (line 152), "Problem formulation. In the program optimization problem, the goal is to take a program $p \in P$ as input, and output an optimized program $p' \in P$ that is semantically equivalent to $p$." spends an entire paragraph discussing semantic equivalence rather than defining the "optimization" (runtime? memory? how is it measured?).
- Line 86

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The proposed RAS is simple yet effective.
2. The proposed RAS shows significant improvements on the benchmark, even surpassing human baselines.

Weaknesses

1. Although the proposed method RAS improves the optimization performance, it requires too many LLM calls ($m \times k$), which is costly in practice.
2. The authors prove that contextual retrieval can outperform code retrieval. However, it seems counterintuitive because the raw code should naturally reflect more details than code descriptions. I wonder if it is just because the embedding model used in the paper cannot handle the code input well.
3. RAS in fact is different from beam search as

Reviewer 04 · Rating 4 · Confidence 4

Strengths

+ Clear problem framing for black-box adaptation to performance optimization.
+ The proposed method is rational, combining contextual retrieval and iterative search to improve the efficiency of the generated code.
+ Experiments show that AEGIS improves edit granularity and interpretability.

Weaknesses

- Comparisons center on closely related prompt-engineering variants (dynamic retrieval, no-contextual retrieval, instruct-only). Missing are stronger and more diverse baselines: e.g., white-box adaptation (e.g., fine-tuning), stronger compiler optimization pipelines, or more recent LLM-based code optimization methods. As a result, the superiority of RAS may mostly reflect an advantage within a narrow family of RAG methods rather than against the broader state of the art.
- The paper does not rep

Taxonomy

Topics: Distributed and Parallel Computing Systems · Educational Technology and Assessment · Cloud Computing and Resource Management