Archon: An Architecture Search Framework for Inference-Time Techniques

Jon Saad-Falcon; Adrian Gamarra Lafuente; Shlok Natarajan; Nahum Maru; Hristo Todorov; Etash Guha; E. Kelly Buchanan; Mayee Chen; Neel Guha; Christopher Ré; Azalia Mirhoseini

arXiv:2409.15254·cs.LG·June 12, 2025

TL;DR

Archon is a modular framework that automates the optimization of inference-time techniques and LLM configurations, significantly improving model performance across various tasks within given compute budgets.

Contribution

The paper introduces Archon, an automated system for selecting and combining inference-time techniques and LLMs to optimize accuracy and efficiency.

Findings

01

Archon outperforms top models like GPT-4o and Claude 3.5 Sonnet by 15.1% on average.

02

It effectively explores large design spaces to tailor configurations for specific benchmarks.

03

The framework enhances the Pareto frontier of accuracy versus token budget.
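
To make the third finding concrete, here is a minimal sketch of how a Pareto frontier of accuracy versus token budget can be computed from a set of evaluated configurations. The function name `pareto_frontier` and the sample numbers are illustrative assumptions, not code or results from the paper. A configuration is on the frontier if no other configuration is at least as cheap and at least as accurate while being strictly better on one of the two.

```python
from typing import List, Tuple

# A point is (token_budget, accuracy). Lower tokens and higher accuracy are better.
Point = Tuple[float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that are not dominated by any other point."""
    # Sort by token budget ascending; among equal budgets, best accuracy first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    frontier: List[Point] = []
    best_accuracy = float("-inf")
    for tokens, accuracy in ordered:
        # Keep a point only if it beats every cheaper (or equally cheap) point.
        if accuracy > best_accuracy:
            frontier.append((tokens, accuracy))
            best_accuracy = accuracy
    return frontier


# Made-up example values, for illustration only.
example_points = [
    (1000, 0.60),
    (2000, 0.58),
    (4000, 0.71),
    (4000, 0.65),
    (8000, 0.70),
]
example_frontier = pareto_frontier(example_points)
```

In this toy data, only the 1000-token and the better 4000-token configurations survive. A system "enhances" the frontier when its configurations add points above or to the left of the existing ones.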

Abstract

Inference-time techniques, such as repeated sampling or iterative revisions, are emerging as powerful ways to enhance large language models (LLMs) at test time. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of each technique across models and tasks, the interactions between them, and the massive search space for combining them. To address these challenges, we introduce Archon, a modular and automated framework for optimizing the process of selecting and combining inference-time techniques and LLMs. Given a compute budget and a set of available LLMs, Archon explores a large design space to discover optimized configurations tailored to target benchmarks. It can design custom or general-purpose architectures that advance the Pareto frontier of accuracy vs. maximum token budget compared to…
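
The idea of searching a design space under a compute budget can be sketched as follows. This is a toy illustration only: it uses exhaustive grid search, which is swapped in for simplicity because this excerpt does not describe Archon's actual search algorithm. The configuration keys (`samples`, `revisions`), the cost model `estimated_calls`, and the scoring function `toy_score` are all hypothetical and are not part of the Archon codebase.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, int]


def estimated_calls(config: Config) -> int:
    """Hypothetical cost model: each sample is generated once, then revised."""
    return config["samples"] * (1 + config["revisions"])


def search(
    space: Dict[str, List[int]],
    max_calls: int,
    score: Callable[[Config], float],
) -> Optional[Tuple[Config, float]]:
    """Grid-search the space and return the best config within the budget."""
    keys = list(space)
    best: Optional[Tuple[Config, float]] = None
    for values in product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        if estimated_calls(config) > max_calls:
            continue  # over budget, skip without evaluating
        value = score(config)
        if best is None or value > best[1]:
            best = (config, value)
    return best


def toy_score(config: Config) -> float:
    """Made-up stand-in for benchmark accuracy; not measured data."""
    return 0.5 + 0.04 * config["samples"] ** 0.5 + 0.03 * config["revisions"]


toy_space = {"samples": [1, 4, 16], "revisions": [0, 1, 2]}
best_config, best_score = search(toy_space, max_calls=40, score=toy_score)
```

In practice the scoring function would run the candidate configuration on a held-out slice of the target benchmark, which is expensive, so a real search would likely need something more sample-efficient than a full grid.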

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

1. The results seem to be generally strong. Even if it requires many calls to the models to yield the final result, the reported performance numbers on a variety of benchmarks are impressive, often matching or exceeding state-of-the-art performance.
2. The authors produce an open-source framework so that others can reproduce and build on the results.

Weaknesses

1. There are some weird formatting issues with the paper. All the characters seem to be compressed compared to other submissions I am reviewing. But also, the bottom margins are substantially enlarged.
2. The comparisons often don't seem to be very fair and baselines are lacking. We should match the FLOP/token/dollar budgets between different methods to see which methods are most effective at turning additional compute into performance. The results are often confounded by comparing across subst

Reviewer 02 · Rating 5 · Confidence 4

Strengths

The paper provides an insightful summary of existing inference-time techniques in the LLM field, distilling their concepts into well-structured building blocks that establish a robust paradigm for constructing LLM systems. The proposed framework, ARCHON, demonstrates significant performance improvements on common benchmarks, highlighting its effectiveness. The results from the LLM component interaction experiments are also valuable in real-world practice. Additionally, the presentation is genera

Weaknesses

While this work has involved substantial effort, I recommend rejecting the paper for the following reasons: (1) The paper offers few novel insights into inference-time techniques, and the main conclusions drawn from the experiments are rather trivial, relying primarily on parameter tuning without theoretical justification; (2) The experimental setup is neither practical nor comprehensive enough to fully demonstrate the interaction mechanisms of mentioned methods; (3) The proposed framework is of

Reviewer 03 · Rating 6 · Confidence 3

Strengths

+ The approach performs well, both in the closed/open model settings, and can produce combinations of open models that reach close to closed model performance
+ The approach is relatively inexpensive, using approximately 40x inference calls on a single query (though some opinions may differ on whether this is inexpensive)
+ The resulting architectures found in the appendix are relatively simple, and scaling is easy for certain components

Weaknesses

+ The proposed approach is not quite comparable to the baselines in terms of compute. While ARCHON uses 30+ inference calls, other LLM systems are given just one. It would be fairer to use any single building block but scaled up (if possible) to having approximately the same inference cost, to demonstrate that the combination is truly driving the performance gains as opposed to the number of inference calls. If performance was better even when giving the baselines similar compute, this could help da

Code & Models

Repositories

scalingintelligence/archon
Official

Taxonomy

Topics: Advanced Database Systems and Queries · Semantic Web and Ontologies

Methods: Sparse Evolutionary Training