Archon: An Architecture Search Framework for Inference-Time Techniques
Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, Etash Guha, E. Kelly Buchanan, Mayee Chen, Neel Guha, Christopher Ré, Azalia Mirhoseini

TL;DR
Archon is a modular framework that automates the optimization of inference-time techniques and LLM configurations, significantly improving model performance across various tasks within given compute budgets.
Contribution
It introduces Archon, a novel automated system for selecting and combining inference-time techniques and LLMs to optimize accuracy and efficiency.
Findings
Archon outperforms top models like GPT-4o and Claude 3.5 Sonnet by 15.1% on average.
It effectively explores large design spaces to tailor configurations for specific benchmarks.
The framework enhances the Pareto frontier of accuracy versus token budget.
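To make the "Pareto frontier of accuracy versus token budget" concrete, here is a minimal sketch of how such a frontier can be computed from measured configurations. The function name `pareto_frontier` and the (tokens, accuracy) values are illustrative assumptions, not code or numbers from the paper.

```python
from typing import List, Tuple


def pareto_frontier(points: List[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Return the (token_budget, accuracy) points not dominated by any other.

    A point is dominated if some other point uses no more tokens and
    reaches at least the same accuracy, with a strict gain in one of the two.
    """
    # Sort by tokens ascending; on ties, put the higher accuracy first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    frontier: List[Tuple[int, float]] = []
    best_acc = float("-inf")
    for tokens, acc in ordered:
        # Keep a point only if it beats every cheaper (or equal-cost) point.
        if acc > best_acc:
            frontier.append((tokens, acc))
            best_acc = acc
    return frontier


# Hypothetical measurements (max tokens per query, accuracy); not from the paper.
configs = [
    (1000, 0.60), (2000, 0.58), (4000, 0.71),
    (4000, 0.69), (8000, 0.70), (16000, 0.80),
]
frontier = pareto_frontier(configs)
print(frontier)
```

A system "advances" the frontier when its points sit above the frontier of the baselines at the same or lower token budget.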
Abstract
Inference-time techniques, such as repeated sampling or iterative revisions, are emerging as powerful ways to enhance large language models (LLMs) at test time. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of each technique across models and tasks, the interactions between them, and the massive search space for combining them. To address these challenges, we introduce Archon, a modular and automated framework for optimizing the process of selecting and combining inference-time techniques and LLMs. Given a compute budget and a set of available LLMs, Archon explores a large design space to discover optimized configurations tailored to target benchmarks. It can design custom or general-purpose architectures that advance the Pareto frontier of accuracy vs. maximum token budget compared to…
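The idea of searching a design space under a compute budget can be illustrated with a toy sketch. This is not Archon's search procedure: it swaps in plain exhaustive enumeration, and the `Config` fields, the cost model in `inference_calls`, the model names, and the scoring function `toy_evaluate` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Config:
    model: str          # hypothetical model name
    num_samples: int    # repeated sampling: candidates generated per query
    num_revisions: int  # iterative revision rounds applied afterwards


def inference_calls(cfg: Config) -> int:
    # Assumed cost model: one call per sample plus one call per revision round.
    return cfg.num_samples + cfg.num_revisions


def search(
    models: Sequence[str],
    sample_options: Sequence[int],
    revision_options: Sequence[int],
    max_calls: int,
    evaluate: Callable[[Config], float],
) -> Optional[Tuple[Config, float]]:
    """Exhaustively score every configuration that fits within the budget."""
    best: Optional[Tuple[Config, float]] = None
    for model, n, r in product(models, sample_options, revision_options):
        cfg = Config(model, n, r)
        if inference_calls(cfg) > max_calls:
            continue  # over budget, skip without evaluating
        score = evaluate(cfg)
        if best is None or score > best[1]:
            best = (cfg, score)
    return best


def toy_evaluate(cfg: Config) -> float:
    # Stand-in for running a benchmark; the numbers are made up.
    base = {"model-a": 0.50, "model-b": 0.55}[cfg.model]
    return base + 0.02 * cfg.num_samples + 0.03 * cfg.num_revisions


result = search(["model-a", "model-b"], [1, 2, 4], [0, 1, 2],
                max_calls=5, evaluate=toy_evaluate)
best_cfg, best_score = result
print(best_cfg, round(best_score, 2))
```

Exhaustive enumeration only works for tiny spaces like this one; the abstract describes the real design space as large, which is why an automated search is needed.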
Peer Reviews
Decision·Submitted to ICLR 2025
1. The results seem to be generally strong. Even if it requires many calls to the models to yield the final result, the reported performance numbers on a variety of benchmarks are impressive, often matching or exceeding state of the art performance. 2. The authors produce an open sourced framework so that others can reproduce and build on the results.
1. There are some weird formatting issues with the paper. All the characters seem to be compressed compared to other submissions I am reviewing. But also, the bottom margins are substantially enlarged. 2. The comparisons often don't seem to be very fair and baselines are lacking. We should match the FLOP/token/dollar budgets between different methods to see which methods are most effective at turning additional compute into performance. The results are often confounded by comparing across subst
The paper provides an insightful summary of existing inference-time techniques in the LLM field, distilling their concepts into well-structured building blocks that establish a robust paradigm for constructing LLM systems. The proposed framework, ARCHON, demonstrates significant performance improvements on common benchmarks, highlighting its effectiveness. The results from the LLM component interaction experiments are also valuable in real-world practice. Additionally, the presentation is genera
While this work has involved substantial effort, I recommend rejecting the paper for the following reasons: (1) The paper offers few novel insights into inference-time techniques, and the main conclusions drawn from the experiments are rather trivial, relying primarily on parameter tuning without theoretical justification; (2) The experimental setup is neither practical nor comprehensive enough to fully demonstrate the interaction mechanisms of mentioned methods; (3) The proposed framework is of
+ The approach performs well, both in the closed/open model settings, and can produce combinations of open models that reach close to closed model performance + The approach is relatively inexpensive, using approximately 40x inference calls on a single query (though some opinions may differ on whether this is inexpensive) + The resulting architectures found in the appendix are relatively simple, and scaling is easy for certain components
+ The proposed approach is not quite comparable to the baselines in terms of compute. While ARCHON uses 30+ inference calls, other LLM systems are given just one. It would be fairer to use any single building block but scaled up (if possible) to having approximately the same inference cost, to demonstrate that the combination is truly driving the performance gains as opposed to the # of inference calls. If performance was better even when giving the baselines similar compute, this could help da
Taxonomy
Topics: Advanced Database Systems and Queries · Semantic Web and Ontologies
Methods: Sparse Evolutionary Training
