Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards

Shirley Wu; Parth Sarthi; Shiyu Zhao; Aaron Lee; Herumb Shandilya; Adrian Mladenic Grobelnik; Nurendra Choudhary; Eddie Huang; Karthik Subbian; Linjun Zhang; Diyi Yang; James Zou; Jure Leskovec

arXiv:2507.03041·cs.LG·February 10, 2026

Open Access·3 Reviews

TL;DR

Optimas introduces a unified framework for optimizing complex compound AI systems by aligning local rewards with global performance, enabling effective independent component updates and outperforming existing methods.

Contribution

The paper proposes a novel approach that maintains local reward functions for each component, ensuring their alignment with overall system performance, which improves optimization of heterogeneous compound AI systems.

Findings

1. Optimas achieves an average performance improvement of 11.92% across five real-world systems.
2. The framework enables independent updates of diverse system components while maintaining global alignment.
3. Extensive evaluations demonstrate Optimas's effectiveness over strong baseline methods.

Abstract

Compound AI systems integrating multiple components, such as Large Language Models, specialized tools, and traditional machine learning models, are increasingly deployed to solve complex real-world tasks. However, optimizing compound systems remains challenging due to their non-differentiable structures and diverse configuration types across components, including prompts, hyperparameters, and model parameters. To address this challenge, we propose Optimas, a unified framework for effective optimization of compound systems. The core idea of Optimas is to maintain one Local Reward Function (LRF) per component, each satisfying a local-global alignment property, i.e., each component's local reward correlates with the global system performance. In each iteration, Optimas efficiently adapts the LRFs to maintain this property while simultaneously maximizing each component's local reward. This…
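The local-global alignment property described in the abstract can be illustrated with a small check. The sketch below is not the paper's implementation. It assumes that, for a single component, a few candidate outputs have already been scored both by a local reward function and by an estimate of the global system reward, and it uses pairwise rank agreement as a simple stand-in for "correlates with the global system performance". The function names `preference_pairs` and `pairwise_alignment` are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def preference_pairs(global_rewards: Sequence[float]) -> List[Tuple[int, int]]:
    """Return (preferred, rejected) index pairs ordered by global reward.

    Pairs with equal global reward carry no preference and are skipped.
    """
    pairs = []
    for i, j in combinations(range(len(global_rewards)), 2):
        if global_rewards[i] > global_rewards[j]:
            pairs.append((i, j))
        elif global_rewards[j] > global_rewards[i]:
            pairs.append((j, i))
    return pairs


def pairwise_alignment(
    local_scores: Sequence[float], global_rewards: Sequence[float]
) -> float:
    """Fraction of preference pairs that the local scores rank the same way.

    This is an illustrative proxy for alignment (an assumption of this
    sketch), not the criterion used in the paper.
    """
    if len(local_scores) != len(global_rewards):
        raise ValueError("local_scores and global_rewards must have equal length")
    pairs = preference_pairs(global_rewards)
    if not pairs:
        return 0.0  # no informative pairs to compare
    agree = sum(1 for win, lose in pairs if local_scores[win] > local_scores[lose])
    return agree / len(pairs)


if __name__ == "__main__":
    # Hypothetical scores for three candidate outputs of one component.
    local = [0.1, 0.9, 0.5]
    global_est = [0.2, 0.4, 0.8]
    print(preference_pairs(global_est))
    print(pairwise_alignment(local, global_est))
```

Under this proxy, a score of 1.0 means the local reward orders every informative pair of outputs the same way as the global reward estimate, so improving the component against its local reward should not conflict with the system-level objective on those samples.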

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 8·Confidence 3

Strengths

1. The presentation is clear and well-organized.
2. The addressed problem, optimizing a compound AI system that could have heterogeneous configurations as a whole rather than its parts independently, is timely and practically important.
3. Theoretical analysis under some assumptions provides partial guarantees for the effectiveness of the method.
4. Empirical evaluation is comprehensive and convincingly supports the method’s effectiveness and mechanism.

Weaknesses

There appear to be some problems in Theorem 4.1 and Lemma B.1. Specifically, Equation 4 defines a loss with a leading minus sign, so Theorem 4.1 should refer to the **minimizer** of Equation 4, not the **maximizer**. Similarly, in Lemma B.1 the optimization should be formulated as an **argmax** of the expected log-likelihood. The proof should remain conceptually sound after correcting these signs.

Minor typo issues:
- Line 94: “optimized” → “optimize”
- Line 119: add comma before “While”
- Line

Reviewer 02·Rating 6·Confidence 2

Strengths

1. The paper clearly motivates why compound AI systems are hard (non-differentiable, heterogeneous knobs) and frames a practical local–global alignment objective.
2. The paper unifies optimization across prompts, hyperparameters, model selection, and model parameters within one iterative loop.
3. The paper offers interpretability hooks, such as probing LRF preferences (e.g., brevity bias matching F1), to explain why updates help.

Weaknesses

While OPTIMAS reduces the number of full system runs, it introduces non-trivial computational overhead from (i) training and adapting the shared 8B-backbone Local Reward Functions (via LoRA) and (ii) generating preference labels that require downstream sampling to estimate expected global rewards. In settings that enable parameter fine-tuning (e.g., PPO for some modules), total token and wall-clock costs may exceed prompt-only baselines. The paper would be strengthened by a cost breakdown (token

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The presentation of the paper is good. I could quickly understand what their method is about.
2. Their method makes sense from an intuitive point of view: since we cannot do end-to-end learning of the system (perhaps due to non-differentiable components), we can slowly adjust each component so that it improves the overall system (by looking at how each local step affects the system output). Even so, there are some flaws in this approach (see weaknesses).
3. Experimental results are compre

Weaknesses

1. The method is intuitive. However, it has several flaws. The biggest flaw is that it seems to be performing some form of local parameter updates with respect to a non-differentiable reward signal. As far as I know, this has the same flavour as popular gradient-free training such as REINFORCE. Why didn't the authors mention such approaches or use them as baselines? (I'm quite sure they could be used as baselines.)
2. In addition, what makes this method slightly worse than prior approaches such as

Taxonomy

Topics: Machine Learning and Data Classification