Power and Limitations of Aggregation in Compound AI Systems

Nivasini Ananthakrishnan; Meena Jagadeesan

arXiv:2602.21556·cs.AI·February 26, 2026

Open Access · 4 Reviews

TL;DR

This paper analyzes how aggregation in compound AI systems can expand the range of outputs beyond individual models, identifying mechanisms and limitations through theoretical and empirical methods.

Contribution

It introduces a formal framework that characterizes when aggregation expands the set of elicitable outputs, and provides necessary and sufficient conditions for such expansion.

Findings

01

Aggregation can expand output space via three mechanisms.

02

Necessary and sufficient conditions for elicitability expansion are established.

03

Empirical illustration with LLMs demonstrates theoretical insights.

Abstract

When designing compound AI systems, a common approach is to query multiple copies of the same model and aggregate the responses to produce a synthesized output. Given the homogeneity of these models, this raises the question of whether aggregation unlocks access to a greater set of outputs than querying a single model. In this work, we investigate the power and limitations of aggregation within a stylized principal-agent framework. This framework models how the system designer can partially steer each agent's output through its reward function specification, but still faces limitations due to prompt engineering ability and model capabilities. Our analysis uncovers three natural mechanisms -- feasibility expansion, support expansion, and binding set contraction -- through which aggregation expands the set of outputs that are elicitable by the system designer. We prove that any…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The paper addresses a current and growing trend: leveraging multiple (often identical) copies of LLMs in a system to increase power, robustness, or reliability, which is relevant in both research and industry.
- Provides clear formalizations and mathematical analysis, precisely characterizing when aggregation of model outputs can or cannot expand the set of desired system-level outputs.
- Identifies three distinct mechanisms (feasibility expansion, support expansion, binding set contraction) a

Weaknesses

- The principal-agent model of Kleinberg et al., which was developed for classification settings (such as job applicants and employers), assumes that agents are strategic (with natural game-theoretic incentives, e.g., job applicants maximizing personal outcome). In contrast, LLM “agents” in this paper are neither strategic nor adversarial; they are collaborative and operate according to deterministic/reward-guided rules specified by a system designer. This difference is significant: the economic intuition m

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The paper presents a mathematical treatment of compound AI aggregation under a principal–agent model. The theoretical derivations are strong and have a clear logical structure. The proofs look rigorous and complete. The paper provides a theoretical basis for analyzing emergent properties of LLM ensembles and multi-agent systems. It also provides conceptual tools to assess the limits of prompt ensembling.

Weaknesses

The presentation of the framework is largely theoretical, and it is not clear to me how directly the results map to practical LLM systems with complex interactions. The paper could benefit from experimental validation. The mathematical exposition looks precise but may not be accessible to all readers. The paper could also benefit from more geometric intuition for the three mechanisms.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Important question: Understanding the theoretical limits of compound AI systems is timely and conceptually valuable, given the growing use of ensembles, debate, and multi-agent LLM systems.
2. Theoretical clarity: The framework is clean, extending Kleinberg et al. (2019) to include multiple agents and conic feasibility constraints. The three-mechanism taxonomy is intuitive and could guide future empirical work on ensemble design.
3. Useful characterizations: The necessary-and-sufficient condi

Weaknesses

1. Limited novelty of insight: While the formalization is careful, most conclusions reiterate intuitive ideas (“aggregation helps only if it changes feasibility or coverage”). The paper extends prior principal–agent analyses rather than offering genuinely new conceptual understanding of compound AI.
2. Unrealistic assumptions: The theory treats models as deterministic optimizers of smooth reward functions, with linear or conic capability constraints and monotone concave rewards. These abstractio

Reviewer 04 · Rating 0 · Confidence 2

Strengths

- **Ambition and clarity of intent.** The paper aims to address an important conceptual question: *why* multi-sample aggregation often improves large language model outputs. In doing so, it attempts to explain test-time compute scaling, a known and powerful empirical LLM phenomenon.
- **Formal framing.** The authors present clear mathematical definitions and formal statements. The structure (definitions, theorems, proofs) is coherent.
- **Topic relevance.** Aggregation and inference-time

Weaknesses

This paper has many problems and reflects bad research taste.

**1. Implausible assumptions and circular abstractions.** The biggest hindrance to one's comprehension of the paper is that the core modeling assumptions are not defensible:

- Each user prompt is assumed to encode a *hidden reward function* that the model optimizes. This is unrealistic both for human communication and for current LM architectures. Tell me: what is the reward function that I am implicitly implying to you when

Taxonomy

Topics: Ethics and Social Impacts of AI · Reinforcement Learning in Robotics · Artificial Intelligence in Games