An Information Theoretic Perspective on Agentic System Design
Shizhe He, Avanika Narayan, Ishan S. Khare, Scott W. Linderman, Christopher Ré, Dan Biderman

TL;DR
This paper applies information theory to analyze and optimize agentic language model systems, demonstrating that larger compressors improve performance and efficiency, guiding better design choices across datasets and models.
Contribution
It introduces an information-theoretic framework using mutual information to evaluate compressor quality, providing empirical evidence that larger compressors enhance accuracy and efficiency.
Findings
Larger compressors convey more mutual information per token.
Scaling compressors yields greater performance gains than scaling predictors.
Local compressors can achieve near-frontier accuracy at reduced costs.
Abstract
Agentic language model (LM) systems power modern applications like "Deep Research" and "Claude Code," and leverage multi-LM architectures to overcome context limitations. Beneath their apparent diversity lies a recurring pattern: smaller "compressor" LMs (that can even run locally) distill raw context into compact text that is then consumed by larger "predictor" LMs. Despite their popularity, the design of compressor-predictor systems remains largely ad hoc, with little guidance on how compressor and predictor choices shape downstream performance. In practice, attributing gains to compression versus prediction requires costly, task-specific pairwise sweeps. We argue that these agentic system design questions are, at root, information-theoretic. Viewing the compressor LM as a noisy channel, we introduce a simple estimator of mutual information between the context and its compression to…
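The abstract is cut off before it describes the estimator, so the following is only a minimal sketch of one standard way to estimate mutual information between contexts and their compressions from log-probabilities, not necessarily the authors' method. It assumes a matrix `logp` where `logp[i][j]` is the log-probability (in nats) that the compressor assigns to summary `z_i` given context `x_j`, with `z_i` sampled from context `x_i`. The function names and the `bits_per_token` helper are hypothetical, introduced here for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def estimate_mutual_information(logp):
    """Monte Carlo estimate of I(X; Z) in nats.

    Assumption: logp[i][j] = log p(z_i | x_j), where z_i was sampled
    from the compressor conditioned on x_i. The marginal p(z_i) is
    approximated by averaging p(z_i | x_j) over the N sampled contexts,
    so the estimate cannot exceed log(N).
    """
    n = len(logp)
    total = 0.0
    for i in range(n):
        log_marginal = logsumexp(logp[i]) - math.log(n)
        total += logp[i][i] - log_marginal
    return total / n


def bits_per_token(mi_nats, mean_summary_tokens):
    """Convert an MI estimate to bits and normalize by mean summary length."""
    return mi_nats / math.log(2) / mean_summary_tokens


if __name__ == "__main__":
    # Toy case: each summary is only explainable by its own context.
    neg_inf = float("-inf")
    logp = [[0.0 if i == j else neg_inf for j in range(4)] for i in range(4)]
    mi = estimate_mutual_information(logp)
    print(mi, bits_per_token(mi, mean_summary_tokens=2.0))
```

In practice the entries of `logp` would come from scoring each summary against each context with an inference server that returns token log-probabilities. Because the estimate is capped at log(N), the number of sampled contexts limits how much information it can register.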
Peer Reviews
Decision·ICLR 2026 Poster
Applying MI and rate–distortion theory to compressor–predictor pipelines is interesting and provides a principled view on design trade-offs. The MI estimator is implementable on real inference stacks and could directly help practitioners evaluate compressors without full end-to-end sweeps. The four distilled principles are easy to interpret and potentially impactful in industry.
- The information-theoretic analysis stops at empirical correlation and heuristic exponential fitting. No formal connection is established between MI and downstream accuracy beyond observed correlation.
- No ablation quantifying the error or variance of the MI estimator.
- Experiments focus on one-shot, GPT-style instruction models; no evidence for robustness to multi-turn reasoning agents, tool-using agents, or MoE architectures.
- Some datasets or QA pairs are synthetic (GPT-generated), which m
1. The mutual information estimator offers a practical tool for evaluating compression without requiring downstream tasks. It can be computed efficiently on modern inference servers. This method provides insights comparable to perplexity for predictors.
2. Rate-distortion analysis establishes strong correlations between information rate and task performance. It serves as a reliable proxy for system efficacy. The framework guides optimization of communication in agentic designs.
3. Empirical find
1. Reliance on proxy models for mutual information estimation in smaller LMs introduces potential biases. The approximation may not fully capture information dynamics. This could compromise the accuracy of task-agnostic evaluations.
2. Restriction to non-reasoning GPT-style models limits the study's scope. Reasoning tokens require separate analysis not covered here. The postponement leaves gaps in applying findings to advanced agentic architectures.
3. Use of subsampled or synthetic datasets
1. Empirical results demonstrate clear scaling laws favoring larger compressors. Larger models produce more concise yet informative summaries, leading to sublinear compute cost increases. These findings offer practical guidance for optimizing agentic systems in resource-constrained environments.
2. The rate-distortion analysis reveals strong correlations between information rate and accuracy. Bit efficiency metrics predict performance with high fidelity, as shown by R-squared values up to 0.71.
1. The mutual information estimator relies on proxy models for smaller LMs. This introduces potential biases in log-probability evaluations. Such approximations may not fully capture the true information content.
2. Analysis is limited to non-reasoning GPT-style models. This restricts generalizability to reasoning-augmented or multi-turn agentic systems. Future work is needed to extend to more complex architectures.
3. The framework assumes single-round communication. This overlooks iterative
Taxonomy
Topics: Big Data and Digital Economy · Natural Language Processing Techniques · Ferroelectric and Negative Capacitance Devices
