Inference-Time Distillation: Cost-Efficient Agents Without Fine-Tuning or Manual Prompt Engineering
Vishnu Sarukkai, Asanshay Gupta, James Hong, Michaël Gharbi, Kayvon Fatahalian

TL;DR
This paper introduces a cost-efficient inference-time distillation method for large language model agents. It preserves agility (the ability to iterate rapidly) by avoiding fine-tuning and manual prompt engineering, relying instead on retrieval of teacher demonstrations and fallback strategies.
Contribution
It shows that two established inference-time techniques, retrieval-based dynamic in-context learning and self-consistency cascades, can be combined into an inference-time distillation approach that reduces cost while preserving accuracy, without retraining or prompt engineering.
Findings
Achieves a 2.5x cost reduction on ALFWorld while maintaining accuracy
Attains a 3.5x cost reduction on AppWorld while recovering 79% of teacher accuracy
Provides empirical guidance on key design choices for cost-performance tradeoffs
Abstract
Deploying LLM agents at scale typically requires choosing between quality and cost. Existing cost-reduction approaches fail to preserve agility: the ability to iterate rapidly without human time bottlenecks. Prompt engineering is brittle and slows iteration, while fine-tuning requires multi-day training and commitment to fixed designs; both are impractical for iterative workflows and time-sensitive batch jobs. We demonstrate that established inference-time techniques (dynamic in-context learning and self-consistency cascades) can be leveraged to shift the cost-accuracy Pareto frontier while preserving agility. Practitioners run the teacher on a small task subset to collect demonstrations, then immediately deploy a cheaper student on the remainder. At each step, the system retrieves relevant teacher demonstrations as in-context examples. When multiple student samples agree, we proceed;…
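The per-step loop described in the abstract (retrieve teacher demonstrations, sample the student several times, proceed on agreement) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `retrieve` and `cascade_step`, the word-overlap similarity, the unanimous-agreement rule, and the sample count are all hypothetical. The abstract is cut off before it says what happens on disagreement; the sketch assumes a fallback to the teacher, in line with the "fallback strategies" mentioned in the TL;DR.

```python
from collections import Counter
from typing import Callable, List, Tuple

# A demonstration pairs an observation with the action the teacher took.
Demo = Tuple[str, str]


def retrieve(demos: List[Demo], observation: str, k: int = 2) -> List[Demo]:
    """Return the k demos most similar to the observation.

    Similarity here is plain word overlap, a stand-in (assumption) for
    whatever retrieval method a real system would use.
    """
    query = set(observation.lower().split())

    def overlap(demo: Demo) -> int:
        return len(query & set(demo[0].lower().split()))

    return sorted(demos, key=overlap, reverse=True)[:k]


def cascade_step(
    observation: str,
    demos: List[Demo],
    student: Callable[[str, List[Demo]], str],
    teacher: Callable[[str], str],
    n_samples: int = 3,
    k: int = 2,
) -> Tuple[str, str]:
    """Choose one action and report which model produced it."""
    examples = retrieve(demos, observation, k)
    # Sample the cheap student several times with the retrieved examples.
    samples = [student(observation, examples) for _ in range(n_samples)]
    action, votes = Counter(samples).most_common(1)[0]
    # Assumption: "agree" means every sample is identical.
    if votes == n_samples:
        return action, "student"
    # Assumption: on disagreement, pay for a single teacher call instead.
    return teacher(observation), "teacher"
```

With stub callables in place of real model calls, `cascade_step` returns the student's action when all samples match and the teacher's action otherwise. The cost saving in such a scheme depends on how often the student samples agree, since every disagreement adds a teacher call on top of the student samples.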
