Inference-Time Distillation: Cost-Efficient Agents Without Fine-Tuning or Manual Prompt Engineering

Vishnu Sarukkai; Asanshay Gupta; James Hong; Michaël Gharbi; Kayvon Fatahalian

arXiv:2512.02543·cs.LG·April 21, 2026

TL;DR

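This paper introduces a cost-efficient inference-time distillation method for large language model agents. A cheaper student model retrieves teacher demonstrations as in-context examples and relies on a fallback strategy, which avoids fine-tuning and manual prompt engineering and so preserves agility (rapid iteration without human time bottlenecks).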

Contribution

It demonstrates an inference-time distillation approach built from established techniques (retrieval-based dynamic in-context learning and self-consistency cascades) that reduces cost while preserving accuracy, without retraining or manual prompt engineering.

Findings

01

Achieves a 2.5x cost reduction on ALFWorld while maintaining accuracy

02

Attains 3.5x cost reduction on AppWorld, recovering 79% of teacher accuracy

03

Provides empirical guidance on key design choices for cost-performance tradeoffs

Abstract

Deploying LLM agents at scale typically requires choosing between quality and cost. Existing cost-reduction approaches fail to preserve agility: the ability to iterate rapidly without human time bottlenecks. Prompt engineering is brittle and slows iteration, while fine-tuning requires multi-day training and commitment to fixed designs; both are impractical for iterative workflows and time-sensitive batch jobs. We demonstrate that established inference-time techniques (dynamic in-context learning and self-consistency cascades) can be leveraged to shift the cost-accuracy Pareto frontier while preserving agility. Practitioners run the teacher on a small task subset to collect demonstrations, then immediately deploy a cheaper student on the remainder. At each step, the system retrieves relevant teacher demonstrations as in-context examples. When multiple student samples agree, we proceed;…
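The per-step loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `retrieve`, `cascade_step`, `student`, and `teacher` are hypothetical, the Jaccard word-overlap retriever stands in for whatever retrieval method the paper actually uses, and the fallback to the teacher on disagreement is an assumption, since the abstract is cut off before it states what happens when student samples disagree.

```python
from collections import Counter
from typing import Callable, List, Tuple

# A demonstration is a (observation, teacher_action) pair collected by
# running the teacher on a small task subset.
Demo = Tuple[str, str]


def _tokens(text: str) -> set:
    """Lowercased whitespace tokens of a string."""
    return set(text.lower().split())


def retrieve(demos: List[Demo], observation: str, k: int = 2) -> List[Demo]:
    """Return the k demos whose observations best overlap the query.

    Assumption: Jaccard word overlap is a stand-in for a real retriever.
    """
    query = _tokens(observation)

    def score(demo: Demo) -> float:
        words = _tokens(demo[0])
        union = query | words
        return len(query & words) / len(union) if union else 0.0

    return sorted(demos, key=score, reverse=True)[:k]


def cascade_step(
    observation: str,
    demos: List[Demo],
    student: Callable[[str, List[Demo]], str],
    teacher: Callable[[str], str],
    n_samples: int = 3,
    min_agree: int = 2,
    k: int = 2,
) -> Tuple[str, str]:
    """Choose one action for the current step.

    Samples the student n_samples times with retrieved demos in context.
    If at least min_agree samples match, the majority action is used.
    Otherwise the step is handed to the teacher (assumed fallback).
    Returns (action, source) where source is "student" or "teacher".
    """
    examples = retrieve(demos, observation, k)
    votes = Counter(student(observation, examples) for _ in range(n_samples))
    action, count = votes.most_common(1)[0]
    if count >= min_agree:
        return action, "student"
    return teacher(observation), "teacher"
```

In this sketch, `student` and `teacher` would wrap calls to a cheaper and a more expensive model respectively. The values of `n_samples`, `min_agree`, and `k` are placeholders rather than settings reported by the paper; raising `min_agree` sends more steps to the teacher, trading cost for accuracy.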
