Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference
Ziming Dong, Hardik Sharma, Evan O'Toole, Jaya Prakash Champati, Kui Wu

TL;DR
This paper introduces LLM Shepherding, a cost-efficient inference framework in which a large language model (LLM) supplies only a short hint to a small language model (SLM). The hint improves SLM accuracy while significantly reducing inference cost, outperforming existing routing and cascading methods.
Contribution
The paper proposes a novel token-level budget control framework for SLM-LLM collaboration, improving cost efficiency and accuracy over prior methods.
Findings
Reduces inference costs by 42-94% across benchmarks.
Achieves up to 2.8x cost reduction compared to baselines.
Effectively improves SLM accuracy with minimal LLM hints.
Abstract
Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches, namely routing and cascading, treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to the SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is…
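The hint mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Prices`, `truncate_hint`, `build_slm_prompt`, and `output_cost`, the prompt wording, and the linear per-token cost model are all assumptions made for illustration. The sketch shows only the idea that the hint length acts as a budget knob, where a zero-length hint means the LLM is not used and a full-length hint means the LLM produces the complete response.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """Hypothetical per-output-token prices (assumed, not from the paper)."""
    slm_per_token: float
    llm_per_token: float


def truncate_hint(llm_tokens: list[str], fraction: float) -> list[str]:
    """Keep the first `fraction` of the LLM's tokens as the hint prefix."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    n = round(len(llm_tokens) * fraction)
    return llm_tokens[:n]


def build_slm_prompt(query: str, hint_tokens: list[str]) -> str:
    """Attach the hint to the query; with no hint the SLM sees the query alone."""
    if not hint_tokens:
        return query
    return f"{query}\n\nHint (start of a solution): {' '.join(hint_tokens)}"


def output_cost(hint_len: int, slm_len: int, prices: Prices) -> float:
    """Assumed linear cost: LLM hint tokens plus SLM completion tokens."""
    return hint_len * prices.llm_per_token + slm_len * prices.slm_per_token


if __name__ == "__main__":
    # Stand-in for a full LLM response of 10 tokens (hypothetical data).
    llm_response = [f"tok{i}" for i in range(10)]
    hint = truncate_hint(llm_response, 0.3)  # a 30% prefix
    prompt = build_slm_prompt("Solve: 12 * 7 + 5", hint)
    print(prompt)
    print(output_cost(len(hint), slm_len=50, prices=Prices(0.001, 0.03)))
```

In a real system the LLM would be asked to stop generating after the hint budget rather than having a full response truncated afterwards, since the saving comes from never paying for the remaining tokens.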
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
