Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference

Ziming Dong; Hardik Sharma; Evan O'Toole; Jaya Prakash Champati; Kui Wu

arXiv:2601.22132·cs.LG·January 30, 2026

Open Access

TL;DR

This paper introduces LLM Shepherding, a cost-efficient inference framework in which an SLM answers with the help of a short hint from an LLM. The hints improve SLM accuracy and cut inference cost, outperforming existing routing and cascading methods.

Contribution

The paper proposes a novel token-level budget control framework for SLM-LLM collaboration, improving cost efficiency and accuracy over prior methods.

Findings

01. Reduces inference costs by 42-94% across benchmarks.

02. Achieves up to 2.8x cost reduction compared to baselines.

03. Effectively improves SLM accuracy with minimal LLM hints.

Abstract

Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches (routing and cascading) treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to the SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is…
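The hint mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Prices`, `take_hint`, `build_slm_prompt`, and `shepherd_cost` are hypothetical, whitespace splitting stands in for a real tokenizer, the prompt wording is invented, and the cost model counts only generated tokens at made-up per-token prices. It shows how a hint fraction of 0 reduces to answering with the SLM alone and a fraction of 1 reduces to a full LLM response, with partial hints in between.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """Hypothetical per-token output prices (arbitrary units)."""
    llm_per_token: float
    slm_per_token: float


def take_hint(llm_response: str, hint_fraction: float) -> str:
    """Return the leading fraction of an LLM response as a hint.

    Whitespace-separated words stand in for tokens (an assumption).
    """
    if not 0.0 <= hint_fraction <= 1.0:
        raise ValueError("hint_fraction must be in [0, 1]")
    words = llm_response.split()
    return " ".join(words[: round(hint_fraction * len(words))])


def build_slm_prompt(question: str, hint: str) -> str:
    """Attach the hint to the question; the wording is illustrative only."""
    if not hint:
        return question  # no hint: the SLM answers on its own
    return f"{question}\n\nHint (start of a reference solution):\n{hint}"


def shepherd_cost(hint_fraction: float, llm_tokens: int,
                  slm_tokens: int, prices: Prices) -> float:
    """Simplified cost of one query under the assumed price model.

    The LLM is billed for the hint prefix only; the SLM is billed for its
    own response unless the LLM already produced the full answer.
    """
    if not 0.0 <= hint_fraction <= 1.0:
        raise ValueError("hint_fraction must be in [0, 1]")
    llm_cost = round(hint_fraction * llm_tokens) * prices.llm_per_token
    slm_cost = 0.0 if hint_fraction == 1.0 else slm_tokens * prices.slm_per_token
    return llm_cost + slm_cost


if __name__ == "__main__":
    prices = Prices(llm_per_token=10.0, slm_per_token=1.0)
    for fraction in (0.0, 0.2, 1.0):
        print(fraction, shepherd_cost(fraction, 200, 100, prices))
```

Under these assumed prices, a 20% hint costs a quarter of a full LLM response. Whether such a hint is enough for the SLM to answer correctly is what the paper's predictor is meant to decide, and that part is not modeled here.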

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications