Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads

Justice Owusu Agyemang; Jerry John Kponyo; Elliot Amponsah; Godfred Manu Addo Boakye; Kwame Opuni-Boachie Obour Agyekum

arXiv:2604.12301 · cs.DC · April 15, 2026


TL;DR

This study evaluates seven tactics for reducing cloud LLM token usage on coding-agent workloads and finds that both the savings and the best combination of tactics depend on the workload.

Contribution

The paper systematically measures and compares seven token-reduction tactics, provides an open-source implementation, and reports how effectiveness varies by workload class.

Findings

01 Local routing plus prompt compression saves 45-79% of cloud tokens.

02 The full tactic set achieves 51% savings on RAG-heavy workloads.

03 The optimal set of tactics varies with workload type.

Abstract

We present a systematic measurement study of seven tactics for reducing cloud LLM token usage when a small local model can act as a triage layer in front of a frontier cloud model. The tactics are: (1) local routing, (2) prompt compression, (3) semantic caching, (4) local drafting with cloud review, (5) minimal-diff edits, (6) structured intent extraction, and (7) batching with vendor prompt caching. We implement all seven in an open-source shim that speaks both MCP and the OpenAI-compatible HTTP surface, supporting any local model via Ollama and any cloud model via an OpenAI-compatible endpoint. We evaluate each tactic individually, in pairs, and in a greedy-additive subset across four coding-agent workload classes (edit-heavy, explanation-heavy, general chat, RAG-heavy). We measure tokens saved, dollar cost, latency, and routing accuracy. Our headline finding is that T1 (local…
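The "greedy-additive subset" evaluation mentioned in the abstract can be illustrated with a minimal sketch. This is one common reading of greedy-additive selection (repeatedly add the single tactic that most improves measured savings, and stop when nothing helps), not necessarily the authors' exact procedure. The function names `greedy_additive` and `toy_savings`, and the per-tactic gain values, are hypothetical placeholders and are not measurements from the paper.

```python
from typing import Callable, FrozenSet, List, Tuple

def greedy_additive(
    tactics: List[str],
    measure: Callable[[FrozenSet[str]], float],
) -> Tuple[List[str], float]:
    """Grow a tactic subset one tactic at a time.

    `measure` is assumed to return the fraction of cloud tokens saved
    (higher is better) when the given subset of tactics is enabled.
    """
    chosen: List[str] = []
    best = measure(frozenset())          # baseline: no tactics enabled
    remaining = list(tactics)
    while remaining:
        # Score every one-tactic extension of the current subset.
        scored = [(measure(frozenset(chosen + [t])), t) for t in remaining]
        top_score, top_tactic = max(scored)
        if top_score <= best:            # no extension improves savings
            break
        chosen.append(top_tactic)
        remaining.remove(top_tactic)
        best = top_score
    return chosen, best

# Hypothetical per-tactic gains, for illustration only (NOT paper results).
TOY_GAINS = {
    "local_routing": 0.30,
    "prompt_compression": 0.20,
    "semantic_caching": 0.0,
}

def toy_savings(subset: FrozenSet[str]) -> float:
    """Toy model: each tactic removes a fixed share of the remaining tokens."""
    kept = 1.0
    for tactic in subset:
        kept *= 1.0 - TOY_GAINS[tactic]
    return 1.0 - kept

chosen, best = greedy_additive(list(TOY_GAINS), toy_savings)
print(chosen, round(best, 2))
```

In a real measurement harness, `measure` would replay a workload class through the shim with the given tactics enabled and compare cloud token counts against a no-tactic baseline; because that is expensive, a greedy search needs far fewer runs than evaluating all subsets of seven tactics.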
