Token Sugar: Making Source Code Sweeter for LLMs through Token-Efficient Shorthand
Zhensu Sun, Chengran Yang, Xiaoning Du, Zhou Yang, Li Li, David Lo

TL;DR
Token Sugar replaces common verbose code patterns with reversible shorthands, cutting token counts by up to 15.1% in source code and up to 11.2% during LLM generation while keeping Pass@1 scores nearly unchanged.
Contribution
The paper presents a systematic approach to mine high-frequency code patterns and create reversible shorthands, enhancing token efficiency at the semantic level for LLM training.
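As a rough illustration of what mining high-frequency patterns could look like, the sketch below counts call shapes in a tiny corpus with Python's standard `ast` module. The function name `call_shapes`, the placeholder notation `_`, and the two-snippet corpus are assumptions made for this example; the paper's actual mining procedure is not described in this summary and may differ substantially.

```python
import ast
from collections import Counter


def call_shapes(source: str) -> Counter:
    """Count call expressions, abstracting every positional argument to '_'.

    Hypothetical helper for illustration only, not the paper's miner.
    """
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            placeholders = ", ".join("_" for _ in node.args)
            shape = ast.unparse(node.func) + "(" + placeholders + ")"
            counts[shape] += 1
    return counts


# Toy corpus (assumed); a real miner would scan a large code dataset.
corpus = [
    "for i in range(len(xs)):\n    print(xs[i])\n",
    "for j in range(len(ys)):\n    total += ys[j]\n",
]

totals = Counter()
for snippet in corpus:
    totals.update(call_shapes(snippet))

# Shapes that recur across snippets are candidates for a shorthand.
candidates = [shape for shape, n in totals.items() if n >= 2]
```

Abstracting arguments to placeholders is one simple way to let `range(len(xs))` and `range(len(ys))` count as the same pattern; richer abstractions over whole statements would be needed for realistic boilerplate.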
Findings
Up to 15.1% token reduction in source code
Up to 11.2% token savings during generation
Maintains near-identical Pass@1 scores
Abstract
Large language models (LLMs) have shown exceptional performance in code generation and understanding tasks, yet their high computational costs hinder broader adoption. One important factor is the inherent verbosity of programming languages, such as unnecessary formatting elements and lengthy boilerplate code. This leads to inflated token counts in both input and generated outputs, which increases inference costs and slows down the generation process. Prior work improves this through simplifying programming language grammar, reducing token usage across both code understanding and generation tasks. However, it is confined to syntactic transformations, leaving significant opportunities for token reduction unrealized at the semantic level. In this work, we propose Token Sugar, a concept that replaces frequent and verbose code patterns with reversible, token-efficient shorthand in the…
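To make the idea of a reversible shorthand concrete, here is a minimal sketch of a round-trip transformation. The two shorthands (`§rl(...)` and `§main:`), the `SUGARS` table, and the function names `sweeten` and `desugar` are all hypothetical choices for this example, not shorthands or APIs taken from the paper. The sketch also assumes the marker character `§` never appears in the original source, which is what keeps the mapping lossless.

```python
import re

# Hypothetical sugar table (illustrative, not from the paper). Each entry is:
# (verbose regex, shorthand template, shorthand regex, verbose template)
SUGARS = [
    (
        re.compile(r"for (\w+) in range\(len\((\w+)\)\):"),
        r"for \1 in §rl(\2):",
        re.compile(r"for (\w+) in §rl\((\w+)\):"),
        r"for \1 in range(len(\2)):",
    ),
    (
        re.compile(r'if __name__ == "__main__":'),
        "§main:",
        re.compile(r"§main:"),
        'if __name__ == "__main__":',
    ),
]


def sweeten(code: str) -> str:
    """Replace verbose patterns with their shorthand form."""
    if "§" in code:
        # Assumption: the marker must be unused, or the round trip is ambiguous.
        raise ValueError("source already contains the shorthand marker")
    for verbose_re, short_tpl, _, _ in SUGARS:
        code = verbose_re.sub(short_tpl, code)
    return code


def desugar(code: str) -> str:
    """Expand shorthand back into the original verbose code."""
    for _, _, short_re, verbose_tpl in SUGARS:
        code = short_re.sub(verbose_tpl, code)
    return code


source = (
    "for i in range(len(xs)):\n"
    "    print(xs[i])\n"
    'if __name__ == "__main__":\n'
    "    main()\n"
)
sweet = sweeten(source)
restored = desugar(sweet)
```

Here `restored` equals `source`, and `sweet` is shorter in characters. Character length is only a crude stand-in: actual token savings depend on the tokenizer and on how the shorthand symbols are encoded in the vocabulary.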
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Machine Learning in Materials Science
