Less is More: DocString Compression in Code Generation
Guang Yang, Yu Zhou, Wei Cheng, Xiangyu Zhang, Xiang Chen, Terry Yue Zhuo, Ke Liu, Xin Zhou, David Lo, Taolue Chen

TL;DR
This paper introduces ShortenDoc, a novel DocString compression method for code generation that achieves 25-40% reduction in prompt size while maintaining code quality, thereby improving efficiency and reducing costs in LLM-based software engineering.
Contribution
We propose ShortenDoc, a dedicated compression technique for DocStrings in code generation, outperforming existing methods in preserving code quality at higher compression levels.
Findings
ShortenDoc achieves 25-40% compression on six datasets.
State-of-the-art prompt compression methods reduce prompt size by only about 10% before performance degrades significantly.
Using ShortenDoc reduces token processing costs significantly.
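The compression percentages above can be read as the fraction of prompt tokens removed. The following is a minimal sketch of that arithmetic only, not of ShortenDoc itself: the helper names `token_count` and `compression_ratio` are hypothetical, whitespace splitting is a crude stand-in for a real LLM tokenizer, and the shortened DocString was written by hand for illustration.

```python
def token_count(text: str) -> int:
    """Count whitespace-separated tokens (assumption: a rough proxy for an LLM tokenizer)."""
    return len(text.split())


def compression_ratio(original: str, compressed: str) -> float:
    """Return the fraction of tokens removed, a value between 0 and 1 for a shorter text."""
    n_orig = token_count(original)
    if n_orig == 0:
        # An empty original has nothing to compress.
        return 0.0
    return 1.0 - token_count(compressed) / n_orig


# Hypothetical DocString and a hand-shortened version (not produced by ShortenDoc).
original_doc = (
    "Return the sum of all of the even numbers that are in the given list of integers"
)
compressed_doc = "Return the sum of all even numbers in the given list"

ratio = compression_ratio(original_doc, compressed_doc)
print(f"{ratio:.0%} of tokens removed")
```

In this toy case 6 of 17 tokens are dropped, roughly 35%, which falls inside the 25-40% range reported above. Real measurements would depend on the target model's tokenizer.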
Abstract
The widespread use of Large Language Models (LLMs) in software engineering has intensified the need for improved model and resource efficiency. In particular, for neural code generation, LLMs are used to translate a function/method signature and DocString into executable code. DocStrings, which capture user requirements for the code and are used as the prompt for LLMs, often contain redundant information. Recent advancements in prompt compression have shown promising results in Natural Language Processing (NLP), but their applicability to code generation remains uncertain. Our empirical study shows that state-of-the-art prompt compression methods achieve only about 10% reduction, as further reductions would cause significant performance degradation. In our study, we propose a novel compression method, ShortenDoc, dedicated to DocString compression for code generation. Our extensive…
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Intelligent Tutoring Systems and Adaptive Learning · Logic, programming, and type systems
