Do not copy and paste! Rewriting strategies for code retrieval
Andrea Gurioli, Federico Pennino, Maurizio Gabbrielli

TL;DR
This paper compares LLM-based rewriting strategies for code retrieval, applied to queries and corpora, across six CoIR benchmarks, five encoders, and three rewriters. It also introduces diagnostics that predict when rewriting improves retrieval performance.
Contribution
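It evaluates a hierarchy of three rewriting strategies (stylistic rephrasing, NL-enriched PseudoCode, and full Natural-Language transcription), introduces Delta H as a predictor of when rewriting helps, and identifies the conditions under which LLM rewriting is most effective.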
Findings
Full NL rewriting with joint query-corpus augmentation yields the largest retrieval gains.
Corpus-only rewriting degrades performance in about 62% of configurations.
Delta H, a token-entropy measure, reliably predicts when rewriting improves retrieval.
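This summary does not give the paper's definition of Delta H, so the following is only an illustrative sketch of what a token-entropy difference could look like. It assumes whitespace tokenization, Shannon entropy over the empirical token distribution, and the sign convention "rewritten minus original". All three are assumptions, and the function names `token_entropy` and `delta_h` are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def token_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the empirical token distribution.

    Assumption: tokens are whitespace-separated; the paper may use a
    model tokenizer or a different entropy estimate.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def delta_h(original: str, rewritten: str) -> float:
    """Entropy change caused by rewriting (assumed sign: rewritten - original)."""
    return token_entropy(rewritten) - token_entropy(original)


if __name__ == "__main__":
    code = "for i in range ( n ) : total += a [ i ]"
    rewrite = "sum every element of the list a over its first n positions"
    print(f"Delta H = {delta_h(code, rewrite):.3f} bits")
```

Under this reading, a diagnostic of this kind could be computed from the original and rewritten text alone, without running retrieval, which is what would make it usable for deciding whether a rewrite is worth keeping.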
Abstract
Embedding-based code retrieval often suffers when encoders overfit to surface syntax. Prior work mitigates this by using LLMs to rephrase queries and corpora into a normalized style, but leaves two questions open: how much representational shift helps, and when is the per-query LLM call justified? We study a hierarchy of three rewriting strategies: stylistic rephrasing, NL-enriched PseudoCode, and full Natural-Language transcription, under joint query-corpus (QC, online) and corpus-only (C, offline) augmentation, across six CoIR benchmarks, five encoders, and three rewriters spanning independent model families (Qwen, DeepSeek, Mistral). We are the first to evaluate NL-enriched PseudoCode and snippet-level Natural Language as direct retrieval representations, rather than as transient intermediates. Full NL rewriting with QC yields the largest gains (+0.51 absolute NDCG@10 on CT-Contest…
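The gains above are reported in NDCG@10. For readers unfamiliar with the metric, here is a minimal sketch of the standard computation for a single query, assuming linear gain and that the input list holds the relevance labels of all judged documents in ranked order. The function names are illustrative and not taken from the paper or from the CoIR tooling.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top result is divided by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    Assumption: ranked_relevances covers every judged document for the
    query, so sorting it in descending order gives the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    # One relevant snippet, retrieved at rank 1 versus rank 2.
    print(ndcg_at_k([1, 0, 0]))
    print(ndcg_at_k([0, 1, 0]))
```

Benchmark scores are typically the mean of this per-query value over all queries, so an absolute gain is a difference between two such means.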
