Do not copy and paste! Rewriting strategies for code retrieval

Andrea Gurioli; Federico Pennino; Maurizio Gabbrielli

arXiv:2605.08299·cs.SE·May 12, 2026

TL;DR

This paper compares LLM-based rewriting strategies for code retrieval, applied to queries and corpora, and introduces diagnostics that predict when rewriting improves retrieval across six CoIR benchmarks, five encoders, and three rewriters.

Contribution

It evaluates a hierarchy of three rewriting strategies (stylistic rephrasing, NL-enriched PseudoCode, and full natural-language transcription), introduces Delta H as a predictor of rewriting benefit, and examines when the per-query LLM call is justified.

Findings

01. Full NL rewriting with joint query-corpus augmentation yields the largest retrieval gains.

02. Corpus-only rewriting degrades performance in about 62% of configurations.

03. Delta H token entropy reliably predicts when rewriting improves retrieval.
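
This page does not define Delta H. One plausible reading is the change in Shannon entropy of a snippet's token distribution before and after rewriting. The sketch below illustrates that reading only: the whitespace tokenizer, the function names `token_entropy` and `delta_h`, and the sign convention (rewritten minus original) are all assumptions, not the paper's definition.

```python
import math
from collections import Counter


def token_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the empirical token distribution.

    Assumption: tokens are whitespace-separated; the paper may use a
    model tokenizer or a different entropy estimate.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def delta_h(original: str, rewritten: str) -> float:
    """Hypothetical Delta H: entropy of the rewrite minus entropy of the original."""
    return token_entropy(rewritten) - token_entropy(original)


if __name__ == "__main__":
    code = "for i in range ( n ) : total += i"
    nl = "add up every integer from zero up to n minus one"
    print(f"Delta H = {delta_h(code, nl):+.3f} bits")
```

Under this reading, a per-query diagnostic could be computed before deciding whether to pay for the LLM rewrite; whether the paper uses it that way is not stated in the text available here.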

Abstract

Embedding-based code retrieval often suffers when encoders overfit to surface syntax. Prior work mitigates this by using LLMs to rephrase queries and corpora into a normalized style, but leaves two questions open: how much representational shift helps, and when is the per-query LLM call justified? We study a hierarchy of three rewriting strategies: stylistic rephrasing, NL-enriched PseudoCode, and full Natural-Language transcription, under joint query-corpus (QC, online) and corpus-only (C, offline) augmentation, across six CoIR benchmarks, five encoders, and three rewriters spanning independent model families (Qwen, DeepSeek, Mistral). We are the first to evaluate NL-enriched PseudoCode and snippet-level Natural Language as direct retrieval representations, rather than as transient intermediates. Full NL rewriting with QC yields the largest gains (+0.51 absolute NDCG@10 on CT-Contest…
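
The abstract reports gains in NDCG@10. As a reminder of what that metric measures, here is a minimal sketch of the standard linear-gain formulation. It is not the paper's evaluation code, and the function names are illustrative. It also makes a simplifying assumption: the ideal ranking is built from the relevance labels passed in, so the list should cover every relevant document for the query.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k positions of a ranking."""
    # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), and so on.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    # One relevant snippet, retrieved at rank 2 instead of rank 1.
    print(round(ndcg_at_k([0, 1, 0, 0]), 4))
```

With binary relevance and a single relevant snippet per query, the score depends only on the rank at which that snippet is retrieved, which is why a rewrite that moves it from rank 2 to rank 1 changes the per-query score from about 0.63 to 1.0.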
