SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization

Dhruv Gupta; Gayathri Ganesh Lakshmy; Yiqing Xie

arXiv:2506.20081 · cs.CL · June 27, 2025

TL;DR

This paper analyzes biases in code retrieval systems, revealing reliance on superficial textual features and bias towards well-documented code, and proposes SACL to mitigate these issues, improving retrieval and code generation performance.

Contribution

Introduces SACL, a semantic-augmented reranking framework that reduces textual bias in code retrieval, enhancing both retrieval accuracy and code generation quality.

Findings

01

Retrievers rely heavily on surface-level textual features.

02

Retrievers favor well-documented code, even when the documentation is irrelevant.

03

SACL improves Recall@1 by 12.8% / 9.4% / 7.0% on HumanEval / MBPP / SWE-Bench-Lite.

Abstract

Retrieval-Augmented Code Generation (RACG) is a critical technique for enhancing code generation by retrieving relevant information. In this work, we conduct an in-depth analysis of code retrieval by systematically masking specific features while preserving code functionality. Our discoveries include: (1) although trained on code, current retrievers heavily rely on surface-level textual features (e.g., docstrings, identifier names), and (2) they exhibit a strong bias towards well-documented code, even if the documentation is irrelevant. Based on our discoveries, we propose SACL, a framework that enriches textual information and reduces bias by augmenting code or structural knowledge with semantic information. Extensive experiments show that SACL substantially improves code retrieval (e.g., by 12.8% / 9.4% / 7.0% Recall@1 on HumanEval / MBPP / SWE-Bench-Lite), which also leads to better…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification