CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking
Tarun Suresh, Revanth Gangi Reddy, Yifei Xu, Zach Nussbaum, Andriy Mulyar, Brandon Duderstadt, Heng Ji

TL;DR
CoRNStack is a large, high-quality contrastive dataset for code that improves code retrieval and reranking, leading to better performance in complex real-world software tasks like bug localization.
Contribution
We introduce CoRNStack, a curated contrastive dataset for code, enabling state-of-the-art retrieval and reranking models across multiple programming languages.
Findings
State-of-the-art performance in code retrieval tasks.
Significant improvement in code reranking quality.
Enhanced bug localization in GitHub repositories.
Abstract
Effective code retrieval plays a crucial role in advancing code generation, bug fixing, and software maintenance, particularly as software systems increase in complexity. While current code embedding models have demonstrated promise in retrieving code snippets for small-scale, well-defined tasks, they often underperform in more demanding real-world applications such as bug localization within GitHub repositories. We hypothesize that a key issue is their reliance on noisy and inconsistent datasets for training, which impedes their ability to generalize to more complex retrieval scenarios. To address these limitations, we introduce CoRNStack, a large-scale, high-quality contrastive training dataset for code that spans multiple programming languages. This dataset is curated using consistency filtering to eliminate noisy positives and is further enriched with mined hard negatives, thereby…
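The abstract names two curation steps, consistency filtering of positives and mining of hard negatives, without giving their details here. The sketch below is only a toy illustration of the general idea, not the authors' procedure. It assumes a precomputed query-by-document similarity matrix in which pair i is (query i, document i); the rank-based rule, the function names, and the `top_k` and `num_negatives` parameters are all hypothetical choices made for illustration.

```python
# Toy sketch (assumed rules, not the paper's exact method).
# SIM[i][j] is an assumed similarity score between query i and document j;
# the labeled positive for query i is document i.
SIM = [
    [0.9, 0.2, 0.1, 0.3],
    [0.1, 0.8, 0.7, 0.2],
    [0.6, 0.7, 0.3, 0.5],  # noisy pair: its own document scores lowest
    [0.2, 0.1, 0.4, 0.9],
]


def consistency_filter(sim, top_k=1):
    """Keep pair i only if its paired document ranks within the top_k for query i."""
    kept = []
    for i, row in enumerate(sim):
        # Rank = number of documents scoring strictly higher than the paired one.
        rank = sum(1 for score in row if score > row[i])
        if rank < top_k:
            kept.append(i)
    return kept


def mine_hard_negatives(sim, kept, num_negatives=2):
    """For each kept query, pick the highest-scoring documents other than its positive."""
    negatives = {}
    for i in kept:
        candidates = [j for j in range(len(sim[i])) if j != i]
        candidates.sort(key=lambda j: sim[i][j], reverse=True)
        negatives[i] = candidates[:num_negatives]
    return negatives


kept_pairs = consistency_filter(SIM, top_k=1)
hard_negatives = mine_hard_negatives(SIM, kept_pairs, num_negatives=2)
print(kept_pairs)
print(hard_negatives)
```

In this toy matrix the third pair is dropped because three other documents outscore its labeled positive, and each surviving query receives its two most similar non-positive documents as hard negatives. In practice the scores would come from an embedding model rather than a hand-written table.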
Code & Models
- 🤗 lightonai/LateOn-Code-edge · model · 3.3k dl · ♡ 26
- 🤗 nomic-ai/nomic-embed-code · model · 99k dl · ♡ 118
- 🤗 nomic-ai/CodeRankLLM · model · 2.7k dl · ♡ 21
- 🤗 lightonai/LateOn-Code · model · 215 dl · ♡ 25
- 🤗 nomic-ai/CodeRankEmbed · model · 203k dl · ♡ 57
- 🤗 nomic-ai/nomic-embed-code-GGUF · model · 1.6k dl · ♡ 14
- 🤗 Mungert/nomic-embed-code-GGUF · model · 654 dl · ♡ 1
- 🤗 swankier/nomic-embed-code · model · 10 dl
- 🤗 lightonai/LateOn-Code-pretrain · model · 23 dl · ♡ 2
- 🤗 lightonai/LateOn-Code-edge-pretrain · model · 8 dl · ♡ 3
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Machine Learning in Bioinformatics
