CLARC: C/C++ Benchmark for Robust Code Search

Kaicheng Wang; Liyan Huang; Weike Fang; Weihang Wang

arXiv:2603.04484·cs.SE·March 6, 2026

Open Access · 3 Reviews

TL;DR

CLARC is a new C/C++ benchmark dataset designed to evaluate code search models' robustness, highlighting their reliance on lexical cues over semantic understanding, especially under challenging conditions.

Contribution

We introduce CLARC, a comprehensive C/C++ code search benchmark with real-world queries, robust validation, and stress-testing features to evaluate model robustness beyond superficial cues.

Findings

01

Models show significant performance drops under challenging conditions.

02

Current models rely heavily on lexical features rather than semantic understanding.

03

CLARC is publicly available for benchmarking and research.
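Finding 02 can be made concrete with a toy example. The sketch below renames identifiers in a C snippet, in the spirit of the anonymized-identifier setting the reviewers describe, and measures how much word overlap with a query survives. It is not the authors' pipeline: the function names, the keyword list, the `v0`, `v1`, ... naming scheme, and the overlap measure are all assumptions made for illustration, and the regex ignores strings, comments, and library names.

```python
import re

# Assumption: a small, incomplete set of C keywords that must not be renamed.
C_KEYWORDS = {
    "int", "char", "void", "float", "double", "long", "short", "unsigned",
    "signed", "const", "static", "struct", "sizeof", "return", "for",
    "while", "do", "if", "else", "switch", "case", "default", "break",
    "continue",
}

IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*\b")


def anonymize_identifiers(code: str) -> str:
    """Replace each non-keyword identifier with v0, v1, ... in order of first use."""
    mapping: dict[str, str] = {}

    def rename(match: re.Match) -> str:
        name = match.group(0)
        if name in C_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]

    return IDENTIFIER.sub(rename, code)


def lexical_overlap(query: str, code: str) -> float:
    """Fraction of query words that also appear as alphabetic tokens in the code."""
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    code_words = set(re.findall(r"[a-z]+", code.lower()))
    return len(query_words & code_words) / len(query_words) if query_words else 0.0


QUERY = "sum the values of an array"
CODE = (
    "int sum_array(int *values, int count) { int total = 0; "
    "for (int i = 0; i < count; i++) total += values[i]; return total; }"
)
ANONYMIZED = anonymize_identifiers(CODE)
```

Here `lexical_overlap(QUERY, CODE)` is 0.5, while `lexical_overlap(QUERY, ANONYMIZED)` is 0.0 even though the function's behavior is unchanged. A retriever that leans on such overlap has nothing left to match once the names are gone.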

Abstract

Efficient code retrieval is critical for developer productivity, yet existing benchmarks largely focus on Python and rarely stress-test robustness beyond superficial lexical cues. To address the gap, we introduce an automated pipeline for code search datasets and present CLARC, a C/C++ benchmark built from real-world GitHub repositories. CLARC contains 1,245 query-code pairs for evaluation and 5,472 pairs for training. The benchmark incorporates LLM-generated natural language queries validated through rigorous human scoring and hypothesis testing. To analyze contextual requirements effectively, our pipeline starts by ensuring code compilability. It then categorizes code snippets by dependency complexity, distinguishing whether the code relies on custom-defined types or helper functions. The pipeline also enables CLARC to stress-test retrieval robustness by introducing challenging…
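The dependency-based categorization mentioned in the abstract could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the group labels, the `Snippet` fields, and the rule that helper functions take precedence over custom types are hypothetical, and the sketch assumes the referenced types and helpers have already been extracted from compilable code.

```python
from dataclasses import dataclass, field
from enum import Enum


class DependencyGroup(Enum):
    # Hypothetical labels; the paper's own category names are not given here.
    STANDALONE = "standalone"
    CUSTOM_TYPES = "custom_types"
    HELPER_FUNCTIONS = "helper_functions"


@dataclass
class Snippet:
    name: str
    # Project-defined types and helper functions the snippet refers to.
    custom_types: set[str] = field(default_factory=set)
    helper_functions: set[str] = field(default_factory=set)


def categorize(snippet: Snippet) -> DependencyGroup:
    """Assign a snippet to a group by the kind of context it depends on."""
    # Assumption: a snippet that calls helper functions is placed in the
    # helper group even if it also uses custom types.
    if snippet.helper_functions:
        return DependencyGroup.HELPER_FUNCTIONS
    if snippet.custom_types:
        return DependencyGroup.CUSTOM_TYPES
    return DependencyGroup.STANDALONE
```

For example, `categorize(Snippet("list_push", custom_types={"Node"}))` returns `DependencyGroup.CUSTOM_TYPES`. Grouping snippets this way lets retrieval scores be reported separately by how much surrounding context a snippet needs.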

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. CLARC is a **novel benchmark** specifically for C/C++ code search, with unique settings for anonymized identifiers and low-level languages.
2. The experimental design is thorough, with a wide range of models tested across different evaluation settings. The inclusion of low-level language scenarios is a valuable addition.
3. The paper is mostly clear and well-structured, with good use of figures and tables to explain the methodology and results.
4. The work addresses a gap in code search resea

Weaknesses

1. While CLARC is comprehensive, the authors could explore **more complex real-world codebases** to further ensure the dataset's robustness.
2. While the automated benchmark generation pipeline is promising, the paper could offer a more in-depth discussion of how well this approach can scale to other programming languages or larger codebases.
3. The evaluation of low-level languages (Assembly, WebAssembly) is insightful but could benefit from a deeper analysis of why models struggle with such co

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The motivation is solid and relevant. Focusing on C/C++ fills a clear gap in existing benchmarks. The dataset being fully compilable enhances reproducibility. The robustness settings are well-designed and informative. The automated data generation pipeline with statistical validation is technically sound. Experimental coverage is broad and clearly demonstrates the weakness of current models.

Weaknesses

1. The dataset is small compared to existing large benchmarks (e.g., CodeSearchNet).
2. The contribution is mainly engineering rather than conceptual.
3. The analysis of why performance drops is shallow, with little insight into model behavior or representation.
4. The LLM-generated query validation focuses on surface quality rather than semantic fidelity.
5. The paper reads more like a dataset report than a research study, and the novelty is limited. Writing is clear but somewhat lengthy an

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. **Fills gap for C/C++ retrieval**: First comprehensive C/C++ robustness benchmark addressing the field's Python bias with real-world compilable code
2. **Systematic robustness evaluation**: Structured testing across identifier anonymization and compilation settings isolates semantic understanding from lexical pattern matching
3. **Scalable automated methodology**: LLM-based query generation with statistical validation enables cost-effective benchmark expansion while reducing knowledge contami

Weaknesses

1. **Questionable prevalence of target language** The paper claims C/C++ represents "industrially prevalent languages," but this assertion lacks supporting evidence. While C/C++ has importance in systems programming, languages like Java, JavaScript, Python, and C# arguably have broader industrial adoption across web development, enterprise applications, and data science. The authors should justify why C/C++ specifically addresses an industrial need, or broaden their claim to acknowledge the more

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Natural Language Processing Techniques