TL;DR
CL4SE introduces a benchmark and taxonomy for evaluating how different context types affect the performance of large language models on software engineering tasks.
Contribution
It provides the first standardized framework and dataset for systematic evaluation of SE-specific context learning effects on multiple LLM-driven tasks.
Findings
Context learning improves overall performance by 24.7%.
Procedural context boosts code review accuracy by up to 33%.
Project-specific context enhances code summarization BLEU scores by 14.78%.
Abstract
Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without model fine-tuning. Despite its success, existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to quantify the heterogeneous effects of different contexts across core SE workflows. To address this gap, we propose CL4SE (Context Learning for Software Engineering), a comprehensive benchmark featuring a fine-grained taxonomy of four SE-oriented context types (interpretable examples, project-specific context, procedural decision-making context, and positive & negative context), each mapped to a representative task (code generation, code summarization, code review, and patch correctness assessment). We construct high-quality datasets comprising over…
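The taxonomy described in the abstract pairs each context type with one representative task. A minimal Python sketch of that pairing follows. The names `ContextType`, `TASK_FOR_CONTEXT`, and `build_prompt`, as well as the prompt layout, are hypothetical illustrations and are not taken from the CL4SE benchmark itself.

```python
from enum import Enum


class ContextType(Enum):
    """The four SE-oriented context types named in the abstract."""
    INTERPRETABLE_EXAMPLES = "interpretable examples"
    PROJECT_SPECIFIC = "project-specific context"
    PROCEDURAL_DECISION_MAKING = "procedural decision-making context"
    POSITIVE_AND_NEGATIVE = "positive & negative context"


# Context type -> representative task, in the order the abstract lists them.
TASK_FOR_CONTEXT = {
    ContextType.INTERPRETABLE_EXAMPLES: "code generation",
    ContextType.PROJECT_SPECIFIC: "code summarization",
    ContextType.PROCEDURAL_DECISION_MAKING: "code review",
    ContextType.POSITIVE_AND_NEGATIVE: "patch correctness assessment",
}


def build_prompt(context_type: ContextType, context: str, query: str) -> str:
    """Assemble a prompt that places the context before the task input.

    The layout is an assumption made for illustration only; the benchmark
    may format its prompts differently.
    """
    task = TASK_FOR_CONTEXT[context_type]
    return (
        f"Task: {task}\n"
        f"Context ({context_type.value}):\n{context}\n\n"
        f"Input:\n{query}"
    )


if __name__ == "__main__":
    print(build_prompt(
        ContextType.PROJECT_SPECIFIC,
        "def helper(x): return x + 1",
        "def run(v): return helper(v) * 2",
    ))
```

Keeping the pairing in a single dictionary makes it straightforward to evaluate one context type at a time, which is the kind of per-context comparison the benchmark is designed to support.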
