CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow
Nathanaël Beau, Benoît Crabbé

TL;DR
CodeInsight presents a curated dataset of 3,409 Python coding examples from Stack Overflow, including intent, code snippets, and unit tests, designed to improve code generation models and analyze their strengths and weaknesses.
Contribution
The paper introduces a new, high-quality dataset for code generation with detailed annotations and contamination reduction, enabling better model training and evaluation.
Findings
Models show varied performance across different task categories.
Refined dataset reduces data contamination, improving evaluation reliability.
GPT-4's performance on the dataset is analyzed.
Abstract
We introduce a novel dataset tailored for code generation, aimed at aiding developers in common tasks. Our dataset provides examples that include a clarified intent, the associated code snippet, and an average of three related unit tests. It encompasses a range of libraries such as Pandas, Numpy, and Regex, along with more than 70 standard libraries in Python code derived from Stack Overflow. Comprising 3,409 examples crafted by Python experts, our dataset is designed for both model finetuning and standalone evaluation. To complement unit-test evaluation, we categorize examples to enable finer-grained analysis, enhancing the understanding of models' strengths and weaknesses in specific coding tasks. The examples have been refined to reduce data contamination, a process confirmed by the performance of three leading models: Mistral 7B, CodeLLaMa 13B, and…
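To make the evaluation idea in the abstract concrete, here is a minimal sketch of how an example made of an intent, a code snippet, and a few unit tests might be scored, with pass rates grouped by category. It is an illustration only: the `Example` class, its field names, the `dedupe` task, and the `"builtin"` category label are all hypothetical and are not taken from the dataset's actual schema or the authors' evaluation harness.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Example:
    """Hypothetical record layout; the real dataset's schema may differ."""
    intent: str
    reference_code: str
    tests: List[str] = field(default_factory=list)
    category: str = "uncategorized"


def passes_all_tests(candidate_code: str, tests: List[str]) -> bool:
    """Run the candidate, then each assert-style test, in one shared namespace.

    exec() runs arbitrary code: only use this on trusted input or in a sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        # Any error (syntax, runtime, or failed assertion) counts as a failure.
        return False
    return True


def pass_rate_by_category(examples: List[Example],
                          candidates: List[str]) -> Dict[str, float]:
    """Fraction of candidates passing all tests, grouped by example category."""
    totals: Dict[str, int] = {}
    passed: Dict[str, int] = {}
    for example, candidate in zip(examples, candidates):
        totals[example.category] = totals.get(example.category, 0) + 1
        if passes_all_tests(candidate, example.tests):
            passed[example.category] = passed.get(example.category, 0) + 1
    return {cat: passed.get(cat, 0) / totals[cat] for cat in totals}


# Illustrative example (invented for this sketch, not drawn from the dataset).
example = Example(
    intent="Remove duplicate items from a list while preserving order",
    reference_code="def dedupe(items):\n    return list(dict.fromkeys(items))",
    tests=[
        "assert dedupe([1, 2, 1, 3]) == [1, 2, 3]",
        "assert dedupe([]) == []",
        "assert dedupe(['a', 'a']) == ['a']",
    ],
    category="builtin",
)

wrong_candidate = "def dedupe(items):\n    return items"
rates = pass_rate_by_category(
    [example, example], [example.reference_code, wrong_candidate]
)
```

Grouping results by category, rather than reporting a single pass rate, is what allows the kind of per-task strengths-and-weaknesses analysis the abstract describes.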
Taxonomy
Topics: Semantic Web and Ontologies · Machine Learning and Data Classification · Software Engineering Research
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Dropout · Dense Connections
