StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow
Ziyu Yao, Daniel S. Weld, Wei-Peng Chen, Huan Sun

TL;DR
This paper introduces StaQC, a large dataset of question-code pairs mined from Stack Overflow with a neural network that identifies standalone code solutions more accurately than heuristic collection methods. The dataset supports tasks that connect natural language with programming languages, such as code retrieval and annotation.
Contribution
It presents a systematic method for mining high-quality question-code pairs and provides what the authors describe as the largest dataset of its kind to date, supporting research in code retrieval and annotation.
Findings
The proposed neural network outperforms heuristic methods by at least 15% in F1 and accuracy.
StaQC contains approximately 148K Python and 120K SQL question-code pairs.
Case studies show StaQC's effectiveness in developing models for associating natural language with code.
Abstract
Stack Overflow (SO) has been a great source of natural language questions and their code solutions (i.e., question-code pairs), which are critical for many tasks including code retrieval and annotation. In most existing research, question-code pairs were collected heuristically and tend to have low quality. In this paper, we investigate a new problem of systematically mining question-code pairs from Stack Overflow (in contrast to heuristically collecting them). It is formulated as predicting whether or not a code snippet is a standalone solution to a question. We propose a novel Bi-View Hierarchical Neural Network which can capture both the programming content and the textual context of a code snippet (i.e., two views) to make a prediction. On two manually annotated datasets in Python and SQL domain, our framework substantially outperforms heuristic methods with at least 15% higher F1…
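The formulation in the abstract, predicting whether a code snippet is a standalone solution to a question, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Candidate` fields, the two simple heuristics (`select_first`, `select_all`), and the toy answer post are hypothetical and are not taken from the paper or the StaQC release. The sketch only shows how each snippet carries two views (its code and its surrounding text) and how a heuristic's predictions would be scored with F1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One code snippet from an answer post (hypothetical schema)."""
    question: str        # question title
    text_before: str     # prose preceding the snippet (textual context view)
    code: str            # the snippet itself (programming content view)
    text_after: str      # prose following the snippet (textual context view)
    is_standalone: bool  # gold label: does this snippet alone answer the question?


def select_first(answer: List[Candidate]) -> List[bool]:
    """Illustrative heuristic: only the first snippet is predicted as a solution."""
    return [i == 0 for i in range(len(answer))]


def select_all(answer: List[Candidate]) -> List[bool]:
    """Illustrative heuristic: every snippet is predicted as a solution."""
    return [True] * len(answer)


def f1_score(gold: List[bool], pred: List[bool]) -> float:
    """F1 of the positive (standalone solution) class."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up answer post: a setup snippet followed by the actual solution.
answer = [
    Candidate("How to reverse a list in Python?",
              "Suppose you have this list:", "xs = [1, 2, 3]",
              "You can reverse it like so:", is_standalone=False),
    Candidate("How to reverse a list in Python?",
              "You can reverse it like so:", "xs[::-1]",
              "", is_standalone=True),
]
gold = [c.is_standalone for c in answer]

for heuristic in (select_first, select_all):
    print(heuristic.__name__, round(f1_score(gold, heuristic(answer)), 3))
```

On this toy post both heuristics fail in different ways: one picks the setup snippet and misses the solution, the other accepts the setup snippet along with the solution. A learned classifier that reads both the code and its surrounding text is meant to avoid exactly these errors.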
