Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow
Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, Graham Neubig

TL;DR
This paper introduces a novel method for mining high-quality aligned natural language and code pairs from Stack Overflow, significantly improving coverage and accuracy over existing heuristics, and demonstrating cross-language applicability.
Contribution
A new approach combining handcrafted and neural network-based features to automatically identify high-quality NL-code pairs from Stack Overflow data.
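As a rough illustration of how two kinds of features might be combined, the sketch below scores one candidate NL-code pair. Everything named here is an assumption made for illustration: the three structural features, the `weights` dictionary, and the logistic combination are not taken from the paper, and the `correspondence` value stands in for whatever score a trained neural model would assign to the pair.

```python
import ast
import math


def structural_features(snippet, full_block):
    """Hand-crafted features of a candidate snippet (illustrative choices only)."""
    try:
        ast.parse(snippet)
        parses = 1.0
    except SyntaxError:
        parses = 0.0
    return {
        # Does the candidate parse as Python on its own?
        "parses_as_python": parses,
        # Is the candidate the entire code block it was cut from?
        "is_whole_block": 1.0 if snippet.strip() == full_block.strip() else 0.0,
        # How many lines does the candidate span?
        "num_lines": float(len(snippet.strip().splitlines())),
    }


def pair_score(features, correspondence, weights, bias=0.0):
    """Combine structural features and a correspondence score with a logistic function.

    `correspondence` is a placeholder for a model-produced score; `weights` is a
    hypothetical dict with one entry per feature name plus "correspondence".
    """
    z = bias + weights["correspondence"] * correspondence
    z += sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


block = "xs = [1, 2, 3]\nxs[::-1]"
feats = structural_features("xs[::-1]", block)
weights = {"parses_as_python": 1.0, "is_whole_block": -0.5,
           "num_lines": -0.1, "correspondence": 2.0}
score = pair_score(feats, correspondence=0.8, weights=weights)
```

In practice the weights would be learned from labeled pairs rather than set by hand; the values above exist only to make the example run.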
Findings
The method substantially improves both the coverage and the accuracy of NL-code pair mining
Effective cross-language generalization with minimal labeled data
Improves data quality for code synthesis, retrieval, and summarization tasks
Abstract
For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the…
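The baseline heuristic mentioned in the abstract, pairing a post's title with the code in its accepted answer, can be sketched as follows. The `post` dictionary layout (`title`, `answers`, `accepted`, `body_html`) is a hypothetical structure chosen for this example, not the format of any Stack Overflow data dump, and taking the first `<pre><code>` block is one assumed reading of "the code in the accepted answer".

```python
import html
import re

# Matches the contents of a <pre><code>...</code></pre> block in an answer body.
CODE_BLOCK = re.compile(
    r"<pre[^>]*>\s*<code[^>]*>(.*?)</code>\s*</pre>", re.DOTALL
)


def baseline_pair(post):
    """Pair the post title with the first code block of the accepted answer.

    Returns (title, code), or None when there is no accepted answer
    or the accepted answer contains no code block.
    """
    for answer in post["answers"]:
        if not answer["accepted"]:
            continue
        match = CODE_BLOCK.search(answer["body_html"])
        if match:
            # Answer bodies are HTML, so entities such as &lt; must be decoded.
            return post["title"], html.unescape(match.group(1)).strip()
    return None


post = {
    "title": "How to reverse a list in Python?",
    "answers": [
        {"accepted": False, "body_html": "<p>Try reversed().</p>"},
        {
            "accepted": True,
            "body_html": "<p>Use slicing:</p>"
                         "<pre><code>xs = [1, 2, 3]\nxs[::-1]</code></pre>",
        },
    ],
}
pair = baseline_pair(post)
```

Note that the sketch returns the whole block, including the setup line `xs = [1, 2, 3]`, even though only `xs[::-1]` answers the title. It also yields nothing for posts without an accepted answer. Both behaviors illustrate, in a small way, how such a heuristic can be limited in correctness and coverage.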
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques
