Programming Language Agnostic Mining of Code and Language Pairs with Sequence Labeling Based Question Answering
Changran Hu, Akshara Reddi Methukupalli, Yutong Zhou, Chen Wu, Yubo Chen

TL;DR
This paper introduces a PL-agnostic sequence labeling approach for mining natural language and programming language pairs from Stack Overflow posts, enabling transferability across multiple PLs and creating a large-scale high-quality dataset.
Contribution
The paper proposes a novel sequence labeling method using BIO tagging for PL-agnostic NL-PL pair extraction, improving transferability and scalability across diverse programming languages.
Findings
SLQA (Sequence Labeling based Question Answering) outperforms existing methods on benchmark datasets.
The Lang2Code corpus contains 1.4 million high-quality NL-PL pairs.
SLQA demonstrates strong transferability across multiple programming languages.
Abstract
Mining aligned natural language (NL) and programming language (PL) pairs is a critical task for NL-PL understanding. Existing methods applied specialized hand-crafted features or separately trained models for each PL. However, they usually suffered from low transferability across multiple PLs, especially for niche PLs with less annotated data. Fortunately, a Stack Overflow answer post is essentially a sequence of text and code blocks, and its global textual context can provide PL-agnostic supplementary information. In this paper, we propose a Sequence Labeling based Question Answering (SLQA) method to mine NL-PL pairs in a PL-agnostic manner. In particular, we propose to apply the BIO tagging scheme instead of the conventional binary scheme to mine code solutions, which often span multiple blocks of a post. Experiments on current single-PL single-block benchmarks and a…
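To illustrate why BIO tagging can help with multi-block solutions, the sketch below decodes a per-block tag sequence into solution spans and compares it with a binary scheme. It is a minimal illustration, not the authors' implementation: the function names `decode_bio` and `decode_binary` are hypothetical, and it assumes one tag per block of a post (the excerpt does not state the tagging granularity) and that the tags have already been predicted by some model.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (first block index, last block index), inclusive


def decode_bio(tags: List[str]) -> List[Span]:
    """Group per-block BIO tags into spans of consecutive blocks.

    "B" starts a new span, "I" continues the current one, "O" is outside.
    A stray "I" with no open span is treated leniently as a start
    (an assumption of this sketch).
    """
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i - 1))  # close the previous span
            start = i
        elif tag == "I":
            if start is None:
                start = i
        else:  # "O"
            if start is not None:
                spans.append((start, i - 1))
                start = None
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans


def decode_binary(labels: List[int]) -> List[Span]:
    """Binary scheme: every maximal run of 1s becomes a single span."""
    return decode_bio(["I" if label == 1 else "O" for label in labels])


# Hypothetical post with five blocks; blocks 1-2 form one solution
# and block 3 is a separate, adjacent solution.
bio_spans = decode_bio(["O", "B", "I", "B", "O"])
binary_spans = decode_binary([0, 1, 1, 1, 0])
print(bio_spans)     # [(1, 2), (3, 3)]
print(binary_spans)  # [(1, 3)]
```

In this toy example the binary labels merge the two adjacent solutions into one span, while the "B" tag marks where the second solution begins.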
Taxonomy
Topics: Natural Language Processing Techniques · Software Engineering Research · Topic Modeling
