ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search
Zehan Li, Jianfei Zhang, Chuantao Yin, Yuanxin Ouyang, Wenge Rong

TL;DR
ProCQA is a large-scale, community-sourced programming question answering dataset extracted from StackOverflow. Used with a modality-agnostic contrastive pre-training approach, it improves the alignment of text and code representations in code retrieval models.
Contribution
The paper introduces ProCQA, a new large-scale dataset from StackOverflow, and a modality-agnostic contrastive pre-training method for better text-code alignment.
Findings
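The pre-trained model shows significant performance improvements across a wide range of code retrieval benchmarks.
Contrastive pre-training on ProCQA improves the alignment of text and code representations.
The model outperforms previous models pre-trained primarily on bimodal and unimodal pairs extracted from CodeSearchNet.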
Abstract
Retrieval-based code question answering seeks to match user queries in natural language to relevant code snippets. Previous approaches typically rely on pretraining models using crafted bi-modal and uni-modal datasets to align text and code representations. In this paper, we introduce ProCQA, a large-scale programming question answering dataset extracted from the StackOverflow community, offering naturally structured mixed-modal QA pairs. To validate its effectiveness, we propose a modality-agnostic contrastive pre-training approach to improve the alignment of text and code representations of current code language models. Compared to previous models that primarily employ bimodal and unimodal pairs extracted from CodeSearchNet for pre-training, our model exhibits significant performance improvements across a wide range of code retrieval benchmarks.
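The abstract does not spell out the training objective, so the following is only a minimal sketch of one common form of contrastive pre-training: an InfoNCE loss with in-batch negatives. It assumes each question and answer has already been encoded into a vector by some encoder (not shown), and it treats both sides the same way whether they contain text, code, or a mix, which is one reading of "modality-agnostic". The function names, the temperature value of 0.05, and the toy embeddings are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce_loss(question_embs, answer_embs, temperature=0.05):
    """Mean in-batch contrastive loss (hypothetical helper, not the paper's code).

    Answer i is the positive for question i; every other answer in the
    batch serves as a negative. temperature=0.05 is an assumed value.
    """
    if not question_embs or len(question_embs) != len(answer_embs):
        raise ValueError("need equally sized, non-empty batches")
    q = [l2_normalize(v) for v in question_embs]
    a = [l2_normalize(v) for v in answer_embs]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of question i to every answer, scaled by temperature.
        logits = [sum(x * y for x, y in zip(qi, aj)) / temperature for aj in a]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching answer as the target class.
        total += log_denom - logits[i]
    return total / len(q)


# Toy 2-D embeddings standing in for encoder outputs (assumed values).
questions = [[1.0, 0.0], [0.0, 1.0]]
matched_answers = [[1.0, 0.0], [0.0, 1.0]]
swapped_answers = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = info_nce_loss(questions, matched_answers)
loss_swapped = info_nce_loss(questions, swapped_answers)
print(loss_matched < loss_swapped)
```

In this toy batch the loss is lower when each question is paired with its own answer than when the pairs are swapped. That is the signal a contrastive objective uses to pull matching question and answer representations together.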
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques
Methods: ALIGN
