ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search

Zehan Li; Jianfei Zhang; Chuantao Yin; Yuanxin Ouyang; Wenge Rong

arXiv:2403.16702·cs.CL·March 26, 2024·1 cites

TL;DR

ProCQA is a large-scale, community-sourced dataset for code question answering that enhances code retrieval models through a novel contrastive pre-training approach, leading to improved alignment of text and code representations.

Contribution

The paper introduces ProCQA, a new large-scale dataset from StackOverflow, and a modality-agnostic contrastive pre-training method for better text-code alignment.

Findings

1. Significant performance improvements on code retrieval benchmarks.
2. Effective alignment of text and code representations.
3. Demonstrated advantages over previous bimodal and unimodal pre-training methods.

Abstract

Retrieval-based code question answering seeks to match user queries in natural language to relevant code snippets. Previous approaches typically rely on pretraining models using crafted bi-modal and uni-modal datasets to align text and code representations. In this paper, we introduce ProCQA, a large-scale programming question answering dataset extracted from the StackOverflow community, offering naturally structured mixed-modal QA pairs. To validate its effectiveness, we propose a modality-agnostic contrastive pre-training approach to improve the alignment of text and code representations of current code language models. Compared to previous models that primarily employ bimodal and unimodal pairs extracted from CodeSearchNet for pre-training, our model exhibits significant performance improvements across a wide range of code retrieval benchmarks.
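
The abstract does not spell out the training objective, so the following is only a minimal sketch of one common way to set up contrastive pre-training on QA pairs: an InfoNCE loss with in-batch negatives. It assumes a single shared encoder has already mapped each question and each answer to a vector, regardless of whether the content is text, code, or a mix (which is how "modality-agnostic" is read here). The function names, the toy vectors, and the temperature value are hypothetical and are not taken from the paper or its repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(question_vecs, answer_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch of (question, answer) embedding pairs.

    answer_vecs[i] is the positive for question_vecs[i]; every other answer
    in the batch serves as a negative. The temperature is an assumed value.
    """
    total = 0.0
    for i, q in enumerate(question_vecs):
        logits = [cosine(q, a) / temperature for a in answer_vecs]
        # Numerically stable log-sum-exp over all answers in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching answer.
        total += log_denom - logits[i]
    return total / len(question_vecs)


# Toy embeddings (hypothetical): two questions and their answers.
questions = [[1.0, 0.0], [0.0, 1.0]]
answers_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each answer close to its question
answers_swapped = [[0.1, 0.9], [0.9, 0.1]]   # answers paired with the wrong question

aligned_loss = info_nce_loss(questions, answers_aligned)
swapped_loss = info_nce_loss(questions, answers_swapped)
print(aligned_loss < swapped_loss)
```

Under this sketch, the loss is lower when each question embedding sits closest to its own answer, which is the alignment property that retrieval relies on at query time: candidates are ranked by the same similarity score used in training.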

Code & Models

Repositories

jordane95/procqa (PyTorch, official)

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques

Methods: ALIGN