The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Dung Nguyen Manh; Nam Le Hai; Anh T. V. Dau; Anh Minh Nguyen; Khanh Nghiem; Jin Guo; Nghi D. Q. Bui

arXiv:2305.06156·cs.CL·October 31, 2023·1 cites

TL;DR

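The Vault is a large, high-quality multilingual dataset of code-text pairs designed to improve large language models' ability to understand and generate code. Models fine-tuned on it outperform the same models trained on existing datasets such as CodeSearchNet on code generation, code search, and code summarization.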

Contribution

The paper introduces The Vault, a new multilingual dataset of 43 million high-quality code-text pairs, extracted with a combination of rule-based and deep learning-based methods for training code-focused language models.

Findings

1. Code LLMs fine-tuned on The Vault outperform the same models trained on other datasets such as CodeSearchNet.

2. The gains are reported on code generation, code search, and code summarization.

3. Detailed analyses assess how programming language and docstrings affect model performance.

Abstract

We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We present methods for thoroughly extracting samples that use both rule-based and deep learning-based methods to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. Our extensive evaluations on common coding tasks including code generation, code search and code summarization show that when fine-tuning Code Large Language Models on The Vault, such models outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our datasets to assess the effects of various programming languages and docstrings on the performance of such models.
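The abstract mentions rule-based extraction of code-text pairs but gives no details of the rules. As a rough illustration only, the sketch below pairs Python functions with their docstrings using the standard `ast` module and applies one hypothetical rule (a minimum docstring word count). The function name `extract_pairs`, the `min_doc_words` threshold, and the sample source are assumptions made for this example and are not taken from the paper or its repository.

```python
import ast


def extract_pairs(source: str, min_doc_words: int = 3):
    """Return (name, docstring, code) tuples for documented functions.

    Hypothetical rule for illustration: drop functions whose docstring
    is missing or shorter than `min_doc_words` words.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc is None or len(doc.split()) < min_doc_words:
                continue  # filtered out by the assumed rule
            code = ast.get_source_segment(source, node)
            pairs.append((node.name, doc, code))
    return pairs


# Illustrative input: one well-documented function, one with a
# placeholder docstring, and one with no docstring at all.
SAMPLE = '''
def add(a, b):
    """Return the sum of two numbers."""
    return a + b

def todo():
    """TODO"""
    pass

def undocumented(x):
    return x
'''

pairs = extract_pairs(SAMPLE)
for name, doc, _code in pairs:
    print(name, "->", doc)
```

Only `add` passes the assumed filter here. A real pipeline would need per-language parsers and many more rules, plus the learned filtering the abstract refers to.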

Code & Models

Repositories

fsoft-ai4code/thevault
Official

Models

Fsoft-AIC/Codebert-docstring-inconsistency (Hugging Face)
model · 4 downloads · 4 likes

Taxonomy

Topics: Software Engineering Research · Software Reliability and Analysis Research · Software Testing and Debugging Techniques