The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui

TL;DR
The Vault is a large, high-quality multilingual code-text dataset for training large language models to understand and generate code. Models fine-tuned on it outperform the same models trained on other datasets, such as CodeSearchNet, on common coding tasks.
Contribution
The paper introduces The Vault, a multilingual dataset of 43 million high-quality code-text pairs, extracted with a combination of rule-based and deep learning-based methods, for training code-focused language models.
Findings
Code large language models fine-tuned on The Vault outperform the same models trained on other datasets such as CodeSearchNet.
The gains are reported on code generation, code search, and code summarization.
The authors also analyze how the choice of programming language and the presence of docstrings affect model performance.
Abstract
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We present methods for thoroughly extracting samples that use both rule-based and deep learning-based methods to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. Our extensive evaluations on common coding tasks including code generation, code search and code summarization show that when fine-tuning Code Large Language Models on The Vault, such models outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our datasets to assess the effects of various programming languages and docstrings on the performance of such models.
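To make the idea of rule-based extraction of code-text pairs concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the function name `extract_pairs`, the `min_doc_words` threshold, and the single "drop very short docstrings" rule are all hypothetical choices made for illustration. It uses only the standard-library `ast` module to pair each documented function with its docstring.

```python
import ast


def extract_pairs(source: str, min_doc_words: int = 3):
    """Return (name, docstring, code) tuples for documented functions.

    Illustrative only: the word-count rule is an assumed filter,
    not one taken from the paper.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            # Hypothetical rule: skip missing or very short docstrings.
            if doc is None or len(doc.split()) < min_doc_words:
                continue
            # Recover the exact source text of the function definition.
            code = ast.get_source_segment(source, node)
            pairs.append((node.name, doc, code))
    return pairs


sample = '''
def add(a, b):
    """Return the sum of two numbers."""
    return a + b

def helper(x):
    """TODO"""
    return x

def undocumented(x):
    return x
'''

pairs = extract_pairs(sample)
for name, doc, _ in pairs:
    print(name, "->", doc)
```

In this sketch only `add` survives: `helper` has a one-word docstring and `undocumented` has none. A real pipeline would need a parser per language and far richer filters, and the abstract additionally mentions deep learning-based methods that this example does not attempt to model.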
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Software Testing and Debugging Techniques
