NaturalCC: A Toolkit to Naturalize the Source Code Corpus
Yao Wan, Yang He, Jian-Guo Zhang, Yulei Sui, Hai Jin, Guandong Xu, Caiming Xiong, Philip S. Yu

TL;DR
NaturalCC is a versatile toolkit designed to facilitate research in big code analysis by providing efficient, extensible, and user-friendly tools for natural language and programming language tasks, supporting reproducibility and implementation of state-of-the-art models.
Contribution
It introduces an extensible toolkit built on Fairseq and PyTorch that simplifies reproducing and developing big code analysis models with multi-GPU support and user interfaces.
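To illustrate what "extensible" can mean in a toolkit of this kind, here is a minimal sketch of a decorator-based registry, a pattern Fairseq uses for models and tasks. Every name in it (`MODEL_REGISTRY`, `register_model`, `BaseModel`, `build_model`, `LastTokenCompletion`) is hypothetical and chosen for illustration; it is not NaturalCC's actual API.

```python
from typing import Callable, Dict, Type

# Hypothetical registry; these names are illustrative, not NaturalCC's API.
MODEL_REGISTRY: Dict[str, Type["BaseModel"]] = {}


class BaseModel:
    """Minimal base class that every registered model extends."""

    def predict(self, tokens: list) -> str:
        raise NotImplementedError


def register_model(name: str) -> Callable[[Type[BaseModel]], Type[BaseModel]]:
    """Class decorator that records a model class under a unique name."""

    def wrapper(cls: Type[BaseModel]) -> Type[BaseModel]:
        if name in MODEL_REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        if not issubclass(cls, BaseModel):
            raise TypeError(f"{cls.__name__} must extend BaseModel")
        MODEL_REGISTRY[name] = cls
        return cls

    return wrapper


@register_model("last_token_completion")
class LastTokenCompletion(BaseModel):
    """Toy code-completion baseline: repeats the last token seen."""

    def predict(self, tokens: list) -> str:
        return tokens[-1] if tokens else ""


def build_model(name: str) -> BaseModel:
    """Look up a registered model by name, as a command-line flag might."""
    if name not in MODEL_REGISTRY:
        raise KeyError(f"unknown model {name!r}; known: {sorted(MODEL_REGISTRY)}")
    return MODEL_REGISTRY[name]()
```

Under this assumed design, adding a new baseline means writing one decorated class; the lookup code and any front end that selects models by name stay unchanged.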
Findings
Supports multiple big code analysis tasks like code completion and comment generation
Enables fast training with multi-GPU and mixed-precision processing
Includes state-of-the-art baseline models for various tasks
Abstract
We present NaturalCC, an efficient and extensible toolkit to bridge the gap between natural language and programming language, and to facilitate research on big code analysis. Using NaturalCC, researchers from both the natural language and programming language communities can quickly and easily reproduce state-of-the-art baselines and implement their own approaches. NaturalCC is built upon Fairseq and PyTorch, providing (1) efficient computation with multi-GPU and mixed-precision data processing for fast model training, (2) a modular and extensible framework that makes it easy to reproduce or implement an approach for big code analysis, and (3) a command line interface and a graphical user interface to demonstrate each model's performance. Currently, we have included several state-of-the-art baselines across different tasks (e.g., code completion, code comment generation, and code…
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Software Testing and Debugging Techniques
