Preprocessing Source Code Comments for Linguistic Models
Sergey Matskevich, Colin S. Gordon

TL;DR
This paper investigates the quality and structure of Python comments from large open source datasets to inform better preprocessing techniques for linguistic models used in code understanding and generation.
Contribution
It provides an analysis of comment quality and structure, highlighting the effects of different filtering methods on training and evaluating comment-based models.
Findings
Comments vary significantly in structure and quality.
Naive filtering may discard useful comment information.
In-depth filtering improves the suitability of comments for model training.
Abstract
Comments are an important part of source code and a primary source of documentation. This has driven interest in using large bodies of comments to train or evaluate tools that consume or produce them -- such as generating oracles or even code from comments, or automatically generating code summaries. Most of this work makes strong assumptions about the structure and quality of comments, such as assuming they consist mostly of proper English sentences. However, we know little about the actual quality of existing comments for these use cases. Comments often contain unique structures and elements that are not seen in other types of text, and filtering or extracting information from them requires extra care. This paper explores the contents and quality of Python comments drawn from the 840 most popular open source projects on GitHub and 8422 projects from the SriLab dataset, and the…
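To make the concern about naive filtering concrete, here is a minimal sketch, not the authors' pipeline. It pulls `#` comments out of a Python snippet with the standard-library `tokenize` module and applies a deliberately simple "proper English sentence" heuristic. The helper names `extract_comments` and `looks_like_sentence`, the regular expression, and the three-word threshold are all assumptions made for illustration. Docstrings are not handled.

```python
import io
import re
import tokenize


def extract_comments(source: str) -> list[str]:
    """Return the text of every '#' comment in a Python source string."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            # Drop the leading '#' characters and surrounding whitespace.
            comments.append(tok.string.lstrip("#").strip())
    return comments


def looks_like_sentence(comment: str) -> bool:
    """Hypothetical naive filter: capitalized, ends with sentence
    punctuation, and has at least three whitespace-separated words."""
    has_shape = re.match(r"^[A-Z].*[.!?]$", comment) is not None
    return has_shape and len(comment.split()) >= 3


SAMPLE = '''\
# Compute the running mean of the values.
def mean(xs):
    total = sum(xs)  # TODO: handle empty input
    # type: ignore
    return total / len(xs)  # returns float
'''

comments = extract_comments(SAMPLE)
kept = [c for c in comments if looks_like_sentence(c)]
dropped = [c for c in comments if not looks_like_sentence(c)]

print("kept:   ", kept)
print("dropped:", dropped)
```

On this sample the heuristic keeps only the first comment. It discards the `TODO` note and the short `returns float` remark along with the `type: ignore` directive, even though the first two carry information about the code's behavior. This is the kind of loss a sentence-shaped filter can cause.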
Taxonomy
Topics: Natural Language Processing Techniques · Software Engineering Research · Topic Modeling
