On the Lexical Distinguishability of Source Code
Martin Velez, Dong Qiu, You Zhou, Earl T. Barr, Zhendong Su

TL;DR
This paper investigates the core words in source code that are essential for distinguishing functions, providing quantitative evidence for keyword-based programming and new insights into programming models.
Contribution
It introduces a method to identify minimal distinguishing subsets of code, called Minsets, demonstrating that functions have core 'wheat' words crucial for understanding.
Findings
Functions contain a small set of key words that distinguish them
Quantitative evidence supports keyword-based programming approaches
Large corpus analysis confirms the presence of minimal distinguishing code subsets
Abstract
Natural language is robust against noise. The meaning of many sentences survives the loss of words, sometimes many of them. Some words in a sentence, however, cannot be lost without changing the meaning of the sentence. We call these words "wheat" and the rest "chaff". The word "not" in the sentence "I do not like rain" is wheat and "do" is chaff. For human understanding of the purpose and behavior of source code, we hypothesize that the same holds. To quantify the extent to which we can separate code into "wheat" and "chaff", we study a large (100M LOC), diverse corpus of real-world projects in Java. Since methods represent natural, likely distinct units of code, we use the ~9M Java methods in the corpus to approximate a universe of "sentences." We extract their wheat by computing the function's minimal distinguishing subset (Minset). Our results confirm that functions contain work…
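The idea of a minimal distinguishing subset can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not describe. It assumes each function is reduced to a plain set of words and uses a greedy heuristic, so the result is small but not guaranteed to be the smallest possible. The name `greedy_minset` and the toy corpus are hypothetical.

```python
from typing import Dict, FrozenSet, Optional, Set


def greedy_minset(target: str, corpus: Dict[str, Set[str]]) -> Optional[FrozenSet[str]]:
    """Return a small subset of corpus[target] that no other function's
    word set fully contains, or None if no such subset exists.

    Assumption: functions are modeled as sets of words (hypothetical lexicon).
    """
    words = corpus[target]
    # Other functions that still contain every word chosen so far.
    remaining = [name for name in corpus if name != target]
    chosen: Set[str] = set()
    while remaining:
        candidates = sorted(words - chosen)  # sorted for deterministic ties
        # Greedy step: pick the word absent from the most remaining functions.
        best = max(
            candidates,
            key=lambda w: sum(w not in corpus[n] for n in remaining),
            default=None,
        )
        if best is None or all(best in corpus[n] for n in remaining):
            # Every word left is shared, so the target cannot be distinguished.
            return None
        chosen.add(best)
        remaining = [n for n in remaining if best in corpus[n]]
    return frozenset(chosen)


# Toy corpus (made up for illustration only).
corpus = {
    "read_file": {"open", "read", "close", "path"},
    "write_file": {"open", "write", "close", "path"},
    "sum_list": {"for", "total", "add", "return"},
}
minset = greedy_minset("read_file", corpus)
```

In this toy corpus the single word `read` separates `read_file` from the other two functions, so it plays the role of "wheat" while `open`, `close`, and `path` are "chaff". A function whose words are all contained in another function's word set has no distinguishing subset, and the sketch returns `None` for it.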
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Testing and Debugging Techniques
