Studying the Difference Between Natural and Programming Language Corpora
Casey Casalnuovo, Kenji Sagae, Prem Devanbu

TL;DR
This paper investigates why source code is more repetitive than natural language. It argues that syntax alone does not explain the difference: because reading and writing code demands substantial mental effort, programmers favor forms familiar to both reader and writer. Two sets of studies comparing different text corpora support the argument.
Contribution
It provides empirical evidence that human choices, beyond syntax, contribute to code repetitiveness, challenging the view that syntax alone explains this phenomenon.
Findings
Repetition in code is partly due to human decision-making.
Technical and second-language (learner) natural language corpora show similar repetitive patterns.
Syntax is not the sole factor influencing repetitiveness.
Abstract
Code corpora, as observed in large software systems, are now known to be far more repetitive and predictable than natural language corpora. But why? Does the difference simply arise from the syntactic limitations of programming languages? Or does it arise from the differences in authoring decisions made by the writers of these natural and programming language texts? We conjecture that the differences are not entirely due to syntax, but also from the fact that reading and writing code is un-natural for humans, and requires substantial mental effort; so, people prefer to write code in ways that are familiar to both reader and writer. To support this argument, we present results from two sets of studies: 1) a first set aimed at attenuating the effects of syntax, and 2) a second, aimed at measuring repetitiveness of text written in other settings (e.g. second language, technical/specialized…
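To make "repetitive" concrete, the sketch below computes a simple proxy: the fraction of n-gram occurrences in a token sequence whose n-gram appears more than once. This is an illustrative assumption, not the measurement used in the paper; the function name `ngram_repetition_rate`, the whitespace tokenization, and the two toy token sequences are all hypothetical.

```python
from collections import Counter


def ngram_repetition_rate(tokens, n=3):
    """Fraction of n-gram occurrences whose n-gram occurs more than once.

    Hypothetical proxy for repetitiveness; not the paper's metric.
    """
    # Slide a window of length n over the token sequence.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Count every occurrence of an n-gram that is seen at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


# Toy inputs (assumed for illustration), split on whitespace.
code_tokens = "for i in range ( n ) : x += i for j in range ( m ) : y += j".split()
prose_tokens = "the quick brown fox jumps over the lazy dog while the cat sleeps".split()

code_rate = ngram_repetition_rate(code_tokens, n=2)
prose_rate = ngram_repetition_rate(prose_tokens, n=2)
print(f"code bigram repetition:  {code_rate:.3f}")
print(f"prose bigram repetition: {prose_rate:.3f}")
```

On real corpora the comparison would need much larger samples and a proper tokenizer for each language; the toy strings here only show how the rate is computed.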
