On the Effect of Semantically Enriched Context Models on Software Modularization

Amir Saeidi (Utrecht University, Netherlands); Jurriaan Hage (Universiteit Utrecht, Netherlands); Ravi Khadka (Utrecht University, Netherlands); Slinger Jansen (Utrecht University, Netherlands)

arXiv:1708.01680·cs.SE·August 8, 2017

TL;DR

This paper introduces context models for source code identifiers to enhance semantic clustering and modularization, significantly improving the quality of software decomposition and topic relevance in open source Java projects.

Contribution

It proposes two novel context models for source code identifiers—type-based abstraction and data dependency graphs—to improve semantic clustering and modularization.

Findings

01. Modularization quality improved by up to 67% with context models.

02. Contextual representations yield more meaningful topics than plain text.

03. Approach validated on 10 open source Java projects.

Abstract

Many existing approaches to program comprehension rely on the linguistic information found in source code, such as identifier names and comments. Semantic clustering is one such technique: it modularizes a system based on the informal semantics of the program, as encoded in the vocabulary used in the source code. Treating the source code as a collection of tokens loses the semantic information embedded within the identifiers. We try to overcome this problem by introducing context models for source code identifiers to obtain a semantic kernel, which can be used both for deriving the topics that run through the system and for clustering them. In the first model, we abstract an identifier to its type representation and build on this notion of context to construct a contextual vector representation of the source code. The second notion of context is defined based on the…
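
To make the first context model more concrete, here is a minimal sketch of the general idea of abstracting identifiers to their types before building vectors. It is not the authors' implementation: the excerpt above does not specify how vectors are weighted or compared, so the symbol table `SYMBOL_TYPES`, the helper names, the raw count weighting, and the use of cosine similarity are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical symbol table mapping identifier names to declared types.
# A real pipeline would obtain this from a Java parser; it is hard-coded here.
SYMBOL_TYPES = {
    "acct": "Account",
    "customerAccount": "Account",
    "total": "BigDecimal",
    "sum": "BigDecimal",
    "log": "Logger",
}


def type_context_vector(identifiers, symbol_types):
    """Build a count vector over types instead of raw identifier tokens.

    Identifiers with no known type are kept unchanged (an assumption).
    """
    return Counter(symbol_types.get(name, name) for name in identifiers)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Two hypothetical classes, represented by the identifiers they use.
class_a = ["acct", "total", "log"]
class_b = ["customerAccount", "sum"]

# Plain token view: the classes share no identifier names.
plain = cosine(Counter(class_a), Counter(class_b))

# Type-abstracted view: both classes work with Account and BigDecimal values.
typed = cosine(
    type_context_vector(class_a, SYMBOL_TYPES),
    type_context_vector(class_b, SYMBOL_TYPES),
)

print(f"plain tokens: {plain:.3f}, type context: {typed:.3f}")
```

In this toy setup the two classes look unrelated when compared on raw tokens but similar once identifiers are abstracted to types, which is the kind of information a pure bag-of-tokens view discards.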
