On Using Machine Learning to Identify Knowledge in API Reference Documentation

Davide Fucci; Alireza Mollaalizadehbahnemiri; Walid Maalej

arXiv:1907.09807·cs.SE·July 24, 2019


TL;DR

This study evaluates machine learning methods for automatically identifying specific types of knowledge in API documentation, demonstrating high accuracy and exploring cross-language generalizability.

Contribution

It compares traditional and deep learning classifiers for multi-label knowledge type detection in API docs, revealing their strengths and limitations across different programming languages.

Findings

01

Deep learning achieves up to 87% AUPRC for individual knowledge types.

02

SVM outperforms deep learning on Concept, Control, Pattern, and Non-Information types.

03

Deep learning with multi-label classification reaches up to 79% MacroAUC, showing effective knowledge detection.

Abstract

Using API reference documentation like JavaDoc is an integral part of software development. Previous research introduced a grounded taxonomy that organizes API documentation knowledge in 12 types, including knowledge about the Functionality, Structure, and Quality of an API. We study how well modern text classification approaches can automatically identify documentation containing specific knowledge types. We compared conventional machine learning (k-NN and SVM) and deep learning approaches trained on manually annotated Java and .NET API documentation (n = 5,574). When classifying the knowledge types individually (i.e., multiple binary classifiers) the best AUPRC was up to 87%. The deep learning and SVM classifiers seem complementary. For four knowledge types (Concept, Control, Pattern, and Non-Information), SVM clearly outperforms deep learning which, on the other hand, is more…

Tables

Table 1: Twelve knowledge types included in reference documentation (adapted from Maalej and Robillard [3]).

Knowledge type	Brief description
Functionality	Describes the capabilities of the API, and what happens when it is used.
Concept	Explains terms used to describe the API behavior or the API implementation.
Directive	Describes what the user is allowed (or not allowed) to do with the API.
Purpose	Explains the rationale for providing the API or for a design decision.
Quality	Describes non-functional attributes of the API, including its implementation.
Control	Describes how the API manages the control-flow and sequence of calls.
Structure	Describes the internal organization of API elements including their relationships.
Pattern	Explains how to get specific results using the API.
Example	Provides examples about the API usage.
Environment	Describes the API usage environment.
Reference	Pointers to external documents.
Non-information	Uninformative, boilerplate text.

Table 2: Overview of the CaDO dataset.

	#documents	Words max.	Words mean	Vocab. size
.NET	2,782	2,874	89	10,630
JDK	2,792	2,099	86	10,763
Total	5,574	2,874	87	17,758

Table 3: Summary of the corpora used to train the GloVe embedding.

ID	Name	Corpus description	#docs	#vocabulary
CC	Common Crawl	General purpose, high-quality text crawled from Internet pages	2.2M	~220,000
CCotf	Common Crawl on-the-fly	Common Crawl where missing words are learned from CaDO	2.2M	~220,000
SO	StackOverflow	StackOverflow questions and answers	20M	~400,000
SOapi	StackOverflow Java and .NET posts	StackOverflow questions and answers tagged as java or .net	4M	~100,000

Table 4: Comparison between deep learning classifiers (trained with embeddings from general purpose and software development corpora), traditional machine learning, and naïve approaches for classifying individual knowledge types in the CaDO dataset. Values report the Area Under Precision-Recall Curve (AUPRC).

	Naïve baselines			Traditional approaches		Deep learning (General Purpose)		Deep learning (Software dev.)
Knowledge Type	MF1	MF2	RAND	k-NN	SVM	RNN_CC	RNN_CCotf	RNN_SO	RNN_SOapi
Functionality	0.69	0.73	0.72	0.76	0.39	0.86	0.84	0.87	0.87
Concept	0.11	0.14	0.12	0.13	0.57	0.25	0.28	0.28	0.28
Directive	0.26	0.16	0.17	0.22	0.04	0.40	0.41	0.41	0.45
Purpose	0.22	0.21	0.17	0.17	0.09	0.36	0.40	0.40	0.41
Quality	0.04	0.04	0.05	0.12	0.13	0.78	0.69	0.68	0.54
Control	0.08	0.12	0.09	0.08	0.81	0.28	0.32	0.30	0.30
Structure	0.37	0.37	0.35	0.38	0.42	0.61	0.56	0.63	0.60
Pattern	0.14	0.17	0.14	0.21	0.59	0.46	0.46	0.48	0.51
Example	0.24	0.23	0.20	0.25	0.60	0.90	0.85	0.90	0.90
Environment	0.04	0.03	0.06	0.16	0.43	0.68	0.80	0.66	0.51
Reference	0.11	0.14	0.16	0.13	0.15	0.35	0.35	0.41	0.30
Non-information	0.29	0.31	0.28	0.33	0.71	0.57	0.58	0.62	0.55
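The AUPRC values in Table 4 summarize how well a classifier ranks documents that contain a knowledge type above those that do not. The source does not state which estimator was used, so the sketch below assumes one common choice, average precision (a step-wise sum over the ranked list). The function name `average_precision` and the toy inputs are illustrative and not taken from the paper.

```python
def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve
    for a single binary label (assumed estimator, not the paper's code)."""
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("AUPRC is undefined without positive examples")

    # Rank documents from the highest to the lowest classifier score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    hits = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at this rank; each positive adds 1/total_positives recall.
            area += hits / rank
    return area / total_positives


# Toy example: 1 = document contains the knowledge type, 0 = it does not.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
```

For a classifier that ranks documents at random, the expected precision equals the share of positive documents, which is one way to read the gap between the naïve baselines and the trained classifiers for frequent types such as Functionality.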

Table 5: Comparison between deep learning classifiers (trained with embeddings from general purpose and software development corpora), traditional machine learning, and naïve approaches for classifying multiple knowledge types in CaDO.

	Naïve baselines		Traditional approaches		Deep learning (General Purpose)		Deep learning (Software dev.)
Metric	MF1	MF2	ML-kNN	OvRSVM	RNN_CC	RNN_CCotf	RNN_SO	RNN_SOapi
Hamming Loss	0.17	0.20	0.18	0.30	0.16	0.14	0.14	0.14
Subset Accuracy	0.00	0.13	0.11	0.02	0.20	0.22	0.19	0.21
MacroPrecision	0.05	0.08	0.41	0.21	0.56	0.66	0.61	0.63
MacroRecall	0.16	0.16	0.24	0.27	0.55	0.39	0.30	0.33
MacroF1	0.10	0.10	0.27	0.24	0.55	0.44	0.40	0.43
MacroAUC	0.62	0.50	0.55	0.61	0.73	0.74	0.78	0.79
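Several of the multi-label metrics in Table 5 can be computed directly from binary label matrices. The sketch below uses the standard definitions of Hamming loss, subset accuracy, and macro-averaged F1 (the mean of per-label F1 scores); whether the paper averaged in exactly this way is an assumption, and the three-label toy data is hypothetical.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual (document, label) decisions that are wrong."""
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / (len(y_true) * len(y_true[0]))


def subset_accuracy(y_true, y_pred):
    """Fraction of documents whose complete label set is predicted exactly."""
    exact = sum(row_t == row_p for row_t, row_p in zip(y_true, y_pred))
    return exact / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    per_label = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # A label that is never present nor predicted contributes 0 here.
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / n_labels


# Three documents, three hypothetical labels.
truth = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
preds = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
```

Subset accuracy only rewards a document when every label matches, which is why it can stay low even when Hamming loss is small.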

Table 6: Comparison between deep learning classifiers (trained with embeddings from general purpose and domain-specific corpora), traditional machine learning, and naïve approaches for classifying API documents based on individual knowledge type in the Python dataset. Values report the Area Under Precision-Recall Curve (AUPRC).

Naïve baselines

Traditional approaches

Deep learning (General Purpose)

Deep learning (Software dev.)

Knowledge Type

MF1

MF2

RAND

k-NN

SVM

RNN_CC

RNN_CCotf

RNN_SO

RNN_SOapi

Functionality

0.89

0.92

0.85

0.94

0.90

0.89

0.95

0.94

Concept

0.29

0.28

0.31

0.26

0.64

0.40

0.33

0.49

0.41

Directive

0.41

0.49

0.42

0.71

0.49

0.44

0.55

0.63

Purpose

0.28

0.25

0.30

0.13

0.46

0.40

0.51

0.39

Quality

0.17

0.19

0.17

0.27

0.20

0.17

0.20

0.32

Control

0.27

0.32

0.24

0.33

0.43

0.46

0.39

0.35

Structure

0.24

0.32

0.11

0.26

0.24

0.30

0.32

Pattern

0.22

0.24

0.29

0.61

0.50

0.30

0.41

0.43

Example

0.36

0.38

0.43

0.44

0.48

0.49

0.51

0.48

Environment

0.16

0.17

0.16

0.37

0.15

0.18

0.17

Reference

0.12

0.17

0.11

0.22

0.16

0.19

0.24

0.25

Non-information

0.23

0.24

0.27

0.61

0.30

0.39

0.30

0.28

Table 7: Comparison between deep learning classifiers (trained with embeddings from general purpose and software development corpora), traditional machine learning, and naïve approaches for classifying multiple knowledge types in the Python dataset.

Naïve

Traditional approaches

Deep learning (General Purpose)

Deep learning (Software dev.)

Metric

MF1

MF2

MLkNN

OvRSVM

RNN_CC

RNN_CCotf

RNN_SO

RNN_SOapi

Hamming Loss

0.23

0.25

0.28

0.35

0.27

0.30

0.26

0.27

Subset Accuracy

0.05

0.02

0.01

0.02

0.03

0.04

0.05

MacroPrecision

0.07

0.10

0.33

0.40

0.36

0.31

MacroRecall

0.08

0.16

0.24

0.26

0.21

0.26

MacroF1

0.07

0.13

0.28

0.30

0.29

0.28

0.25

0.28

MacroAUC

0.50

0.53

0.54

0.60

0.57

0.62

0.64

Equations

F_1 = \frac{2 \times \mathit{TruePositives}}{2 \times \mathit{TruePositives} + \mathit{FalsePositives} + \mathit{FalseNegatives}}

FPR = 1 - \frac{\mathit{TrueNegatives}}{\mathit{TrueNegatives} + \mathit{FalsePositives}}
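As a worked illustration of the F1 and FPR formulas above, the following sketch computes both from raw confusion-matrix counts. The function names and the example counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something the source specifies.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def false_positive_rate(tn, fp):
    """FPR = 1 - TN / (TN + FP), which equals FP / (TN + FP)."""
    denom = tn + fp
    return 1 - tn / denom if denom else 0.0


# Hypothetical counts for one knowledge type on 100 documents.
tp, fp, fn, tn = 40, 10, 20, 30
f1 = f1_score(tp, fp, fn)          # 80 / 110
fpr = false_positive_rate(tn, fp)  # 1 - 30 / 40
```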

P_{ij} = P(j \mid i) = X_{ij} / X_{i}, \quad X_{i} = \sum_{k} X_{ik}.
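The co-occurrence probability above underlies the GloVe embeddings listed in Table 3: X_ij counts how often word j appears in the context of word i, and X_i is the total count over all context words of i. A minimal sketch follows, assuming a symmetric fixed-size window and unweighted counts (GloVe itself weights co-occurrences by distance). The helper names and the toy sentence are illustrative only.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=2):
    """Build X, where X[i][j] counts how often j occurs within `window`
    positions of i (unweighted; a simplification of GloVe's counting)."""
    counts = defaultdict(Counter)
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                counts[word][tokens[ctx]] += 1
    return counts


def cooccurrence_probability(counts, i, j):
    """P_ij = X_ij / X_i, with X_i the sum of X_ik over all context words k."""
    x_i = sum(counts[i].values())
    return counts[i][j] / x_i if x_i else 0.0


tokens = "returns the list returns the map".split()
X = cooccurrence_counts(tokens, window=1)
p_returns_given_the = cooccurrence_probability(X, "the", "returns")
```

For a fixed word i, the probabilities over all context words j sum to 1, which is a quick sanity check on any co-occurrence table.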


Code & Models

Repositories

dfucci/api-doc-kn-identification
Official


Full text

On Using Machine Learning to Identify Knowledge in API Reference Documentation

Davide Fucci

University of Hamburg, Hamburg, Germany

Alireza Mollaalizadehbahnemiri

University of Hamburg, Hamburg, Germany

Walid Maalej

University of Hamburg, Hamburg, Germany

Abstract

Using API reference documentation like JavaDoc is an integral part of software development. Previous research introduced a grounded taxonomy that organizes API documentation knowledge into 12 types, including knowledge about the Functionality, Structure, and Quality of an API. We study how well modern text classification approaches can automatically identify documentation containing specific knowledge types. We compared conventional machine learning (k-NN and SVM) and deep learning approaches trained on manually annotated Java and .NET API documentation (n = 5,574). When classifying the knowledge types individually (i.e., multiple binary classifiers), the best AUPRC was up to 87%. The deep learning and SVM classifiers seem complementary. For four knowledge types (Concept, Control, Pattern, and Non-Information), SVM clearly outperforms deep learning, which, on the other hand, is more accurate for identifying the remaining types. When considering multiple knowledge types at once (i.e., multi-label classification), deep learning outperforms naïve baselines and traditional machine learning, achieving a MacroAUC of up to 79%. We also compared classifiers using embeddings pre-trained on generic text corpora and StackOverflow but did not observe significant improvements. Finally, to assess the generalizability of the classifiers, we re-tested them on a different, unseen Python documentation dataset. Classifiers for Functionality, Concept, Purpose, Pattern, and Directive seem to generalize from Java and .NET to Python documentation. The accuracy related to the remaining types seems API-specific. We discuss our results and how they inform the development of tools for supporting developers sharing and accessing API knowledge. Published article: https://doi.org/10.1145/3338906.3338943

1 Introduction

Software developers reuse software libraries and frameworks through Application Programming Interfaces (APIs). They often rely on reference documentation to identify which API elements are relevant for the task at hand and how the API can be instantiated, configured, and combined [1]. Compared to other knowledge sources, such as tutorials and Q&A portals, reference documentation such as JavaDoc and PyDoc is considered the official API technical documentation. It provides detailed and fundamental information about API elements, components, operations, and structures [2, 3].

As API documentation can be thousands of pages long [4, 5], accessing relevant knowledge can be tedious and time-consuming [1]. Moreover, the information necessary to accomplish a task can be scattered across the documentation pages of multiple elements, such as classes, methods, and properties. Thus, developers try to use other sources to fulfill their information needs. For example, although the Java Development Kit (JDK) API documentation contains more than 7,000 pages, as of early 2019, there are more than 3 million StackOverflow posts tagged as java.

Over the last decade, software engineering researchers studied what information developers need when consulting API documentation [3, 6, 7]. One line of research focuses on automatically matching information needs with the types of knowledge available in the documentation. Maalej and Robillard [3] took a first step in this direction by developing an empirically-validated taxonomy of 12 knowledge types found within API reference documentation. A single documentation page can include several knowledge types (Figure 1). Functionality and Directive are particular types of knowledge needed to accomplish a development task, whereas the Non-information type contains only uninformative boilerplate text [3]. Maalej and Robillard argue that such knowledge categorization allows for a) understanding and improving the documentation quality and b) satisfying developers’ information needs [3].

The research community has shown interest in studying specific knowledge types contained in API reference documentation. For example, Monperrus et al. [8] and Seid et al. [9] studied Directive to prevent the violation of API usage constraints. Robillard and Chhetri [5] filtered Non-information when recommending APIs to developers. However, these automated approaches are based either on linguistic feature engineering [5] or on syntactic patterns [8].

This work investigates how well modern text classification approaches can automatically identify the knowledge types suggested by Maalej and Robillard in API documentation. Based on a dataset of 5,574 labelled Java and .NET documents, we trained, tested, and compared conventional machine learning approaches, i.e., k-Nearest Neighbors (k-NN) and Support Vector Machines (SVM), as well as deep learning approaches, i.e., a recurrent neural network (RNN) with a Long Short-Term Memory (LSTM) layer. The RNN learns features from a semantic representation of general purpose text (i.e., embeddings). Hence, we studied how our results are impacted by training the network using software development-specific corpora from StackOverflow as opposed to a general purpose one. Finally, we studied the generalizability of the classifiers to an unseen dataset obtained from the Python standard library.

This paper makes three contributions. First, we present a detailed classification benchmark for API documentation. The settings include different machine learning approaches and configurations, different word embeddings for the RNN, different datasets for different APIs, as well as various evaluation metrics. Researchers and tool vendors can use the benchmark, for example, to select and optimize a specific classifier for a specific configuration of API and knowledge types. Second, as we share the code and data of this study (https://zenodo.org/badge/latestdoi/194706952), several top-performing classifiers (e.g., AUPRC ≥ 80%) already have practical relevance. Third, our findings and discussion of related work provide insights to researchers, tool vendors, and practitioners on how machine learning can help better organize, access, and share knowledge about APIs.

The rest of the paper is organized as follows. Section 2 describes our research settings, Section 3 presents the configurations of the classifiers, and Section 4 reports their performance. We discuss related work in Section 5 and the implications of our results in Section 6. Finally, Section 7 concludes the paper.

2 Research Settings

This section introduces the research questions, method, and data.

2.1 Research Questions and Method

Maalej and Robillard [3] proposed an empirically-validated taxonomy of 12 knowledge types based on grounded theory and systematic content analysis (17 experienced coders, 279 person-hours effort). Table 1 reports the identified knowledge types, which represent the basis for this work. Our primary goal is to study how well simple machine learning for text classification, without additional feature engineering or advanced natural language processing (NLP) techniques, can identify these knowledge types. That is, our classifiers label a document with one or more knowledge types.
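Labelling a document with one or more knowledge types can be represented as a binary indicator vector over the 12 types in Table 1. The sketch below shows that encoding and how a single column of it yields the target for one binary classifier per knowledge type. The names `KNOWLEDGE_TYPES`, `encode_labels`, and `binary_targets` are hypothetical and not taken from the authors' code.

```python
# The 12 knowledge types, in the order listed in Table 1.
KNOWLEDGE_TYPES = [
    "Functionality", "Concept", "Directive", "Purpose", "Quality", "Control",
    "Structure", "Pattern", "Example", "Environment", "Reference",
    "Non-information",
]


def encode_labels(types):
    """Turn a set of knowledge-type names into a 12-element 0/1 vector."""
    unknown = set(types) - set(KNOWLEDGE_TYPES)
    if unknown:
        raise ValueError(f"unknown knowledge types: {sorted(unknown)}")
    return [1 if kt in types else 0 for kt in KNOWLEDGE_TYPES]


def binary_targets(encoded_docs, knowledge_type):
    """Select one column: the target for a single per-type binary classifier."""
    idx = KNOWLEDGE_TYPES.index(knowledge_type)
    return [row[idx] for row in encoded_docs]


# Two hypothetical annotated documents.
docs = [
    encode_labels({"Functionality", "Directive"}),
    encode_labels({"Non-information"}),
]
directive_targets = binary_targets(docs, "Directive")
```

The full matrix serves the multi-label setting, while each column serves the individual (binary) setting, so both evaluations can share one annotated dataset.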

Bibliography

[1] M. P. Robillard and R. DeLine, “A field study of API learning obstacles,” Empirical Software Engineering, vol. 16, no. 6, pp. 703–732, 2010.
[2] U. Dekel and J. D. Herbsleb, “Improving api documentation usability with knowledge pushing,” in Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 2009, pp. 320–330.
[3] W. Maalej and M. P. Robillard, “Patterns of Knowledge in API Reference Documentation,” IEEE Trans. Softw. Eng., vol. 39, no. 9, pp. 1264–1282, 2013.
[4] G. Petrosyan, M. P. Robillard, and R. De Mori, “Discovering information explaining api types using text classification,” in Proceedings of the 37th International Conference on Software Engineering-Volume 1. IEEE Press, 2015, pp. 869–879.
[5] M. P. Robillard and Y. B. Chhetri, “Recommending reference API documentation,” Empirical Software Engineering, vol. 20, no. 6, pp. 1558–1586, Jul. 2014.
[6] J. Stylos, B. Graf, D. K. Busse, C. Ziegler, R. Ehret, and J. Karstens, “A case study of api redesign for improved usability,” in Visual Languages and Human-Centric Computing, 2008. VL/HCC 2008. IEEE Symposium on. IEEE, 2008, pp. 189–192.
[7] J. Stylos and B. A. Myers, “The implications of method placement on api learnability,” in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering. ACM, 2008, pp. 105–112.
[8] M. Monperrus, M. Eichberg, E. Tekes, and M. Mezini, “What should developers be aware of? An empirical study on the directives of API documentation,” Empirical Software Engineering, vol. 17, no. 6, pp. 703–737, 2011.