TL;DR
This study evaluates machine learning methods for automatically identifying specific types of knowledge in API documentation, demonstrating high accuracy and exploring cross-language generalizability.
Contribution
It compares traditional and deep learning classifiers for multi-label knowledge type detection in API docs, revealing their strengths and limitations across different programming languages.
Findings
Deep learning achieves up to 87% AUPRC for individual knowledge types.
SVM outperforms deep learning on Concept, Control, Pattern, and Non-Information types.
Deep learning with multi-label classification reaches up to 79% MacroAUC, showing effective knowledge detection.
| Knowledge type | Brief description |
|---|---|
| Functionality | Describes the capabilities of the API, and what happens when it is used. |
| Concept | Explains terms used to describe the API behavior or the API implementation. |
| Directive | Describes what the user is allowed (not allowed) to do with the API. |
| Purpose | Explains the rationale for providing the API or for a design decision. |
| Quality | Describes non-functional attributes of the API, including its implementation. |
| Control | Describes how the API manages the control-flow and sequence of calls. |
| Structure | Describes the internal organization of API elements including their relationships. |
| Pattern | Explains how to get specific results using the API. |
| Example | Provides examples about the API usage. |
| Environment | Describes the API usage environment. |
| Reference | Pointers to external documents. |
| Non-information | Uninformative, boilerplate text. |

| API | #documents | Words max. | Words mean | Vocab. size |
|---|---|---|---|---|
| .NET | 2,782 | 2,874 | 89 | 10,630 |
| JDK | 2,792 | 2,099 | 86 | 10,763 |
| Total | 5,574 | 2,874 | 87 | 17,758 |

| ID | Name | Corpus description | #docs | #vocabulary |
|---|---|---|---|---|
| CC | Common Crawl | General purpose, high-quality text crawled from Internet pages | 2.2M | 220,000 |
| CCotf | Common Crawl on-the-fly | Common Crawl where missing words are learned from CaDO | 2.2M | 220,000 |
| SO | StackOverflow | StackOverflow questions and answers | 20M | 400,000 |
| SOapi | StackOverflow Java and .NET posts | StackOverflow questions and answers tagged as java or .net | 4M | 100,000 |

| | Naïve baselines | | | Traditional approaches | | Deep learning (General Purpose) | | Deep learning (Software dev.) | |
|---|---|---|---|---|---|---|---|---|---|
| Knowledge Type | MF1 | MF2 | RAND | k-NN | SVM | RNNCC | RNNCCotf | RNNSO | RNNSOapi |
| Functionality | 0.69 | 0.73 | 0.72 | 0.76 | 0.39 | 0.86 | 0.84 | 0.87 | 0.87 |
| Concept | 0.11 | 0.14 | 0.12 | 0.13 | 0.57 | 0.25 | 0.28 | 0.28 | 0.28 |
| Directive | 0.26 | 0.16 | 0.17 | 0.22 | 0.04 | 0.40 | 0.41 | 0.41 | 0.45 |
| Purpose | 0.22 | 0.21 | 0.17 | 0.17 | 0.09 | 0.36 | 0.40 | 0.40 | 0.41 |
| Quality | 0.04 | 0.04 | 0.05 | 0.12 | 0.13 | 0.78 | 0.69 | 0.68 | 0.54 |
| Control | 0.08 | 0.12 | 0.09 | 0.08 | 0.81 | 0.28 | 0.32 | 0.30 | 0.30 |
| Structure | 0.37 | 0.37 | 0.35 | 0.38 | 0.42 | 0.61 | 0.56 | 0.63 | 0.60 |
| Pattern | 0.14 | 0.17 | 0.14 | 0.21 | 0.59 | 0.46 | 0.46 | 0.48 | 0.51 |
| Example | 0.24 | 0.23 | 0.20 | 0.25 | 0.60 | 0.90 | 0.85 | 0.90 | 0.90 |
| Environment | 0.04 | 0.03 | 0.06 | 0.16 | 0.43 | 0.68 | 0.80 | 0.66 | 0.51 |
| Reference | 0.11 | 0.14 | 0.16 | 0.13 | 0.15 | 0.35 | 0.35 | 0.41 | 0.30 |
| Non-information | 0.29 | 0.31 | 0.28 | 0.33 | 0.71 | 0.57 | 0.58 | 0.62 | 0.55 |
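
AUPRC, the metric the paper reports for the individual knowledge-type classifiers, summarises how well a classifier ranks positive documents above negative ones. One common way to estimate it is average precision. The sketch below is a minimal pure-Python version for illustration only: the function name and the toy labels and scores are assumptions, not the authors' evaluation code.

```python
def average_precision(y_true, scores):
    """Estimate the area under the precision-recall curve (average precision).

    y_true: list of 0/1 labels (1 = document contains the knowledge type).
    scores: classifier confidence for each document.
    Assumes scores are not tied; ties are resolved by input order here.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank documents from most to least confident.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at the rank of each positive
    return ap / total_pos


# Toy example (hypothetical values): positives ranked 1st and 3rd.
labels = [1, 0, 1, 0]
confidences = [0.9, 0.8, 0.7, 0.1]
print(round(average_precision(labels, confidences), 3))
```

A classifier that ranks documents at random has an expected AUPRC close to the share of positive documents, which is why the baseline scores differ so much between frequent types such as Functionality and rare ones such as Quality.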

| | Naïve baselines | | Traditional approaches | | Deep learning (General Purpose) | | Deep learning (Software dev.) | |
|---|---|---|---|---|---|---|---|---|
| Metric | MF1 | MF2 | ML-kNN | OvRSVM | RNNCC | RNNCCotf | RNNSO | RNNSOapi |
| Hamming Loss | 0.17 | 0.20 | 0.18 | 0.30 | 0.16 | 0.14 | 0.14 | 0.14 |
| Subset Accuracy | 0.00 | 0.13 | 0.11 | 0.02 | 0.20 | 0.22 | 0.19 | 0.21 |
| MacroPrecision | 0.05 | 0.08 | 0.41 | 0.21 | 0.56 | 0.66 | 0.61 | 0.63 |
| MacroRecall | 0.16 | 0.16 | 0.24 | 0.27 | 0.55 | 0.39 | 0.30 | 0.33 |
| MacroF1 | 0.10 | 0.10 | 0.27 | 0.24 | 0.55 | 0.44 | 0.40 | 0.43 |
| MacroAUC | 0.62 | 0.50 | 0.55 | 0.61 | 0.73 | 0.74 | 0.78 | 0.79 |
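
Several of the multi-label metrics in the table above can be computed directly from binary indicator matrices (one row per document, one column per knowledge type). The following is a minimal sketch under the usual textbook definitions. The function names and the toy matrices are assumptions for illustration, and the paper's own evaluation code may differ in details such as how undefined precision is handled.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of (document, label) decisions that are wrong."""
    n_cells = len(Y_true) * len(Y_true[0])
    wrong = sum(t != p for yt, yp in zip(Y_true, Y_pred) for t, p in zip(yt, yp))
    return wrong / n_cells


def subset_accuracy(Y_true, Y_pred):
    """Fraction of documents whose full label set is predicted exactly."""
    return sum(yt == yp for yt, yp in zip(Y_true, Y_pred)) / len(Y_true)


def macro_prf(Y_true, Y_pred):
    """Macro-averaged precision, recall, and F1 (unweighted mean over labels)."""
    n_labels = len(Y_true[0])
    precisions, recalls, f1s = [], [], []
    for j in range(n_labels):
        tp = sum(yt[j] == 1 and yp[j] == 1 for yt, yp in zip(Y_true, Y_pred))
        fp = sum(yt[j] == 0 and yp[j] == 1 for yt, yp in zip(Y_true, Y_pred))
        fn = sum(yt[j] == 1 and yp[j] == 0 for yt, yp in zip(Y_true, Y_pred))
        p = tp / (tp + fp) if tp + fp else 0.0  # undefined precision counted as 0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return (sum(precisions) / n_labels, sum(recalls) / n_labels, sum(f1s) / n_labels)


# Toy data (hypothetical): 3 documents, 3 knowledge types.
Y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
print(hamming_loss(Y_true, Y_pred), subset_accuracy(Y_true, Y_pred))
print(macro_prf(Y_true, Y_pred))
```

Because macro F1 is averaged per label in this sketch, it need not equal the harmonic mean of macro precision and macro recall. Subset accuracy is the strictest of these metrics, since a single wrong label makes the whole document count as an error.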

| Knowledge Type | MF1 | MF2 | RAND | k-NN | SVM | RNNCC | RNNCCotf | RNNSO | RNNSOapi |
|---|---|---|---|---|---|---|---|---|---|
| Functionality | 0.89 | 0.89 | 0.92 | 0.85 | 0.94 | 0.90 | 0.89 | 0.95 | 0.94 |
| Concept | 0.29 | 0.28 | 0.31 | 0.26 | 0.64 | 0.40 | 0.33 | 0.49 | 0.41 |
| Directive | 0.41 | 0.41 | 0.49 | 0.42 | 0.71 | 0.49 | 0.44 | 0.55 | 0.63 |
| Purpose | 0.28 | 0.28 | 0.25 | 0.30 | 0.13 | 0.46 | 0.40 | 0.51 | 0.39 |
| Quality | 0.17 | 0.17 | 0.19 | 0.17 | 0.27 | 0.20 | 0.17 | 0.20 | 0.32 |
| Control | 0.27 | 0.27 | 0.32 | 0.24 | 0.33 | 0.43 | 0.46 | 0.39 | 0.35 |
| Structure | 0.24 | 0.24 | 0.24 | 0.32 | 0.11 | 0.26 | 0.24 | 0.30 | 0.32 |
| Pattern | 0.22 | 0.22 | 0.24 | 0.29 | 0.61 | 0.50 | 0.30 | 0.41 | 0.43 |
| Example | 0.36 | 0.36 | 0.38 | 0.43 | 0.44 | 0.48 | 0.49 | 0.51 | 0.48 |
| Environment | 0.16 | 0.16 | 0.17 | 0.16 | 0.37 | 0.15 | 0.15 | 0.18 | 0.17 |
| Reference | 0.12 | 0.12 | 0.17 | 0.11 | 0.22 | 0.16 | 0.19 | 0.24 | 0.25 |
| Non-information | 0.23 | 0.23 | 0.24 | 0.27 | 0.61 | 0.30 | 0.39 | 0.30 | 0.28 |

| Metric | MF1 | MF2 | ML-kNN | OvRSVM | RNNCC | RNNCCotf | RNNSO | RNNSOapi |
|---|---|---|---|---|---|---|---|---|
| Hamming Loss | 0.23 | 0.25 | 0.28 | 0.35 | 0.27 | 0.30 | 0.26 | 0.27 |
| Subset Accuracy | 0.05 | 0.05 | 0.02 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 |
| MacroPrecision | 0.07 | 0.10 | 0.33 | 0.40 | 0.36 | 0.31 | 0.31 | 0.31 |
| MacroRecall | 0.08 | 0.16 | 0.24 | 0.24 | 0.24 | 0.26 | 0.21 | 0.26 |
| MacroF1 | 0.07 | 0.13 | 0.28 | 0.30 | 0.29 | 0.28 | 0.25 | 0.28 |
| MacroAUC | 0.50 | 0.50 | 0.53 | 0.54 | 0.60 | 0.57 | 0.62 | 0.64 |
On Using Machine Learning to Identify Knowledge in API Reference Documentation
Davide Fucci, Alireza Mollaalizadehbahnemiri, Walid Maalej
University of Hamburg, Hamburg, Germany
Abstract
Using API reference documentation like JavaDoc is an integral part of software development. Previous research introduced a grounded taxonomy that organizes API documentation knowledge in 12 types, including knowledge about the Functionality, Structure, and Quality of an API. We study how well modern text classification approaches can automatically identify documentation containing specific knowledge types. We compared conventional machine learning (k-NN and SVM) and deep learning approaches trained on manually annotated Java and .NET API documentation (n = 5,574). When classifying the knowledge types individually (i.e., multiple binary classifiers), the best AUPRC was up to 87%. The deep learning and SVM classifiers seem complementary. For four knowledge types (Concept, Control, Pattern, and Non-Information), SVM clearly outperforms deep learning, which, on the other hand, is more accurate for identifying the remaining types. When considering multiple knowledge types at once (i.e., multi-label classification), deep learning outperforms naïve baselines and traditional machine learning, achieving a MacroAUC up to 79%. We also compared classifiers using embeddings pre-trained on generic text corpora and StackOverflow but did not observe significant improvements. Finally, to assess the generalizability of the classifiers, we re-tested them on a different, unseen Python documentation dataset. Classifiers for Functionality, Concept, Purpose, Pattern, and Directive seem to generalize from Java and .NET to Python documentation. The accuracy related to the remaining types seems API-specific. We discuss our results and how they inform the development of tools for supporting developers sharing and accessing API knowledge. Published article: https://doi.org/10.1145/3338906.3338943
1 Introduction
Software developers reuse software libraries and frameworks through Application Programming Interfaces (APIs). They often rely on reference documentation to identify which API elements are relevant for the task at hand and how the API can be instantiated, configured, and combined [1]. Compared to other knowledge sources, such as tutorials and Q&A portals, reference documentation like JavaDoc and PyDoc is considered the official API technical documentation. It provides detailed and fundamental information about API elements, components, operations, and structures [2, 3].
As API documentation can be thousands of pages long [4, 5], accessing relevant knowledge can be tedious and time-consuming [1]. Moreover, the information necessary to accomplish a task can be scattered across the documentation pages of multiple elements, such as classes, methods, and properties. Thus, developers try to use other sources to fulfill their information needs. For example, although the Java Development Kit (JDK) API documentation contains more than 7,000 pages, as of early 2019, there are more than 3 million StackOverflow posts tagged as java.
Over the last decade, software engineering researchers studied what information developers need when consulting API documentation [3, 6, 7]. One line of research focuses on automatically matching information needs with the types of knowledge available in the documentation. Maalej and Robillard [3] took a first step in this direction by developing an empirically validated taxonomy of 12 knowledge types found within API reference documentation. A single documentation page can include several knowledge types (Figure 1). Functionality and Directive are particular types of knowledge needed to accomplish a development task, whereas the Non-information type contains only uninformative boilerplate text [3]. Maalej and Robillard argue that such knowledge categorization allows for a) understanding and improving the documentation quality and b) satisfying developers’ information needs [3].
The research community has shown interest in studying specific knowledge types contained in API reference documentation. For example, Monperrus et al. [8] and Seid et al. [9] studied Directive to prevent the violation of API usage constraints. Robillard and Chhetri [5] filtered Non-information when recommending APIs to developers. However, these automated approaches are based either on linguistic feature engineering [5] or on syntactic patterns [8].
This work investigates how well modern text classification approaches can automatically identify the knowledge types suggested by Maalej and Robillard in API documentation. Based on a dataset of 5,574 labelled Java and .NET API documents, we trained, tested, and compared conventional machine learning approaches, i.e., k-Nearest Neighbors (k-NN) and Support Vector Machines (SVM), as well as deep learning approaches, i.e., a recurrent neural network (RNN) with a Long Short-Term Memory (LSTM) layer. The RNN learns features from a semantic representation of general purpose text (i.e., embeddings). Hence, we studied how our results are impacted by training the network using software development-specific corpora from StackOverflow as opposed to a general purpose one. Finally, we studied the generalizability of the classifiers to an unseen dataset obtained from the Python standard library.
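To make the idea of a conventional binary classifier for a single knowledge type concrete, here is a minimal k-NN sketch in pure Python. It is an illustration only: the bag-of-words features, cosine similarity, the value of k, and the toy training sentences are all assumptions, and the paper's actual k-NN features and hyperparameters are not specified in this section.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: lower-cased whitespace tokens with raw counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=3):
    """Binary k-NN: train is a list of (text, label) pairs with label 0 or 1.

    Returns (predicted_label, score), where score is the share of positive
    labels among the k most similar training documents.
    """
    query = bow(text)
    neighbours = sorted(train, key=lambda ex: cosine(bow(ex[0]), query), reverse=True)[:k]
    score = sum(label for _, label in neighbours) / len(neighbours)
    return (1 if score >= 0.5 else 0), score


# Hypothetical toy data: 1 = contains a Directive, 0 = does not.
train = [
    ("callers must not pass null to this method", 1),
    ("do not call this method after close", 1),
    ("you must call init before use", 1),
    ("returns the number of elements in the list", 0),
    ("creates a new empty list", 0),
    ("returns the length of the string", 0),
]
print(knn_predict(train, "this method must not be called twice"))
print(knn_predict(train, "returns the size of the list"))
```

Training one such classifier per knowledge type gives the "multiple binary classifiers" setting. The returned score can be used as the ranking signal when computing threshold-free metrics such as AUPRC.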
This paper makes three contributions. First, we present a detailed classification benchmark for API documentation. The settings include different machine learning approaches and configurations, different word embeddings for the RNN, different datasets for different APIs, as well as various evaluation metrics. Researchers and tool vendors can use the benchmark, for example, to select and optimize a specific classifier for a specific configuration of API and knowledge types. Second, as we share the code and data of this study (https://zenodo.org/badge/latestdoi/194706952), several top-performing classifiers (e.g., AUPRC 80%) already have practical relevance. Third, our findings and discussion of related work provide insights to researchers, tool vendors, and practitioners on how machine learning can help better organize, access, and share knowledge about APIs.
The rest of the paper is organized as follows. Section 2 describes our research settings, Section 3 presents the configurations of the classifiers, and Section 4 reports their performance. We discuss related work in Section 5 and the implications of our results in Section 6. Finally, Section 7 concludes the paper.
2 Research Settings
This section introduces the research questions, method, and data.
2.1 Research Questions and Method
Maalej and Robillard [3] proposed an empirically validated taxonomy of 12 knowledge types based on grounded theory and systematic content analysis (17 experienced coders, 279 person-hours of effort). The table of knowledge types and their brief descriptions reports the identified knowledge types, which represent the basis for this work. Our primary goal is to study how well simple machine learning for text classification, without additional feature engineering or advanced natural language processing (NLP) techniques, can identify these knowledge types. That is, our classifiers label a document with one or more knowledge types.
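Labelling a document with one or more knowledge types is commonly represented as a binary indicator vector with one position per type. The sketch below shows this encoding and how it yields the 0/1 targets for a single per-type binary classifier. The helper names and the example label sets are hypothetical, and the ordering of the types simply follows the taxonomy table.

```python
KNOWLEDGE_TYPES = [
    "Functionality", "Concept", "Directive", "Purpose", "Quality", "Control",
    "Structure", "Pattern", "Example", "Environment", "Reference",
    "Non-information",
]


def to_indicator(labels):
    """Encode a set of knowledge types as a 12-dimensional 0/1 vector."""
    unknown = set(labels) - set(KNOWLEDGE_TYPES)
    if unknown:
        raise ValueError(f"unknown knowledge types: {sorted(unknown)}")
    return [1 if kt in labels else 0 for kt in KNOWLEDGE_TYPES]


def binary_targets(label_sets, knowledge_type):
    """0/1 targets for one binary classifier (one knowledge type)."""
    j = KNOWLEDGE_TYPES.index(knowledge_type)
    return [to_indicator(labels)[j] for labels in label_sets]


# Hypothetical annotations for two documents.
docs = [{"Functionality"}, {"Directive", "Example"}]
print(to_indicator(docs[1]))
print(binary_targets(docs, "Directive"))
```

Stacking the indicator vectors of all documents gives the label matrix used in multi-label evaluation, while each column on its own defines one of the individual binary classification tasks.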
References
- [1] M. P. Robillard and R. DeLine, “A field study of API learning obstacles,” Empirical Software Engineering, vol. 16, no. 6, pp. 703–732, 2010.
- [2] U. Dekel and J. D. Herbsleb, “Improving api documentation usability with knowledge pushing,” in Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 2009, pp. 320–330.
- [3] W. Maalej and M. P. Robillard, “Patterns of Knowledge in API Reference Documentation,” IEEE Trans. Softw. Eng., vol. 39, no. 9, pp. 1264–1282, 2013.
- [4] G. Petrosyan, M. P. Robillard, and R. De Mori, “Discovering information explaining api types using text classification,” in Proceedings of the 37th International Conference on Software Engineering-Volume 1. IEEE Press, 2015, pp. 869–879.
- [5] M. P. Robillard and Y. B. Chhetri, “Recommending reference API documentation,” Empirical Software Engineering, vol. 20, no. 6, pp. 1558–1586, Jul. 2014.
- [6] J. Stylos, B. Graf, D. K. Busse, C. Ziegler, R. Ehret, and J. Karstens, “A case study of api redesign for improved usability,” in Visual Languages and Human-Centric Computing, 2008. VL/HCC 2008. IEEE Symposium on. IEEE, 2008, pp. 189–192.
- [7] J. Stylos and B. A. Myers, “The implications of method placement on api learnability,” in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering. ACM, 2008, pp. 105–112.
- [8] M. Monperrus, M. Eichberg, E. Tekes, and M. Mezini, “What should developers be aware of? An empirical study on the directives of API documentation,” Empirical Software Engineering, vol. 17, no. 6, pp. 703–737, 2011.
