Exploring Software Reusability Metrics with Q&A Forum Data
Matthew T. Patrick

TL;DR
This paper presents LANLAN, a machine learning approach using word embeddings to analyze StackOverflow Q&A data, distinguishing problem reports from support requests to improve understanding of software reusability.
Contribution
Introduces LANLAN, a novel method leveraging word embeddings and machine learning to analyze unstructured Q&A forum data for insights into software reuse difficulties.
Findings
Achieved AUROC over 0.9 in identifying problem reports and support requests.
Demonstrated Q&A data can inform software reusability metrics.
LANLAN predicts future user difficulties effectively.
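For readers unfamiliar with the AUROC figure quoted above, here is a minimal sketch of how the metric can be computed. It is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The scores and labels are hypothetical illustration values, not data from the paper, and the function name `auroc` is our own.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores: 1 = problem report, 0 = support request.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auroc(scores, labels))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a result above 0.9 indicates that the two message types are ranked apart in the large majority of pairs.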
Abstract
Question and answer (Q&A) forums contain valuable information regarding software reuse, but they can be challenging to analyse due to their unstructured free text. Here we introduce a new approach (LANLAN), using word embeddings and machine learning, to harness information available in StackOverflow. Specifically, we consider two different kinds of user communication describing difficulties encountered in software reuse: 'problem reports' point to potential defects, while 'support requests' ask for clarification on software usage. Word embeddings were trained on 1.6 billion tokens from StackOverflow and applied to identify which Q&A forum messages (from two large open source projects: Eclipse and Bioconductor) correspond to problem reports or support requests. LANLAN achieved an area under the receiver operating characteristic curve (AUROC) of over 0.9; it can be used to explore the relationship…
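The general idea of classifying messages through word embeddings can be sketched as follows. This is an illustrative toy only, not LANLAN itself: the abstract does not specify the classifier, so a simple nearest-centroid rule is swapped in, and the three-dimensional vectors, vocabulary, and example messages are all hypothetical (real embeddings would be learned from a large corpus).

```python
import math

# Toy word vectors with made-up values, standing in for trained embeddings.
EMBEDDINGS = {
    "crash": [0.9, 0.1, 0.0],
    "exception": [0.8, 0.2, 0.1],
    "bug": [0.9, 0.0, 0.1],
    "how": [0.1, 0.9, 0.0],
    "configure": [0.0, 0.8, 0.2],
    "tutorial": [0.1, 0.9, 0.1],
}
DIM = 3


def embed(message):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [EMBEDDINGS[w] for w in message.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def centroid(messages):
    """Mean embedding of a set of labelled example messages."""
    vecs = [embed(m) for m in messages]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# Hypothetical labelled examples for the two message types.
CENTROIDS = {
    "problem report": centroid(["crash exception", "bug crash"]),
    "support request": centroid(["how configure", "tutorial how"]),
}


def classify(message):
    """Return the label whose centroid is most similar to the message."""
    v = embed(message)
    return max(CENTROIDS, key=lambda label: cosine(v, CENTROIDS[label]))


print(classify("exception bug on startup"))
print(classify("how to configure the plugin"))
```

Averaging word vectors discards word order, which keeps the sketch short; a production system would likely use a trained classifier on top of the message embeddings rather than a centroid rule.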
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Open Source Software Innovations
