Modeling Hierarchical Usage Context for Software Exceptions based on Interaction Data
Hui Chen, Kostadin Damevski, David Shepherd, Nicholas A. Kraft

TL;DR
This paper introduces a probabilistic hierarchical model that combines user interaction traces and software fault reports to better understand and categorize software exceptions, aiding developers in diagnosing issues.
Contribution
It presents a novel unsupervised Bayesian non-parametric model that hierarchically models interaction and fault data for improved exception analysis.
Findings
Model effectively captures co-occurring commands and exceptions.
Hierarchical topic structure aids in categorizing exceptions.
Application to large-scale data demonstrates practical utility.
H. Chen
Department of Computer and Information Science, Brooklyn College of the City University of New York, Brooklyn, NY 11210, U.S.A.
Email: [email protected]
K. Damevski
Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, U.S.A.
Email: [email protected]
D.C. Shepherd
ABB Corporate Research, Raleigh, NC 27606, U.S.A.
Email: [email protected]
N.A. Kraft
ABB Corporate Research, Raleigh, NC 27606, U.S.A.
Email: [email protected]
Abstract
Traces of user interactions with a software system, captured in production, are commonly used as an input source for user experience testing. In this paper, we present an alternative use, introducing a novel approach of modeling user interaction traces enriched with another type of data gathered in production - software fault reports consisting of software exceptions and stack traces. The model described in this paper aims to improve developers’ comprehension of the circumstances surrounding a specific software exception and can highlight specific user behaviors that lead to a high frequency of software faults.
Modeling the combination of interaction traces and software crash reports to form an interpretable and useful model is challenging due to the complexity and variance in the combined data source. Therefore, we propose a probabilistic unsupervised learning approach, adapting the Nested Hierarchical Dirichlet Process, which is a Bayesian non-parametric hierarchical topic model originally applied to natural language data. This model infers a tree of topics, each of which describes a set of commonly co-occurring commands and exceptions. The topic tree can be interpreted hierarchically to aid in categorizing the numerous types of exceptions and interactions. We apply the proposed approach to large-scale datasets collected from the ABB RobotStudio software application, and evaluate it both numerically and with a small survey of the RobotStudio developers.
Keywords:
Stack Trace, Crash Report, Software Exception, Software Interaction Trace, Hierarchical Topic Model
1 Introduction
Continuous monitoring of deployed software usage is now a standard approach in industry. Developers leverage usage data to discover and correct faults, performance bottlenecks, or inefficient user interface design. This practice has led to a debugging methodology called “debugging in the large”, a postmortem analysis of large amounts of usage data to recognize patterns of bugs Han:2012:PDL:2337223.2337241 ; glerum2009debugging . For instance, Arnold et al. use application stack traces to group processes exhibiting similar behavior into “process equivalence classes”, and identify what differentiates these classes with the aim of discovering the root cause of the bugs associated with the stack traces arnold2007stack . Han et al. cluster stack traces and recognize patterns of stack traces to discover impactful performance bugs Han:2012:PDL:2337223.2337241 .
Software-as-a-service applications often gather monitoring data at the service host, while user-installed client software collects relevant traces (or logs) on users’ machines and periodically transfers them to a server. The granularity and format of the collected data (e.g., whether the data is in raw log form or a set of derived metrics) depend on the specific application and deployment. Two types of data commonly collected via monitoring are software exceptions, containing stack traces from software faults that occur in production, and interaction traces, containing details of user interactions with the software’s interface.
By utilizing datasets that contain both of these types of data, we can provide a novel perspective on interpreting frequently occurring stack traces resulting from software exceptions by modeling them in concert with the user interactions with which they co-occur. Our approach probabilistically represents stack traces and their interaction context for the purpose of increasing developer understanding of specific software faults and the contexts in which they appear. Over time, this understanding can help developers to reproduce exceptions, to prioritize software crash reports based on their user impact, or to identify specific user behaviors that tend to trigger failures. Existing works attempt to empirically characterize software crash reports in application domains like operating systems, networking software, and open source software applications Yin:2010:TUB:1823844.1823849 ; Chou:2001:ESO:502059.502042 ; Li:2006:TCE:1181309.1181314 ; Lu:2008:LMC:1353535.1346323 , but none have used interaction traces containing stack traces for the purpose of fault characterization and debugging.
Interaction traces can be challenging to analyze. First, the logged interactions are typically low-level, corresponding to most mouse clicks and key presses available in the software application, and therefore the raw number of interactions in these traces can be large — containing millions of messages from different users. Second, for complex software applications, there are often multiple reasonable interaction paths to accomplish a specific high-level task while interaction traces that lead to different tasks can share shorter but common interaction paths. To address these two challenges of scale and of uncertainty in interpreting interaction traces, we posit that probabilistic dimension reduction techniques that can extract frequent patterns from the low-level interaction data are the right choice to analyze interaction traces.
Topic models are such a dimensionality reduction technique with the capacity to discover complex latent thematic structures. Typically applied to large textual document collections, such models can naturally capture the uncertainty in software interaction data using probabilistic assumptions; however, in cases where the interaction traces are particularly complex, e.g., in complex software applications such as IDEs or CAD tools, applying typical topic models may still result in a large topic space that is difficult to interpret. The special class of hierarchical topic models encodes a tree of related topics, enabling further reduction in complexity and dimensionality of the original interaction data and improving the interpretability of the model. We apply a hierarchical topic modeling technique, called the Nested Hierarchical Dirichlet Process (NHDP) 6802355 to combine interaction traces and stack traces gathered from a complex software application into a single, compact representation. The NHDP discovers a hierarchical structure of usage events that has the following characteristics:
- provides an interpretable summary of the user interactions that commonly co-occur with specific stack traces;
- allows for differentiating the strength of the relationship between specific interaction trace messages and a stack trace; and
- enables locating specific interactions that have co-occurred with numerous runtime errors.
In addition, as a Bayesian non-parametric modeling technique, NHDP has an additional advantage. It allows the model to grow structurally as it observes more data. Specifically, instead of imposing a fixed set of topics or hypotheses about the relationship of the topics, the model grows its hierarchy to fit the data, i.e., to “let the data speak” Blei:2010:NCR:1667053.1667056 . This is beneficial in modeling the datasets of interest since users’ interaction with software changes as the software does, e.g., by adding new features or removing (or introducing) new bugs.
The main contributions of this paper are as follows:
- We apply a hierarchical topic model to a large collection of interaction and stack trace data produced by ABB RobotStudio, a popular robot programming platform developed at ABB Inc, and examine how effectively it extracts latent thematic structures from the dataset and how well the structure depicts a context for exceptions occurring during the production use of RobotStudio.
- We are the first to propose the idea of grouping users’ IDE interaction traces with stack traces hierarchically and probabilistically into “clusters”. These “clusters” provide user interaction contexts of stack traces. Since a stack trace may be the result of multiple different interaction contexts, this approach associates a stack trace with its contexts probabilistically.
We organize the remainder of this paper as follows. Section 2 introduces the types of interaction and stack trace data we use and how we prepare these data sources for topic modeling. We describe the hierarchical topic modeling technique and its application to software interaction and crash data in Section 3. We apply the modeling technique to the large RobotStudio dataset and provide an evaluation in Section 4. Our work is not without threats to its validity, which we discuss in Section 5. In Section 6, we describe relevant related research and conclude this paper in Section 7.
2 Background
Interaction data gathered from complex software applications, such as IDEs (the Eclipse UDC dataset is a well-known source of this type of data in the software engineering community; available at http://archive.eclipse.org/projects/usagedata/), typically consists of a large vocabulary of messages, ordered in a time series. The data is typically collected exhaustively, in order to capture user actions in an interpretable, logical sequence. As users complete certain actions much more often than others, the occurrence of interaction messages follows a skewed distribution where some messages appear often, while most occur infrequently. Some of the messages are direct results of user actions (i.e., commands), while the others may reflect the state of the application (i.e., events), such as the completion of a background task like a project build. Consider the following snippet of an interaction trace, gathered in Visual Studio, part of the Blaze dataset Snipes:2014:EGD:2591062.2591171 ; Damevski_Mining_2017 :
2014-02-06 17:12:12 Debug.Start
2014-02-06 17:14:14 Build.BuildBegin
2014-02-06 17:14:16 Build.BuildDone
2014-02-06 17:14:50 View.OnChangeCaretLine
2014-02-06 17:14:50 Debug.Debug Break Mode
2014-02-06 17:15:02 Debug.EnableBreakpoint
2014-02-06 17:15:06 Debug.EnableBreakpoint
2014-02-06 17:15:10 Debug.Start
2014-02-06 17:15:10 Debug.Debug Run Mode
The developer who generated the above interaction trace snippet is starting the interactive debugger, as shown by the Debug.Start command. This triggers an automatic build in Visual Studio, shown by the Build.BuildBegin and Build.BuildDone messages, the exact same log messages that appear when the user explicitly requests a build. After the debugger stops at a breakpoint (Debug.Debug Break Mode), this developer enables two previously disabled breakpoints (Debug.EnableBreakpoint) and restarts (or resumes) debugging (Debug.Start and Debug.Debug Run Mode).
We leverage a probabilistic approach where we model each extracted high-level behavior as a probability distribution of interaction messages. This type of model is able to capture the noisy nature of interaction data soh_noises_2015 , which stems from the fact that 1) numerous paths that represent a specific high-level behavior exist (e.g., using ToggleBreakpoint versus EnableBreakpoint has the same effect) and 2) unrelated events may occur in the midst of a set of interactions (e.g., Build.BuildDone can occur at varying intervals after Build.BuildBegin, interspersed with other messages).
One particular application domain where probabilistic models have been effective for extracting high-level contexts, or topics, is natural language processing. In natural language texts words are the most basic unit of the discrete data and documents can be sets of words (i.e., a “bag of words” assumption). We can draw an analogy from the characteristics of interaction traces to natural language texts, i.e., interaction traces exhibit naming relations such as synonymy and polysemy similar to those in natural language texts. A trace often contains multiple different messages that share meaning in a specific behavioral context, e.g., both the ToggleBreakpoint and EnableBreakpoint events have the same meaning in the same context. This is similar to the notion of synonymy in natural languages, where different words can have the same meaning in a given context. Similarly, IDE commands carry a different meaning depending on the task that the developer is performing, e.g., an error in building the project after pulling code from the repository has a different meaning than encountering a build error after editing the code base. This characteristic is akin to polysemy in natural language, where one word can have different meanings based on its context.
Figure 1 shows an example of two IDE traces containing both interactions and stack traces from the ABB RobotStudio IDE. Both of these traces correspond to a user writing a program in a programming language called RAPID in this environment’s editor, and performing common actions like cutting-and-pasting and cursor movement (i.e., EditCut, EditPaste, and ProgramSetCursor). In both trace excerpts the users encountered the identical exception, RobApiException [...] RAPID symbol was not found, as identified by its type and message. While corresponding to the same high-level user behavior, the sequence and constituent messages occurring in the two interaction traces are slightly different. The modeling approach described here is able to capture the common interaction context of RobApiException, forming high-level user behaviors that we represent as a probability distribution over interaction messages and stack traces, shown in the right part of Figure 1. The model is able to overcome the slightly different composition and order in the two interaction traces, extracting their commonalities, and can help better characterize and understand the context of the shown exception’s stack trace.
The above motivates us to seek an algorithm that not only finds useful patterns of user behavior but also learns to organize these patterns in a hierarchy, in which more generic or abstract patterns are near the root and more concrete patterns are near the leaves. This hierarchy would allow us to explore stack traces and associated user interactions from the generic to the specific, much as we do in our daily lives: when we go to a grocery store, we begin with a particular section, then move to a specific aisle, and finally locate a particular product.
2.1 Topic Models for Interaction Data
Given a collection of natural language documents, topic modeling allows one to discover latent thematic structures in the document collection (commonly called a corpus) blei2003latent . A document in the corpus is an unordered set of words (i.e., “a bag of words”). The vocabulary of the corpus, denoted as $V$, consists of the unique words in the corpus. A topic is a discrete probability distribution over the vocabulary words. A collection of topics describes the extracted thematic structures in the corpus. For instance, given the vocabulary of a corpus, denoted as $V = \{w_1, w_2, \ldots, w_N\}$, a topic is a discrete probability distribution represented by its probability mass function, $p(w_i)$, where $\sum_{i=1}^{N} p(w_i) = 1$, $w_i \in V$. Topic models provide means to express the thematic structures in a document and a document collection, i.e., using topics and the relationship among the topics. For instance, in Latent Dirichlet Allocation (LDA) blei2003latent , a popular flat (non-hierarchical) topic model, the thematic structures in the document collection include the proportions of each topic exhibited in the collection or in a specific document in the collection.
Topic models are readily applied to other types of data because the models do not rely on any natural language specific information or assumptions, e.g., a language grammar. Examples of data types other than textual data for which topic modeling has found success include image collections, genetic information, and social network data 4408965 ; Pritchard945 ; Wang:2012:TEO:2339530.2339552 .
In this paper, we apply topic models to interaction traces with embedded stack traces. We begin by dividing an interaction trace into segments (or windows). We treat each segment as a “document” and each command, event, and stack trace as a word. When we examine a small segment of an interaction trace, we find that a segment usually consists of highly regular and repetitive patterns. This is likely the result of the following observation of user behavior. Within a small period of time, a user is likely focusing on a specific task and interacting with a small subset of the development environment, resulting in segments with a small number of distinct interaction messages. In addition, interaction traces exhibit two naming relations, namely synonymy and polysemy, that also exist in natural texts. The former means that a user can accomplish a task via different types of commands, and the latter that a user can use one command to complete multiple types of tasks Damevski_Mining_2017 . We posit that these relationships between the interaction types within small units of IDE usage time mimic the “naturalness” of text hindle2012naturalness , which suggests that models used for analyzing natural language text can be applied to IDE interaction data. In this paper, interaction trace messages are the words, segments of interaction messages are the documents, and all of the observed segments form the corpus of the study. Note that we use the term “window” to represent a segment, as we use the moving window method described below to divide an interaction trace into segments.
Interaction traces consist of frequently occurring low-level messages corresponding to 1) user actions and commands (e.g., copying text into the clipboard, pasting text from the clipboard, building the project); and 2) events that occur asynchronously (e.g., completion of the build, stopping at a breakpoint when debugging). The sequential order between the messages is only relevant to some behaviors, but not to others. For instance, the event indicating the completion of the build may be important to the next set of actions the developer performs, or it may be occurring in the background without import.
In our model, following the “bag of words” assumption, we use a tight moving window of interaction messages generated by an individual developer, but ignore the message order within the window. This is a reasonable modeling assumption that captures coarse sequential order across windows while remaining resilient to small permutations in message order within a window. In addition, developer interaction traces often contain large time gaps, stemming from breaks in the individual developer’s work. To account for these, we force window breaks when the time between two consecutive messages exceeds a predefined threshold. An interaction window is a sequence of messages denoted as $d = (x_1, x_2, \ldots, x_n)$, where $x_i$ is the $i$-th message in the sequence. A corpus is a set of windows, denoted as $D = \{d_1, d_2, \ldots, d_{|D|}\}$, where each $d_j$ is an interaction window.
Software exceptions and stack traces, which report a software fault that may or may not be fatal and cause the software to crash, commonly contain a time stamp and some type of user/machine identifier that ties them to interactions from the same user. We use a dataset that interleaves the interactions with the stack traces. Because we use a window-based modeling technique, minor timing issues in relating interaction and software crash data become unimportant, as long as we tie the stack trace of the crash to the relevant window of interaction messages. Assuming this holds, we treat the stack trace as just another message in the interaction log, i.e., the “vocabulary” becomes $V = \{m_1, \ldots, m_P\} \cup \{s_1, \ldots, s_Q\}$, where each $m_i$ is an interaction message and each $s_j$ is a stack trace. Following the “bag of words” assumption, we convert a document $d$ to the term-frequency form, i.e., $d = (f_1, f_2, \ldots, f_{|V|})$, where $f_i$ is the frequency of word $w_i$, either an interaction message or a stack trace in vocabulary $V$.
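The term-frequency conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names `build_vocabulary` and `to_term_frequency`, and the `STACK:` prefix used to mark a stack trace token, are assumptions made for this example only.

```python
from collections import Counter

def build_vocabulary(windows):
    """Assign a fixed index to each distinct message or stack-trace token."""
    vocab = {}
    for window in windows:
        for token in window:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_term_frequency(window, vocab):
    """Return the bag-of-words count vector for one window (order is ignored)."""
    counts = Counter(window)
    vector = [0] * len(vocab)
    for token, count in counts.items():
        vector[vocab[token]] = count
    return vector

# Two hypothetical windows; the stack trace is treated as one more "word".
windows = [
    ["EditCut", "EditPaste", "ProgramSetCursor", "EditPaste", "STACK:RobApiException"],
    ["ProgramSetCursor", "EditPaste", "STACK:RobApiException"],
]
vocab = build_vocabulary(windows)
vectors = [to_term_frequency(w, vocab) for w in windows]
print(vectors)
```

Because only counts are kept, two windows that contain the same messages in a different order map to the same vector, which is the resilience to small permutations that the windowing assumption relies on.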
3 Hierarchical Topic Modeling for Interaction Data
The scale of IDE interaction traces collected from the field can pose a challenge to analysis. The size of the traces can grow quickly and become large, for instance, the Eclipse Foundation Filtered UDC Data set consists of on the order of messages a day. Our approach is to divide the traces into message windows. To accomplish this, we first divide the traces into active sessions, using a prolonged time gap between messages as a delimiter, and further divide each session into one or more windows, each of which is a sequence of a fixed number of messages. Stack traces appear in the interaction log from time to time. We treat them as ordinary messages in the windows in the model. In the remainder of the paper, to be consistent with prior literature on topic models, we sometimes refer to a message window as a document, and messages within that window as words.
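The two-stage split described above (sessions delimited by long gaps, then fixed-size windows) might look like the following sketch. The 30-minute gap threshold, the window size of 3, and the function names are illustrative assumptions, and the last timestamp in the sample trace has been altered from the earlier snippet to create a gap; the paper's actual settings are not implied here.

```python
from datetime import datetime, timedelta

def parse_line(line):
    """Split 'YYYY-MM-DD HH:MM:SS Message text' into (timestamp, message)."""
    date, time, message = line.split(" ", 2)
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"), message

def split_into_windows(events, max_gap, window_size):
    """Cut events into sessions at long gaps, then into fixed-size windows."""
    sessions, current = [], []
    previous_time = None
    for timestamp, message in events:
        if previous_time is not None and timestamp - previous_time > max_gap:
            sessions.append(current)  # a long pause closes the active session
            current = []
        current.append(message)
        previous_time = timestamp
    if current:
        sessions.append(current)
    windows = []
    for session in sessions:
        for start in range(0, len(session), window_size):
            windows.append(session[start:start + window_size])
    return windows

trace = [
    "2014-02-06 17:12:12 Debug.Start",
    "2014-02-06 17:14:14 Build.BuildBegin",
    "2014-02-06 17:14:16 Build.BuildDone",
    "2014-02-06 17:14:50 View.OnChangeCaretLine",
    "2014-02-06 17:14:50 Debug.Debug Break Mode",
    "2014-02-06 18:30:00 Debug.EnableBreakpoint",  # hypothetical late message
]
events = [parse_line(line) for line in trace]
windows = split_into_windows(events, max_gap=timedelta(minutes=30), window_size=3)
for window in windows:
    print(window)
```

In this sketch the final window of a session may hold fewer messages than `window_size`; whether such short windows are kept or discarded is a design choice the text does not specify.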
Our windowing approach bears similarity to the data processing method commonly used for streaming text corpora, such as transcripts of streaming audio produced by automatic speech recognition, closed-captioning transcripts, and news wire feeds Blei:2001:TSA:383952.384021 . In these kinds of datasets, no explicit document breaks exist. A common approach is to divide the text into “documents” of a fixed length, as we do.
Most topic models, such as LDA, are flat topic models, in which the topics are independent and there is no structural relationship among the discovered topics. There are two challenges facing flat topic models. First, it is difficult or at least computationally expensive to discover the number of topics that we should model in a document collection. Second, since there is only a rudimentary relationship among topics, the meaning of the topics is difficult to interpret, in particular, when multiple topics may look alike based on their probability distributions.
We use a hierarchical topic model based on the Nested Hierarchical Dirichlet Process (NHDP), which, compared with a flat topic model, arranges the topics in a tree where more generic topics appear on upper levels of the tree while more specific topics appear at lower levels. We can achieve two objectives via a hierarchical topic model. First, the number of topics for a model can be easily expressed in the hierarchy, much like in hierarchical clustering, where we can determine the number of clusters by gradually increasing the depth and the branches of the tree of clusters. Second, the hierarchical structure of the topics, i.e., more generic topics appearing on upper levels of the tree and more specific topics on lower levels, can lead to improved human interpretability. As argued in Blei:2010:NCR:1667053.1667056 , “if interpretability is the goal, then there are strong reasons to prefer” a hierarchical topic model such as NHDP over a flat topic model such as LDA.
A number of hierarchical topic models exist in the literature. We choose the Nested Hierarchical Dirichlet Process (NHDP) 6802355 as it possesses some advantages over other popular hierarchical models, such as the Hierarchical Latent Dirichlet Allocation (HLDA) Blei:2010:NCR:1667053.1667056 . Compared with such models, NHDP results in a more compact hierarchy of topics (less branching) and produces less repetitive topics, as it allows a document to sample topics from a subtree that is not limited to a single path from the root of the tree. For the IDE interaction traces of interest, NHDP is an appropriate modeling tool because a stack trace can occur in different interaction contexts, and we can capture this variability effectively at higher (more general) levels of its hierarchy and differentiate the contexts at lower (more specific) levels of the hierarchy.
To understand how we may apply the NHDP topic model to analyze software interaction traces, we illustrate the model in Figure 2 as a directed graph, i.e., a Bayesian network. Since NHDP is a Bayesian model, it starts with a prior. In effect, the name of the NHDP topic model comes from that of its prior, i.e., the nested hierarchical Dirichlet process. The prior expresses the assumption that the thematic structure of the topics is in a tree-like structure and the assumption that a topic can have branches corresponding to more specific topics at lower level in the tree. We specify or tune these assumptions by giving a number of parameters of the prior as inputs to the model, commonly referred to as the hyperparameters of the model. We provide an overview of these hyperparameters and their relationship with other variables in the graph in Figure 2.
In NHDP, we consider words in documents to follow multinomial distributions, given a topic. Dirichlet distributions are a commonly used prior for multinomial distributions. It follows that we draw topics, a set of multinomial distributions over words, from given Dirichlet distributions. As shown in Figure 2, given a hyperparameter $\eta$ as the parameter for a Dirichlet distribution, we draw a potentially infinite number of topics, denoted as $\theta_k$, $k = 1, 2, \ldots$, in Figure 2. Since we choose a symmetric Dirichlet distribution for generating topic distributions for this work, hyperparameter $\eta$ is a positive scalar, and represents the concentration parameter of the Dirichlet distribution $\mathrm{Dir}(\eta)$. The smaller $\eta$ is, the more concentrated on fewer words we believe a topic to be.
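To make the effect of the symmetric Dirichlet concentration parameter concrete, the following sketch draws topics at three concentration values and prints the largest word probability in each. It uses the standard construction of a Dirichlet sample from normalized Gamma draws; the function name, the vocabulary size of 20, and the specific concentration values are assumptions chosen for illustration.

```python
import random

def sample_symmetric_dirichlet(concentration, size, rng):
    """Draw one point from a symmetric Dirichlet via normalized Gamma variates."""
    gammas = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)
vocabulary_size = 20  # hypothetical number of distinct messages and stack traces
for concentration in (0.05, 1.0, 50.0):
    topic = sample_symmetric_dirichlet(concentration, vocabulary_size, rng)
    # A large maximum means the topic puts most of its mass on a few words.
    print(f"concentration={concentration}: largest word probability={max(topic):.3f}")
```

Small concentration values tend to yield topics dominated by a handful of words, while large values tend to yield topics close to uniform over the vocabulary.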
A topic corresponds to a node in the global topic tree. We can either draw a global topic tree from a nested Chinese Restaurant Process as illustrated in Blei:2010:NCR:1667053.1667056 or construct it directly using a nested Stick Breaking Process as shown in 6802355 . Both of these methods yield an infinite set of Dirichlet processes, each corresponding to a node in the tree. A Dirichlet process, an infinitely decimated Dirichlet distribution, allows us to branch from a topic node to an infinite number of child topic nodes, which constitutes the mechanism to build the topic tree. A Dirichlet process is a distribution from which a draw is also a probability distribution. We denote drawing a probability distribution from a Dirichlet process as $G \sim \mathrm{DP}(\alpha, G_0)$, where concentration parameter $\alpha$ and base measure $G_0$ are two hyperparameters as shown in Figure 2. The probability distributions drawn from the Dirichlet process provide a parameter to associate a node in the topic tree to its corresponding topic ($\theta$). The concentration parameter $\alpha$, where $\alpha > 0$, represents our belief on how we should branch a topic node to topic nodes on a lower level. The greater the $\alpha$, the more branches we should expect when given a corpus.
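The branching behavior of a single Dirichlet process can be illustrated with a truncated stick-breaking construction, in which each child receives a fraction of the probability mass still unassigned. This is a generic sketch of stick-breaking for one node, not the nested construction used in the cited work; the parameter name `alpha`, the truncation level, and the 0.01 threshold are assumptions of this example.

```python
import random

def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking: each break takes a Beta(1, alpha) share of the rest."""
    weights, remaining = [], 1.0
    for _ in range(truncation):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining  # 'remaining' is the mass beyond the truncation

rng = random.Random(42)
for alpha in (0.5, 5.0):
    weights, leftover = stick_breaking_weights(alpha, truncation=50, rng=rng)
    visible = sum(1 for w in weights if w > 0.01)
    print(f"alpha={alpha}: {visible} children with probability above 0.01")
```

A larger concentration value spreads the mass over more children, which matches the intuition that a greater concentration parameter leads to more branches.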
When examining the relationship of the topics, we know that the topics depend on the manner in which we derive document trees in the model. A document tree is a copy of the global topic tree of the given corpus with the same topics on the nodes but with different branching probabilities. As discussed above, an important characteristic of NHDP is its prior, the nested hierarchical Dirichlet process, which leads to the mechanism by which we branch a topic node to a lower level. Each node in the global tree has a corresponding Dirichlet process. Let us denote the Dirichlet process at a node in the global tree as $G$; the corresponding node in the topologically identical document topic tree for document $d$ has a Dirichlet process $G^{(d)} \sim \mathrm{DP}(\beta, G)$, where the concentration parameter $\beta$ controls our belief on how a document branches in the corresponding document tree, i.e., hyperparameter $\beta$ controls how the branching probability mass is distributed among branches. The higher the $\beta$, the less concentrated the branching probability mass is, and in effect, the more branches we should expect from a corpus. For instance, if we expect a document in the corpus to branch to a small number of topics in the next level, while we expect these topics to differ among documents, we should begin with a large global concentration parameter $\alpha$ and a small $\beta$, because we effectively expect a large global tree but small document trees.
Furthermore, each word in a document has a topic since we sample words from a topic, i.e., a discrete probability distribution over words. For convenience, we refer to a topic by its index. Denote the index of the $n$-th word’s topic in document $d$ as $z_{d,n}$, as shown in Figure 2. We determine the topic for the word using a two-step approach. First, we choose a path from the root in the document tree based on the tree’s branching probabilities. Next, we select a topic along the path for the word based on a probability distribution: starting from the root and moving along the path, at each node we draw $U \sim \mathrm{Beta}(\gamma_1, \gamma_2)$, where $U$ is the probability that we remain on the node and $1 - U$ is the probability that we switch to the next node along the path. The two parameters $\gamma_1$ and $\gamma_2$ control the expected range of the level switching probabilities. The Beta distribution here is commonly used to express a probability distribution of probabilities.
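The two-step topic selection just described can be sketched on a small, finite document tree. Everything below is a simplified illustration under stated assumptions: the tree is truncated to five nodes, the branching probabilities are hand-picked, the names `gamma1` and `gamma2` for the Beta parameters are chosen for this example, and the last node on a path absorbs any remaining probability.

```python
import random

def choose_path(tree, rng):
    """Step 1: walk from the root, picking one child per level by branching probability."""
    path, node = [tree["topic"]], tree
    while node["children"]:
        children = node["children"]
        weights = [child["prob"] for child in children]
        node = rng.choices(children, weights=weights, k=1)[0]
        path.append(node["topic"])
    return path

def choose_topic(path, stay_prob, rng):
    """Step 2: at each node, stay with probability stay_prob[node], else move down."""
    for topic in path[:-1]:
        if rng.random() < stay_prob[topic]:
            return topic
    return path[-1]  # the last node on the path takes the remaining mass

def leaf(topic, prob):
    return {"topic": topic, "prob": prob, "children": []}

rng = random.Random(7)
# Hypothetical document tree: topic indices on nodes, branching probabilities on edges.
document_tree = {
    "topic": 0, "prob": 1.0,
    "children": [
        {"topic": 1, "prob": 0.7, "children": [leaf(3, 0.5), leaf(4, 0.5)]},
        leaf(2, 0.3),
    ],
}
gamma1, gamma2 = 1.0, 1.0  # assumed Beta parameters for the level-switching draw
stay_prob = {topic: rng.betavariate(gamma1, gamma2) for topic in range(5)}
word_topics = [
    choose_topic(choose_path(document_tree, rng), stay_prob, rng) for _ in range(10)
]
print(word_topics)
```

Because a word may stop at any node along its path, general topics near the root and specific topics near the leaves can both explain words in the same window.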
These hyperparameters have an impact on the learned NHDP model and inference of new documents. In Section 4, we evaluate how sensitive the learned NHDP model is to the hyperparameters. An insensitive model has stronger ability to correct inaccurate hyperparameter priors by learning what the data implies.
3.1 Learning the NHDP Model
To learn an NHDP model from a document corpus, we adopt the stochastic inference algorithm in 6802355. The algorithm has the following steps:
1. Scan the documents from the training corpus, and extract words to form a vocabulary of the training corpus. In this step, the vocabulary consists of IDE messages and stack traces. We treat a stack trace as a single word. Denote the vocabulary as that consists of unique words.
2. Index words in the vocabulary from [math] to , and convert each document to a term-frequency vector where the value at position is the frequency of the word indexed by in the document.
3. Randomly select a small subset of documents from the training corpus, denote the set of documents as . The random selection of documents will not stop until every word in the vocabulary appears at least once in the selected documents.
4. Repeatedly run the K-means clustering algorithm against to build a tree of clusters.
5. Initialize an NHDP tree for , call the initial NHDP topic tree as , and let .
6. Randomly select a subset of documents from the training corpus, denote the set of documents as .
7. Make adjustment to based on an inference algorithm against . The result is a topic tree .
8. Repeat steps 6 and 7 until converges.
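Steps 1 to 3 can be sketched in a few lines. The sketch below is illustrative only: it assumes documents are lists of word strings, uses zero-based word indices (the original index range is not shown here), and the stack-trace token `STACK:example` is a made-up placeholder for a stack trace treated as one word.

```python
import random
from collections import Counter

def build_vocabulary(documents):
    """Steps 1-2: collect the unique words and give each a zero-based index."""
    vocabulary = sorted({word for doc in documents for word in doc})
    return {word: index for index, word in enumerate(vocabulary)}

def to_term_frequency(doc, word_index):
    """Step 2: turn a document into a term-frequency vector."""
    vector = [0] * len(word_index)
    for word, count in Counter(doc).items():
        vector[word_index[word]] = count
    return vector

def sample_until_covered(documents, word_index, rng):
    """Step 3: pick documents at random until every vocabulary word has appeared."""
    order = list(range(len(documents)))
    rng.shuffle(order)
    seen, chosen = set(), []
    for i in order:
        chosen.append(i)
        seen.update(documents[i])
        if len(seen) == len(word_index):
            break
    return chosen

docs = [["RapidEditorShow", "Exception", "STACK:example", "RapidEditorShow"],
        ["StartingVirtualController", "RapidEditorShow"]]
index = build_vocabulary(docs)
print(index)
print([to_term_frequency(d, index) for d in docs])
print(sample_until_covered(docs, index, random.Random(0)))
```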
In steps 3 to 5, we provide the maximum height and the maximum number of nodes at each level of tree . The maximum height and number of nodes at each level should exceed those of the final tree. Following the assumption that words are interchangeable, we convert a document to the term-frequency form, i.e., a vector where each element is the frequency of the corresponding word appearing in the document. In Step 4, we use the K-means clustering algorithm to divide the documents into a number of clusters, and for each cluster, we estimate a topic distribution. These clusters and the topic distributions are the top level nodes in tree just beneath the root. We then repeat the process for each cluster, and each cluster is further divided into a number of subclusters. For each subcluster we estimate a topic distribution. This step is for computational efficiency. Given the number of clusters and the depth of the tree, the K-means algorithm builds a large tree quickly. This tree serves as the initial tree for the NHDP algorithm that learns the switching probabilities for different levels and the switching probabilities for different clusters at a level, which effectively shrinks the tree by learning the switching probabilities. Note that when applying the K-means algorithm above, we adopt the distance, i.e., given two documents represented as two vectors and , the distance of the two documents is .
Steps 6 to 8 perform randomized batch inference. Agrawal et al. demonstrate that topic modeling can suffer from “order effects”, i.e., a topic modeling algorithm yields different topics when we alter the order of the training data agrawal2018wrong. This randomized batch processing can reduce these order effects by averaging over different random orders of the training data set. Step 7 requires a specific inference algorithm. In Blei:2010:NCR:1667053.1667056 ; doi:10.1198/016214506000000302, Markov Chain Monte Carlo algorithms, specifically Gibbs samplers, are used. In this work, we used the variational inference algorithm in 6802355. Variational inference algorithms are typically shown to scale better to large data sets than Gibbs samplers do. Steps 6 to 8 can begin with an arbitrary tree; however, it is much more computationally efficient to initialize the inference algorithm with a tree that shares statistical traits with the target data.
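The recursive clustering used to initialize the tree can be sketched as below. This is a simplified illustration, not the algorithm of 6802355: it assumes an L1 distance (the norm used in the paper is not recoverable from the text here), updates centers with the cluster mean, and estimates a topic as the normalized word counts of a cluster. All function names are hypothetical.

```python
import random

def l1_distance(u, v):
    """Assumed distance between two term-frequency vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def kmeans(vectors, k, rng, iterations=10):
    """Plain K-means; returns the non-empty clusters as lists of vectors."""
    centers = rng.sample(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda j: l1_distance(v, centers[j]))
            clusters[nearest].append(v)
        for j, members in enumerate(clusters):
            if members:  # keep the previous center when a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return [c for c in clusters if c]

def topic_of(vectors):
    """Estimate a topic as the normalized word counts of a cluster."""
    totals = [sum(col) for col in zip(*vectors)]
    norm = sum(totals) or 1
    return [t / norm for t in totals]

def build_tree(vectors, k, depth, rng):
    """Cluster recursively to form an initial tree of bounded height and width."""
    node = {"topic": topic_of(vectors), "children": []}
    if depth > 0 and len(vectors) >= k:
        for cluster in kmeans(vectors, k, rng):
            node["children"].append(build_tree(cluster, k, depth - 1, rng))
    return node

vectors = [[5, 0], [4, 0], [0, 5], [0, 4]]
tree = build_tree(vectors, k=2, depth=1, rng=random.Random(0))
print(tree["topic"], [child["topic"] for child in tree["children"]])
```

Here `k` and `depth` play the role of the maximum number of nodes per level and the maximum height; the inference stage is then expected to prune this deliberately oversized tree.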
4 Evaluation
For evaluation, we use field interaction traces from ABB RobotStudio, a popular IDE intended for robotics development that supports both simulation and physical robot programming using a programming language called RAPID. RobotStudio as an IDE also runs robot application programs developed in the IDE by users. It is RobotStudio, rather than the robot applications, that collects the interaction traces. The RobotStudio interaction trace dataset we used represents users over a maximum of months of activity, or a total of user-work hours. In the interaction traces, there are unique messages, types of exceptions, sessions, and unique stack traces, resulting in documents of messages. Note that a single exception in RobotStudio is often triggered by numerous users of the IDE; as such, an exception corresponds to many unique stack traces and each unique stack trace has many copies. We chose the window size of messages because we empirically observed that it results in semantically interesting windows, which commonly represent a single activity by a developer Damevski_Predicting_2017.
The RobotStudio dataset consists of sequences of time-stamped messages, where each message corresponds to a RobotStudio command (e.g., RapidEditorShow) or an event representing application state (e.g., Exception and StartingVirtualController). Messages have additional attributes, such as the component that generates the command or the event, and the command or event type. RobotStudio records the stack traces directly into the interaction log, so the two distinct data types considered here are already combined into one single trace.
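To make the data layout concrete, the sketch below models a message and cuts one session into fixed-size windows that serve as documents. The field names, the time unit, the window size, and the choice of non-overlapping windows are assumptions made for illustration; the source does not specify them here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Message:
    timestamp: float  # assumed unit: seconds since the session started
    name: str         # e.g. "RapidEditorShow" or "Exception"
    kind: str         # "command" or "event"

def to_documents(session: List[Message], window_size: int) -> List[List[str]]:
    """Order a session by time and cut it into consecutive, non-overlapping windows."""
    ordered = sorted(session, key=lambda message: message.timestamp)
    names = [message.name for message in ordered]
    return [names[i:i + window_size] for i in range(0, len(names), window_size)]

session = [
    Message(2.0, "Exception", "event"),
    Message(0.0, "RapidEditorShow", "command"),
    Message(1.0, "StartingVirtualController", "event"),
]
print(to_documents(session, window_size=2))
```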
The evaluation plan is as follows. First, we conduct a “held-out” document evaluation, i.e., we divide the documents into two sets: a training dataset to learn the model, and a held-out dataset to test the model. The purpose of the held-out document evaluation is two-fold. We want to know whether the training data set is sufficient to produce a stable model and to assess whether the parameters used in the learning process are reasonable. Second, we conduct a user survey to assess the usefulness of the model in understanding and debugging software faults. Figure 3 illustrates the overall processing pipeline used for evaluation.
4.1 Held-out Document Evaluation
Unsupervised learning algorithms, like NHDP, are typically more challenging to evaluate, as there is no ground truth to compare to. Perplexity and predictive likelihood are two standard metrics in information retrieval evaluation that correspond to a model’s ability to infer an unseen document from the same corpus. These two are a single metric in two different representations since perplexity is, in effect, the inverse of the predictive power of the model. The worse the model is, the more perplexed it is by unseen data, resulting in greater values for the perplexity metric. Similarly, the better the model is, the more likely it is that the model can infer an unseen document. To further explain these two concepts, their relationship, and how we compute them, let us divide the dataset into two subsets: a training dataset that we consider as observed, and a held-out dataset that we consider as unseen. We denote the former as and the latter as . We consider has documents, and , and has documents, and . Given that we learn a model from the training dataset , we define the predictive power of the learned model as the following conditional probability, i.e., the probability of observing the unseen documents given the model learned from the observed documents,
[TABLE]
where we assume that held-out documents are independent to one another.
Since the probability in equation (4.1) varies with the size of the held-out dataset, , the probability is not comparable for held-out datasets of different sizes. To make it comparable among held-out datasets of different sizes, we take a geometric mean of the probability as follows,
[TABLE]
where is the sum of all word counts in document .
We call the predictive likelihood of the model on the unseen dataset . We can then define the predictive log likelihood as,
[TABLE]
and define the perplexity as the inverse of the predictive likelihood,
[TABLE]
which establishes the correspondence between perplexity and predictive log likelihood.
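For reference, the four quantities described above can be written out as follows. The notation is assumed for illustration and may differ from the paper's own symbols: $\mathcal{M}$ is the learned model, $d_1, \dots, d_n$ are the held-out documents, $N_i$ is the sum of all word counts in $d_i$, and $L$ is the predictive likelihood.

```latex
\begin{align}
p(d_1, \dots, d_n \mid \mathcal{M}) &= \prod_{i=1}^{n} p(d_i \mid \mathcal{M}) \\
L &= \Bigl( \prod_{i=1}^{n} p(d_i \mid \mathcal{M}) \Bigr)^{1 / \sum_{i=1}^{n} N_i} \\
\log L &= \frac{\sum_{i=1}^{n} \log p(d_i \mid \mathcal{M})}{\sum_{i=1}^{n} N_i} \\
\text{perplexity} &= \frac{1}{L} = \exp(-\log L)
\end{align}
```

The first line uses the independence of the held-out documents, the second is the per-word geometric mean, and the last line makes explicit that a lower perplexity corresponds to a higher predictive log likelihood.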
In the following, we describe the procedure to compute the perplexity and show the result. This evaluation method, inspired by earlier work in Wallach:2009:EMT:1553374.1553515 ; Rosen-Zvi:2004:AMA:1036843.1036902 , is frequently used to evaluate topic models, such as in 6802355 ; NIPS2014_5303 . The procedure below is adopted from 6802355 .
1. Form training and testing datasets. We divide interaction traces into a training dataset and a testing dataset based on a reasonable ratio , e.g., . To obtain the training dataset, randomly select documents from the documents of interaction traces. The remaining documents are in the testing dataset.
2. Form observed dataset and held-out dataset. Select a document partition ratio , e.g., . For each document in the testing dataset, and the appearances of words in the document, partition into two partitions. The first words go to the first partition, and the second words to the second partition. Consider the two partitions as two documents, and . Then all the form the held-out dataset and all the form the observed dataset, i.e., we obtain and in equation (4.1).
3. Train the model. Use NHDP on the training dataset, i.e., infer the global topic tree using the training dataset. The model is in equation (4.1).
4. Compute perplexity. Use the definition in equation (4.1).
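The four steps can be sketched end to end as below. To keep the example self-contained, an add-one smoothed unigram model stands in for NHDP, so the numbers it produces say nothing about the paper's results. The function names, the ratios, and the choice to treat the first partition of each test document as the observed part are assumptions.

```python
import math
import random
from collections import Counter

def split_corpus(documents, train_ratio, rng):
    """Step 1: randomly assign documents to the training and testing datasets."""
    order = list(range(len(documents)))
    rng.shuffle(order)
    cut = int(train_ratio * len(documents))
    return [documents[i] for i in order[:cut]], [documents[i] for i in order[cut:]]

def partition_document(doc, observed_ratio):
    """Step 2: split one test document into an observed part and a held-out part."""
    cut = int(observed_ratio * len(doc))
    return doc[:cut], doc[cut:]

def train_unigram(train_docs, vocabulary):
    """Step 3 stand-in: add-one smoothed word probabilities (NHDP would go here)."""
    counts = Counter(word for doc in train_docs for word in doc)
    total = sum(counts.values()) + len(vocabulary)
    return {word: (counts[word] + 1) / total for word in vocabulary}

def perplexity(model, held_out_docs):
    """Step 4: exponentiated negative per-word log likelihood of the held-out words."""
    log_prob = sum(math.log(model[word]) for doc in held_out_docs for word in doc)
    num_words = sum(len(doc) for doc in held_out_docs)
    return math.exp(-log_prob / num_words)

docs = [["a", "b", "a", "c"], ["a", "a", "b", "b"], ["c", "a", "b", "a"],
        ["b", "b", "a", "c"], ["a", "c", "c", "a"]]
train, test = split_corpus(docs, 0.8, random.Random(0))
held_out = [partition_document(doc, 0.5)[1] for doc in test]
model = train_unigram(train, vocabulary=["a", "b", "c"])
print(perplexity(model, held_out))
```

With a topic model, the observed part of each test document would additionally be used to infer that document's topic proportions before scoring its held-out part; the unigram stand-in ignores it.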
Figure 4 shows the perplexity obtained when we gradually increase the number of documents seen and then use the rest as the testing data. We take an approach inspired by k-fold cross-validation. For each training dataset size, we randomly select the training dataset from the collected dataset and then compute the perplexity. We illustrate the computed perplexities at each training dataset size in a plot with error bars in Figure 4. The figure shows that both the perplexity and its variation decrease as the training dataset size increases, indicative of the convergence of the algorithm and a stable model. In particular, when the documents seen are at of documents, we observe a significant drop of perplexity, and the magnitude of the drop is consistent with those in the topic modeling literature, such as NIPS2014_5303 ; 6802355 ; Blei:2010:NCR:1667053.1667056. This suggests that the obtained model has converged to a stable state and provides a stable representation of the underlying data. We can now use the model for the purpose of interpreting the context of software exceptions.
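The repeated random selection behind such an error-bar plot can be sketched as follows. The sketch is generic: `evaluate` is a hypothetical callback that trains on the first argument and returns the perplexity on the second, and the number of repeats is an assumption.

```python
import random
import statistics

def perplexity_curve(documents, sizes, repeats, evaluate, rng):
    """For each training size, repeat a random split and summarize the perplexities."""
    curve = {}
    for size in sizes:
        scores = []
        for _ in range(repeats):
            shuffled = documents[:]
            rng.shuffle(shuffled)
            scores.append(evaluate(shuffled[:size], shuffled[size:]))
        # Mean gives the plotted point; sample standard deviation gives the error bar.
        curve[size] = (statistics.mean(scores), statistics.stdev(scores))
    return curve

# Toy callback for demonstration only: pretends perplexity shrinks with training size.
toy_evaluate = lambda train, test: 100.0 / len(train)
print(perplexity_curve(list(range(20)), [5, 10], 3, toy_evaluate, random.Random(0)))
```

Note that `statistics.stdev` needs at least two scores, so `repeats` must be 2 or more.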
4.2 Sensitivity Analysis
Because NHDP is a Bayesian hierarchical model, we infer marginal and conditional probability distributions for the model parameters from the data; as such, the model does not overfit. Because it is a non-parametric model, we parametrize it with an infinite number of parameters; as such, the model does not underfit (gelman2014bayesian, page 101).
One challenge is that we specify the prior of a Bayesian non-parametric model by giving the values of the hyperparameters of the prior, and these values are sometimes difficult to choose. We ought to assess the effect of these values. A common method is sensitivity analysis, which is particularly important for Bayesian hierarchical models roos2015sensitivity. For sensitivity analysis, we examine how the hierarchy obtained varies with the hyperparameters in the prior. Their values control the base distribution in the NHDP process, and the switching probabilities between levels of the tree. For a document, we draw the topics at a node from a Dirichlet distribution, specifically, draw them from , a symmetric Dirichlet distribution controlled by the concentration parameter ; however, we need to choose which branch to visit to draw topics for its children, for which we must know hyperparameter . When we generate a document, we decide whether to go to the next level of the tree based on a Beta distribution, . We explain the effects of these parameters in Section 3.
We use a number of statistics to evaluate how sensitive the learned model is to the hyperparameters. These statistics include the number of topics at each level of the tree for each document and the number of words at each topic. Figure 5 shows the average number of topics per document at tree levels 1, 2, and 3 when we increase hyperparameter from to when we infer the model from a set of of randomly selected documents. The graph shows that the inferred model is insensitive to the hyperparameter .
Figure 6 shows the average number of topics per document at tree levels 1, 2, and 3 when we increase hyperparameter from to and hold . It shows that the model is somewhat sensitive to and ; however, the variation of the number of topics is mostly less than , which is not a major change, particularly for the average number of topics at levels and .
In summary, these sensitivity tests indicate that the inferred model is robust, as it tolerates uninformed selections of hyperparameters. The hyperparameters do have an impact on the learned tree structure, but only in a minor way. A specific caution is that one should choose and with more care than . Practically, one may compare the perplexities at different values of and , and select the pair with the lower perplexity.
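Selecting a hyperparameter pair by perplexity amounts to a small grid search. In the sketch below, `perplexity_of` is a hypothetical callback that trains a model with the given pair and returns its held-out perplexity; the grids and the toy callback are made up for illustration.

```python
from itertools import product

def select_hyperparameters(grid_first, grid_second, perplexity_of):
    """Return the pair from the grid whose model has the lowest perplexity."""
    return min(product(grid_first, grid_second),
               key=lambda pair: perplexity_of(*pair))

# Toy callback for demonstration only; a real one would train and evaluate a model.
toy_perplexity = lambda first, second: (first - 2) ** 2 + (second - 0.5) ** 2 + 10
print(select_hyperparameters([1, 2, 3], [0.1, 0.5, 1.0], toy_perplexity))
```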
4.3 Example RobotStudio Topic Hierarchy
The result of our approach is a topic hierarchy learned from the combined interaction and software crash dataset. The tree hierarchy communicates a succinct model of the observed interactions, where each topic represents a group of commonly co-occurring interactions and the hierarchy encodes a relationship between general or popular topics and ones that are more specific and rare.
One may explore the hierarchy either bottom-up or top-down to observe its structure, or begin with a specific event, such as an exception or stack trace, and move in both directions with the idea of understanding the context of user behavior that produces the exception. For instance, Figure 7 shows a topic hierarchy learned from the dataset centered on an exception. The hierarchy shows a parent topic and two of its child topics. Since the messages with dominant probabilities are about simulation, we interpret the parent topic to indicate that developers are starting, stopping, and stepping through a simulation using RobotStudio. The two child topics exhibit two sub-interactions when the user is doing the simulation. The first child, illustrated immediately below its parent, indicates that the user conducts a conveyor simulation. The second child indicates that the simulation includes a user action that leads the simulated robot to move to a different location, which is often accompanied by saving the project state, perhaps because it is prudent to save the project state before a path change. Thus, we may conclude that this topic hierarchy suggests that the user starts with a more generic activity, simulating a robot, and the simulation consists of multiple sub-interactions. It also shows that the exception indicated by the message RobApiException often occurs with the simulation for controlling a conveyor.
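Starting from a specific exception and moving in both directions can be sketched as a tree search. The nested-dictionary representation, the probability threshold, and every message name and probability below except RobApiException are hypothetical and chosen only to mirror the example in the text.

```python
def find_context(node, message, threshold, parent=None):
    """Depth-first search for the first topic in which `message` has probability >= threshold."""
    if node["topic"].get(message, 0.0) >= threshold:
        return {"parent": parent, "topic": node, "children": node["children"]}
    for child in node["children"]:
        hit = find_context(child, message, threshold, node)
        if hit is not None:
            return hit
    return None

# Made-up hierarchy: a simulation topic with two more specific child topics.
conveyor = {"topic": {"ConveyorSimulation": 0.5, "RobApiException": 0.3}, "children": []}
move = {"topic": {"MoveRobot": 0.5, "SaveProject": 0.3}, "children": []}
simulation = {"topic": {"SimulationStart": 0.4, "SimulationStop": 0.4},
              "children": [conveyor, move]}

context = find_context(simulation, "RobApiException", threshold=0.1)
print(sorted(context["parent"]["topic"]))  # the more general activity
print(sorted(context["topic"]["topic"]))   # messages co-occurring with the exception
```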
4.4 RobotStudio Developer Survey
In order to assess the interpretability and value of our technique, we conducted a survey of RobotStudio developers using the model we extracted from the user interaction dataset of this application. Note that they are the individuals who develop and maintain RobotStudio, and are not users who use RobotStudio in production. An important goal is to help these developers by means of the model built from data collected from users. The survey consisted of a sample of five random RobotStudio exceptions that we show to the developers one at a time together with their surrounding context hierarchy in the survey.
We sent the composed survey via e-mail to the entire RobotStudio development team. The team consists of 17 individuals, out of which we received 6 responses. All but one of the respondents had 3 or more years of experience on the RobotStudio team and all of them had worked as software developers for at least 3 years. Five out of six respondents were familiar with the RobotStudio interaction dataset, and had examined it in the past, and all of them believed that knowing which commands in the interaction log an exception co-occurs with could be helpful in debugging. Figure 7 displays two of the images shown in the survey, which depict an exception and its nearby surrounding command context hierarchy. Below, we highlight the salient conclusions from the study, together with the supporting evidence, including any additional relevant explanation extracted from open-ended questions in the survey.
The model was very useful for understanding and debugging some exceptions, but not useful for others. The survey showed a strong variance between the responses for the usefulness of specific parts of the model and specific exceptions. For instance, for RobApiException, listed in Figure 7, the respondents rated the usefulness of the usage context in understanding the exception an average of 7.83 (s = 1.52) on a scale of 1 (least useful) to 10 (most useful). This high rating can be contrasted to the usefulness rating received by the usage context of the remaining 4 exceptions: FormatException - 4.0/10.0 (s = 2.83); ApplicationException - 3.66/10.0 (s = 3.44); KeyNotFoundException - 4.0/10.0 (s = 1.3); GeoException - 3.83/10.0 (s = 2.92). Three of the developers formed the same hypothesis for the fault by examining the model for RobApiException, stating the following:
[…] VC returns an error saying that we cannot set the program pointer to main in the current execution state. Perhaps RobotStudio tries to move the program pointer when it is in running state.
For the less useful exception models, a number of the RobotStudio developers suggested a concrete set of improvements that they believed would raise its level of usefulness, including labeling each of the contexts and providing additional command characteristics, whenever available, to make the model clearer. For instance, one participant stated:
“Its like watching the user over the shoulder but too far away. I can see which tools and windows he or she opens, which commands are issued. But I cannot see any name of an object, no version number of a controller, no file name, not really anything concrete and specific. I think that needs to be tied in.”
Additionally, the finding that the usage contexts of some exceptions were rated more useful than others may be attributed in part to the following observation. Some exceptions, e.g., FormatException and KeyNotFoundException, may not actually result from program faults because programmers often use them for input validation (see the Stack Overflow discussion “Is it a good or bad idea throwing Exceptions when validating data?” at https://stackoverflow.com/questions/1504302/is-it-a-good-or-bad-idea-throwing-exceptions-when-validating-data and many other discussions on the subject). And yet, when asked about FormatException, one developer stated:
“[…] it tells me that the user explicitly or implicitly (as far as I remember it is always done explicitly) was loading a distribution package. The package has it version number defined as part of the root folder name. The version part of the folder name could not be parsed to a .NET Version object.”
In contrast, the developers view exceptions like RobApiException and their corresponding stack traces as more useful because these exceptions concern the movement and control of the industrial robot, and the developers perceive them as the results of actual program faults, as discussed above.
5 Threats to Validity
This paper presents an exploratory study of using hierarchical topic modeling on large-scale interaction data for the purpose of building a hierarchy of usage contexts surrounding stack traces. Such contexts can be useful to understand or debug software faults that exhibit specific stack traces. The assumptions embedded in a hierarchical topic model, such as, the “bag of words” assumption for words in a document, the windowing method, and the modeling approach, are a source of internal threats to validity of our study. To mitigate this threat, we follow prior established techniques for applying topic models. Also, prior studies have successfully analyzed interaction data using topic models with the “bag of words” assumption and a windowing method Damevski_Predicting_2017 ; 7515925 .
In our study, we relied solely on RobotStudio interaction traces to build our model. Therefore, our study’s results may not transfer to other interaction traces or platforms. To mitigate this threat, we posit that the long timespan and large scale of the RobotStudio interaction traces, including this development environment’s use of extensions that extend its capability, offer a significant amount of diversity to our technique.
We surveyed RobotStudio developers to evaluate the usefulness of our hierarchy of contexts. Although the evaluation shows positively that the hierarchy is helpful to debug software faults, the survey sample size is too small to provide robust and generalizable conclusions.
The work also suffers from external threats to validity because we surveyed developers to assess the usefulness of the hierarchy of usage contexts. One threat is that the surveyed developers may be prone to offer positive answers as they know that we will analyze their responses to the survey, i.e., the observer effect. The other is that our approach may be new to them, and this novelty may influence them to respond positively. To mitigate this threat, we followed standard approaches for creating developer surveys and frequently prompted the survey respondents to specify a rationale for their opinions.
6 Related Work
Although researchers have applied topic models to analyze software engineering data Chen2016 ; Panichella:2013:EUT:2486788.2486857 ; 7515925 ; Damevski_Predicting_2017, they have not explored hierarchical topic models, in particular Bayesian non-parametric hierarchical topic models, which offer several advantages for analyzing software engineering data such as interaction traces. We focus our related work discussion on the set of prior work that exists, separately, for both of the data types used in this work, i.e., for mining and understanding both application crash reports and interaction data.
As interaction data is large-scale, consisting of multiple messages per minute of user interaction with the application, a common goal is to extract high-level behaviors from the data that express common behavioral patterns exhibited by a significant cluster of users. Numerous approaches have been suggested to extract such behaviors from IDE data, using hidden Markov models, sequential patterns, Petri nets, and others Damevski:2016:IED:2901739.2901741 ; Murphy-Hill:2012:ISD:2393596.2393645 ; 1316839 , with the purpose of extracting high-level common behaviors exhibited by developers in the field. Our prior work explores the use of the Latent Dirichlet Allocation topic modeling technique, more specifically its temporal variant, for the prediction and recommendation of IDE commands for a specific developer Damevski_Predicting_2017 .
Mining software crash reports has been a popular area of study in recent years, with the ubiquity of systems that collect these reports and the availability of public datasets. Here we highlight only the most relevant studies, which focus on mining exceptions and stack traces in a corpus of crash reports.
Han et al. built wait graphs from stack traces and other messages to diagnose performance bugs Han:2012:PDL:2337223.2337241. Dang et al. clustered crash reports based on call stack similarity Dang:2012:RMC:2337223.2337364, while Wu et al. located bugs by expanding the crash stack with functions in static call graphs, using crash reports that contain stack traces Wu:2014:CLC:2610384.2610386. Davie et al. researched whether a new bug in the same source code as a known bug can be found via bug report similarity measures 6385108.
Crash reports that contain stack traces can be too numerous for engineers to manage. Dhaliwal et al. investigated how to group crash reports based on bugs 6080800. Kaushik and Tahvildari applied information retrieval methods and models to detect duplicate bug reports. They compared multiple information retrieval methods and models, including both word-based models and topic-based models 6178863. Williams and Hollingsworth used the source code change history of a software project to drive and help refine the search for bugs 1463230.
Since bug reports are duplicative and prior knowledge may be used to fix new bugs, crash reports can help reuse debugging knowledge. Gu et al. created a system to query similar bugs from a bug reports database Gu:2012:RDK:2384616.2384684 .
Different from prior work, our aim here is to produce a contextual understanding of stack traces, and their relationship with user interactions. This is based on a large set of interaction traces with embedded stack traces, where a stack trace can be considered as a special message in the interaction traces. While in this paper we always assume a dataset with already combined interaction and stack traces, they need not be combined a priori, as long as relatively reliable timestamps exist in both data sources. The proposed approach is also resilient to minor clock synchronization issues that may arise when combining stack traces and interaction traces that are collected on disparate machines, since it does not require perfect message ordering.
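When the two sources are collected separately, combining them by timestamp can be sketched as a sorted merge. This is an illustrative sketch only: it assumes each source is already sorted by time and represented as (timestamp, message) pairs, and the stack-trace token is a made-up placeholder.

```python
import heapq

def merge_traces(interactions, stack_traces):
    """Merge two timestamp-sorted lists of (timestamp, message) pairs into one trace."""
    return list(heapq.merge(interactions, stack_traces, key=lambda item: item[0]))

interactions = [(1.0, "RapidEditorShow"), (3.0, "StartingVirtualController")]
stack_traces = [(2.0, "STACK:example")]
print(merge_traces(interactions, stack_traces))
```

Because words within a document are treated as interchangeable, a small clock skew between the two sources changes the result only when it moves a stack trace across a window boundary.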
7 Conclusions
Large quantities of software interaction traces are gathered from complex software daily. It is advantageous to leverage such data to improve software quality by discovering faults, performance bottlenecks, or inefficient user interface design. We posit that high-level comprehension of these datasets, via unsupervised approaches to dimension reduction, is useful for improving a myriad of software engineering activities. In this paper, we aim at modeling a large set of user interaction data combined with software crash reports. We leverage a combined dataset collected from ABB RobotStudio, a software application with many thousands of active users. The described approach is novel in attempting to model the combination of the two datasets.
As a modeling technique, hierarchical models such as the Nested Hierarchical Dirichlet Process (NHDP) Bayesian non-parametric topic model enable human interpretation of complex datasets. The model allows us to extract topics, i.e., probability distributions of interactions and crashes, from the document collections and assemble these topics into a tree-like structure. The hierarchical structure of the model allows browsing from a more generic topic to a more specific topic. The tree also reveals certain structure among users’ interactions with the software. Most importantly, the structure shows how an exception co-occurs with other messages, and thus provides context for the exception. We surveyed ABB RobotStudio developers, who consistently found parts of the model very useful, although significantly more work is required to understand and predict the parts of the model that yielded no insight to the developers. Future work also includes investigating semi-supervised learning models that can leverage developer feedback in formulating an interpretable and useful model.
Acknowledgements.
The authors would like to thank the RobotStudio team at ABB Inc for providing the interaction dataset and responding to the survey. The authors are also grateful for the anonymous reviewers’ constructive comments.
References
- (1) van der Aalst, W., Weijters, T., Maruster, L.: Workflow mining: discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering 16(9), 1128–1142 (2004). DOI 10.1109/TKDE.2004.47
- (2) Agrawal, A., Fu, W., Menzies, T.: What is wrong with topic modeling? And how to fix it using search-based software engineering. Information and Software Technology 98, 74–88 (2018)
- (3) Arnold, D.C., Ahn, D.H., De Supinski, B.R., Lee, G.L., Miller, B.P., Schulz, M.: Stack trace analysis for large scale debugging. In: 2007 IEEE International Parallel and Distributed Processing Symposium, p. 64. IEEE (2007)
- (4) Blei, D.M., Griffiths, T.L., Jordan, M.I.: The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM 57(2), 7:1–7:30 (2010). DOI 10.1145/1667053.1667056. URL http://doi.acm.org/10.1145/1667053.1667056
- (5) Blei, D.M., Moreno, P.J.: Topic segmentation with an aspect hidden Markov model. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’01, pp. 343–348. ACM, New York, NY, USA (2001). DOI 10.1145/383952.384021. URL http://doi.acm.org/10.1145/383952.384021
- (6) Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993–1022 (2003)
- (7) Cao, L., Fei-Fei, L.: Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes. In: 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8 (2007). DOI 10.1109/ICCV.2007.4408965
- (8) Chen, T.H., Thomas, S.W., Hassan, A.E.: A survey on the use of topic models when mining software repositories. Empirical Software Engineering 21(5), 1843–1919 (2016). DOI 10.1007/s10664-015-9402-8. URL http://dx.doi.org/10.1007/s10664-015-9402-8
