Enhancing Trace Visualizations for Microservices Performance Analysis
Jessica Leone, Luca Traini

TL;DR
This paper introduces a new visualization method for microservices performance analysis that aggregates multiple requests, improving understanding beyond individual request tracing.
Contribution
The paper proposes an innovative visualization approach that extends existing techniques to enable aggregate analysis of multiple microservice requests.
Findings
Preliminary progress in developing the visualization approach.
Enhanced understanding of system performance through aggregate analysis.
Extension of request comparison techniques for broader performance insights.
Jessica Leone (University of L’Aquila, Italy) and Luca Traini (University of L’Aquila, Italy)
Abstract.
Performance analysis of microservices can be a challenging task, as a typical request to these systems involves multiple Remote Procedure Calls (RPC) spanning across independent services and machines. Practitioners primarily rely on distributed tracing tools to closely monitor microservices performance. These tools enable practitioners to trace, collect, and visualize RPC workflows and associated events in the context of individual end-to-end requests. While effective for analyzing individual end-to-end requests, current distributed tracing visualizations often fall short in providing a comprehensive understanding of the system’s overall performance. To address this limitation, we propose a novel visualization approach that enables aggregate performance analysis of multiple end-to-end requests. Our approach builds on a previously developed technique for comparing structural differences of request pairs and extends it for aggregate performance analysis of sets of requests. This paper presents our proposal and discusses our preliminary ongoing progress in developing this innovative approach.
Keywords: Microservices, Distributed Tracing, Performance Analysis
CCS Concepts: Software and its engineering → Software performance; Software and its engineering → Maintaining software; Software and its engineering → Software evolution; Human-centered computing → Visual analytics; Human-centered computing → Visualization toolkits
1. Introduction
Microservices have emerged as a paradigm shift in software development, enabling the efficient testing and deployment of new software features and improvements (Newman, 2015). Due to their modular nature, microservices are particularly well-suited for today’s software market, where the ability to rapidly release software updates and improvements is perceived as a key competitive advantage (Rubin and Rinard, 2016).
However, despite these advantages, adopting microservices also poses significant challenges. A main challenge, for example, is ensuring adequate software performance. Indeed, a proactive approach to performance assurance (e.g., pre-production performance testing (Jiang and Hassan, 2015; Laaber and Leitner, 2018; Traini et al., 2022)) is often impractical in these contexts, given the inherent complexity of these systems (Sridharan, 2017; Veeraraghavan et al., 2016), and the significant pressure to deliver fast-to-market (Rubin and Rinard, 2016; Traini, 2022). Performance behavior of microservices systems usually emerges in the field (Veeraraghavan et al., 2016) as a complex interaction of several RPCs that spans multiple independent services and machines. Furthermore, these systems are constantly undergoing software changes (e.g., multiple releases per day) and are exposed to highly varying workloads (Ardelean et al., 2018), which make them susceptible to unexpected performance regressions (Veeraraghavan et al., 2016; Traini et al., 2021).
To aid performance analysis, practitioners typically rely on distributed tracing (Parker et al., 2020) (e.g., Jaeger (Technologies, 2023), Dapper (Sigelman et al., 2010), Canopy (Kaldor et al., 2017)) as a means to monitor and inspect microservices performance in production. Distributed tracing tools capture the workflow of causally-related events (i.e., work done to process a request) within and among the components of a microservices system (Sambasivan et al., 2016), and provide visualizations to support performance analysis of individual end-to-end requests, e.g., swimlane visualizations (Davidson et al., 2023; Sigelman et al., 2010; Technologies, 2023).
Although these visualizations are particularly effective for analyzing individual requests, they are limited when it comes to providing a comprehensive picture of microservices performance. Understanding request performance in aggregate (Parker et al., 2020) often requires switching between various visualization tools, making the process cumbersome and time-consuming (Davidson et al., 2023). Current distributed tracing tools do not provide out-of-the-box support for aggregate performance analysis.
In this paper, we propose a novel visualization approach that facilitates comprehension of the relationship between request characteristics and end-to-end response time behavior. The proposed approach enables seamless performance analysis of multiple requests, eliminating the need to switch between multiple tools and visualizations. Our approach builds upon a previously developed visualization method for comparing request pairs (Farro, 2018) and extends it to accommodate various types of aggregate analysis on sets of requests. This new approach will empower engineers to swiftly identify significant and recurring request characteristics that are linked to specific end-to-end response time behavior, thereby enabling a more effective performance analysis of microservices. Here, we present our proposal and report on our preliminary ongoing progress in developing this novel visualization approach.
2. Motivation
Distributed tracing tools provide swimlane visualizations (Kaldor et al., 2017; Sigelman et al., 2010; Technologies, 2023; Davidson and Mace, 2022) as a means to investigate end-to-end response time (refer to Fig. 1 for an illustration). These visualizations are particularly useful for performance analysis of individual requests. For example, they can be used to investigate how each RPC contributes to the end-to-end response time of a specific request, and pinpoint potential bottlenecks.
Despite their usefulness, swimlane visualizations have certain limitations when it comes to more complex performance analyses. For instance, analyzing the performance of individual end-to-end requests can be misleading without considering the appropriate context (Davidson et al., 2023). The response time of a request can only be considered anomalous when compared to other requests of the same type (Anand et al., 2020). Additionally, engineers are often more interested in investigating overall response time trends rather than the performance of individual requests (Parker et al., 2020). Different end-to-end response time behaviors can be associated with specific request characteristics, such as particular RPC execution paths and/or RPC performance behaviors, and engineers can be interested in identifying these characteristics to spot potential performance bugs and/or optimization opportunities (Parker et al., 2020; Krushevskaja and Sandler, 2013; Cortellessa and Traini, 2020).
Distributed tracing tools lack support for this type of analysis, which currently requires the use of multiple visualizations and tools in tandem (Davidson et al., 2023), e.g., Jaeger (Technologies, 2023) and Kibana (Elastic, 2023b). A common approach is to first recognize recurrent performance behaviors that warrant investigation, and then examine individual requests to characterize relevant performance behaviors. This is usually achieved by detecting “modes” of the end-to-end response time distribution (e.g., using Kibana), which indicate meaningful recurrent performance behaviors. Then, samples of requests associated with each mode can be extracted and individually examined (e.g., using Jaeger’s swimlanes) to identify unique characteristics that lead to specific performance behaviors, i.e., modes.
Unfortunately, this approach can be particularly tedious as it necessitates the manual examination and comparison of multiple requests across multiple visualizations and tools. Furthermore, even when successful, it falls short of providing a satisfactory level of confidence. Identifying specific characteristics of a particular mode requires verifying that these characteristics appear exclusively in requests belonging to that mode, which can be challenging using current tools.
For instance, consider the scenario illustrated in Fig. 2, which depicts the distribution of the end-to-end response time for a specific type of request, such as loading the homepage of a website. As depicted in the figure, requests exhibit three distinct response time behaviors, i.e., modes. Suppose that the rightmost mode is characterized by a unique request characteristic, such as a specific RPC execution path. That is, a specific RPC is invoked in all requests belonging to the rightmost mode, but not in the others (e.g., due to a cache miss). Identifying patterns like this can be particularly challenging with current distributed tracing tools.
3. Proposal
We present a novel interactive visualization approach for aggregate performance analysis of end-to-end requests. Our approach provides a dashboard comprising two main interactive components: a tree and a histogram (see Fig. 3). The interactive tree offers an overview of the requests’ workflows in terms of RPC execution paths, while the histogram illustrates the corresponding end-to-end response time distribution. The key insight behind this approach is to enable the simultaneous analysis of various aspects of requests and the relationship among these aspects through a unified dashboard.
The tree visualization proposed in our approach builds upon a previous visualization technique provided in Jaeger (Technologies, 2023), which enables the comparison of two requests in terms of RPC execution paths (Farro, 2018). We extend this technique by proposing a tree visualization that enables the analysis of an entire set of requests. Each node of the tree represents an RPC invocation within a specific execution path, where the leftmost node represents the root RPC, and edges indicate direct RPC invocation. For example, in Fig. 3, the red node labeled as getcheapestroute indicates the RPC execution path getbycheapest → getleftticket → getcheapestroute. It is worth noting that when a particular RPC invokes the same RPC multiple times, this leads to a single node in the tree. In other words, if the RPC getbycheapest invokes the RPC getleftticket multiple times, there will be only one child node referring to getleftticket. The rationale behind this approach is to address the potential complexity of requests by providing a more concise visualization. An RPC execution path will appear in the tree only if it is present in at least one request within the set of requests being analyzed.
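The tree construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the span schema (dictionaries with `span_id`, `parent_id`, and `name` keys) and the function names `path_counts` and `build_tree` are assumptions introduced only for illustration.

```python
from collections import Counter


def path_counts(spans):
    """Count how many times each RPC execution path is invoked in one request.

    `spans` is a hypothetical list of dicts with keys `span_id`, `parent_id`
    (None for the root RPC) and `name`.
    """
    by_id = {s["span_id"]: s for s in spans}
    counts = Counter()
    for span in spans:
        # Walk up to the root to recover the full execution path of this span.
        path = []
        node = span
        while node is not None:
            path.append(node["name"])
            node = by_id.get(node["parent_id"])
        counts[tuple(reversed(path))] += 1
    return counts


def build_tree(requests):
    """Union of the execution paths of a set of requests, as nested dicts.

    Repeated invocations of the same RPC by the same caller collapse into a
    single node, and a path appears if at least one request contains it.
    """
    tree = {}
    for spans in requests:
        for path in path_counts(spans):
            node = tree
            for name in path:
                node = node.setdefault(name, {})
    return tree


# Two illustrative requests (made-up data, not taken from the paper's dataset).
request_a = [
    {"span_id": 1, "parent_id": None, "name": "getbycheapest"},
    {"span_id": 2, "parent_id": 1, "name": "getleftticket"},
    {"span_id": 3, "parent_id": 1, "name": "getleftticket"},
    {"span_id": 4, "parent_id": 2, "name": "getcheapestroute"},
]
request_b = [
    {"span_id": 1, "parent_id": None, "name": "getbycheapest"},
    {"span_id": 2, "parent_id": 1, "name": "getleftticket"},
]

tree = build_tree([request_a, request_b])
```

Here `getleftticket` is invoked twice in `request_a` but yields a single child node, while the per-request counts returned by `path_counts` remain available for the color encoding.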
Similar to the trace comparison tool of Jaeger (Farro, 2018), we utilize color encoding to highlight variations within a set of requests. However, in contrast to this approach, our color encoding scheme denotes the degree of variance of a particular request characteristic within a group of requests, rather than the difference between two requests. Furthermore, we employ a continuous color scale, as opposed to the discrete one used by Jaeger (Farro, 2018). As an example, we can use color encoding to depict the variability in the number of invocations of a specific RPC execution path within a group of requests. A white node indicates that the corresponding RPC execution path appears the same number of times in all the requests under analysis. Conversely, a red node indicates a substantial change in the number of invocations of the corresponding RPC execution path from one request to another. We use standardized measures of dispersion, such as the Coefficient of Variation (CV) (Everitt, 1998), to encode these differences into colors. For instance, a CV of 0 results in a white node, while a CV ≥ 1 results in a red node. The shade of color gradually transitions from white to red as the CV value increases, with 0 ≤ CV ≤ 1.
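As a rough illustration of this encoding, the sketch below computes the CV of per-request invocation counts and maps it onto a white-to-red scale. The use of the population standard deviation, the treatment of a zero mean as CV = 0, the linear RGB interpolation, and the function names are all assumptions made for this example; the text does not specify these details.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV = standard deviation / mean (population standard deviation assumed)."""
    m = mean(values)
    if m == 0:
        # Assumption: a characteristic that is always zero has no variation.
        return 0.0
    return pstdev(values) / m


def cv_to_hex(cv):
    """Map a CV to a hex color: 0 gives white, values of 1 or more give red."""
    t = min(max(cv, 0.0), 1.0)       # clamp to [0, 1]
    fade = round(255 * (1 - t))      # green/blue channels fade out as CV grows
    return f"#ff{fade:02x}{fade:02x}"


# Invocation counts of one execution path across four requests (made-up data).
stable = [1, 1, 1, 1]      # same count everywhere -> white node
divergent = [0, 0, 0, 1]   # path present in only one request -> red node

stable_color = cv_to_hex(coefficient_of_variation(stable))
divergent_color = cv_to_hex(coefficient_of_variation(divergent))
```

Clamping at 1 reflects the statement that any CV of 1 or more is rendered as fully red; a different dispersion measure would only require replacing `coefficient_of_variation`.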
This approach facilitates the rapid identification of RPC execution paths that exhibit divergent request characteristics. An engineer can delve deeper into a specific RPC execution path by interacting with its corresponding node, bringing up a new visualization such as the pie chart displayed in Fig. 3. This allows for a more in-depth examination of the divergent characteristics observed in that specific RPC execution path. The pie chart in Fig. 3 illustrates, for instance, two distinct behaviors, with 77% of requests not involving any invocations of the path getbycheapest → getleftticket → getcheapestroute, and 23% of requests involving one invocation of such path. By interacting with the corresponding slice of the pie chart, the tool illustrates how this particular request characteristic affects end-to-end response time by highlighting the corresponding area of the histogram. For instance, Fig. 3 shows that the rightmost mode of the end-to-end response time distribution is characterized by the presence of the particular RPC execution path getbycheapest → getleftticket → getcheapestroute.
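The link between a pie-chart slice and the histogram can be sketched as follows. This is only an illustration under stated assumptions: requests are represented by two parallel lists (invocation counts of the selected path and end-to-end response times in milliseconds), bins have a fixed width, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def slice_shares(invocations):
    """Fraction of requests per invocation count (the pie-chart slices)."""
    total = len(invocations)
    return {k: n / total for k, n in sorted(Counter(invocations).items())}


def highlighted_bins(invocations, response_ms, selected, bin_width_ms):
    """For each histogram bin, return (all requests, requests in the slice).

    Bins are keyed by their lower bound in milliseconds.
    """
    bins = defaultdict(lambda: [0, 0])
    for count, rt in zip(invocations, response_ms):
        start = int(rt // bin_width_ms) * bin_width_ms
        bins[start][0] += 1
        if count == selected:
            bins[start][1] += 1
    return {start: tuple(v) for start, v in sorted(bins.items())}


# Made-up data: the path is invoked once only in the slowest request.
invocations = [0, 0, 0, 1]
response_ms = [110, 120, 180, 420]

shares = slice_shares(invocations)
bins = highlighted_bins(invocations, response_ms, selected=1, bin_width_ms=100)
```

In this toy data set, selecting the slice with one invocation highlights only the rightmost bin, which mirrors the situation described for the rightmost mode in Fig. 3.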
Another advantage of this approach is its versatility in analyzing various request characteristics. This can be accomplished by changing the color encoding assigned to tree nodes. For instance, the method can be adapted to examine how different RPC execution times influence the distribution of end-to-end response time. Or, similarly, it can be utilized to investigate the impact of different HTTP header values on response time (in these situations, other statistical metrics such as the Gini index (Gini, 1936) or Entropy (Shannon, 1948) can be applied). Furthermore, we plan to make the analysis bidirectional, allowing the engineer to begin their investigation from either the tree or the histogram to understand response time behavior. If the engineer chooses to start from the response time distribution, they can select a specific range using a slider selector. This selection will result in an updated color scheme on the tree, where the colors will now indicate the divergence of the selected set of requests compared to all others (using statistical measures such as Kullback-Leibler divergence (Kullback and Leibler, 1951)). This approach will enable engineers to identify how certain modes of response time distribution correlate with specific request characteristics.
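To make the histogram-driven direction more concrete, the sketch below computes the Kullback-Leibler divergence between the distribution of a request characteristic in the selected range and in all other requests. The additive smoothing (parameter `alpha`), the base-2 logarithm, and the function name are assumptions added so that the example is well defined when a value is absent from one of the two groups; the text does not prescribe them.

```python
from collections import Counter
from math import log2


def kl_divergence(selected, others, alpha=1.0):
    """KL divergence D(P || Q) between two samples of a discrete characteristic.

    P is estimated from `selected` (requests inside the chosen response-time
    range) and Q from `others`. Additive smoothing with `alpha` is an
    assumption that avoids division by zero for unseen values.
    """
    support = sorted(set(selected) | set(others))
    p_counts, q_counts = Counter(selected), Counter(others)
    p_total = len(selected) + alpha * len(support)
    q_total = len(others) + alpha * len(support)
    divergence = 0.0
    for value in support:
        p = (p_counts[value] + alpha) / p_total
        q = (q_counts[value] + alpha) / q_total
        divergence += p * log2(p / q)
    return divergence


# Invocation counts of one execution path (made-up data).
in_range = [1, 1, 1]       # requests inside the selected response-time range
out_of_range = [0, 0, 0]   # all remaining requests

score = kl_divergence(in_range, out_of_range)
```

A larger score would translate into a more saturated node color, indicating that the selected requests differ markedly from the rest on that execution path. With smoothing, two groups of different sizes with the same empirical distribution yield a small positive value rather than exactly zero.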
We envision two potential challenges for the development of our visualization approach. The first challenge concerns the efficiency of our tool. Modern distributed tracing tools collect huge volumes of traces per day (Kaldor et al., 2017), which can make the interaction with the proposed visualizations computationally intensive. In order to enable an efficient generation of visualizations, we plan to preprocess traces into an optimized format. Figure 4 depicts the workflow of our approach. The traces of the microservices system are continuously collected by the distributed tracing tool (i.e., Trace Collector) and persisted in a Trace Storage, such as Elasticsearch (Elastic, 2023a). The stored traces are then periodically preprocessed and persisted in dedicated storage in an optimized format. The visual dashboard application can then directly query the optimized trace storage to enable the efficient generation of visualizations.
The second challenge we envision is related to the user experience. When too much information is displayed on the screen, the user can become overwhelmed (Davidson et al., 2023). To mitigate this, we intend to keep our dashboard as minimalistic as possible to avoid placing an excessive burden on the user. For instance, certain request types may involve a large number of RPCs, which can result in an extremely large tree structure. To reduce the user’s cognitive load, we will allow the user to hide the RPCs invoked within a particular execution path by double-clicking on the corresponding node. This will enable the user to navigate the tree more easily.
4. Ongoing work
We are currently engaged in the development of a prototype for our visualization approach. For our pilot study, we are utilizing a dataset of traces that was generated in a previous work of the second author (Traini and Cortellessa, 2021). The dataset was created using the widely-used microservices benchmark system TrainTicket (Zhou et al., 2021; Li et al., 2021; Zhang et al., 2022; Guo et al., 2020; Zhou et al., 2019), and the load generator PPTAM (Avritzer et al., 2019). Our technology stack includes the use of Jaeger (Technologies, 2023) as the distributed tracing collector and Elasticsearch (Elastic, 2023a) for storing both common and optimized traces. The dashboard has been developed using D3.js (Bostock, 2023) for visualizations and Flask (Pallets, 2023) for the backend service.
At present, we have implemented the UI of our working prototype, along with its key visualization components, namely the interactive tree and histogram. However, the preprocessing engine and a portion of the server infrastructure are still under development. For this reason, we are unable to include an actual screenshot of our working prototype at this stage. Nonetheless, we provide a mock-up of the expected outcome in Fig. 3.
Our next step entails the completion of the first version of our prototype. This first version will be centered exclusively on the analysis of the number of invocations of RPC execution paths and their relationship with end-to-end performance behavior. In subsequent stages, we plan to extend the capability of the prototype beyond this specific request characteristic by exploring other metrics such as RPC execution time. Additionally, we intend to enable bidirectional analyses by supporting investigations initiated from both the tree and the histogram (currently, our working prototype only supports a singular direction, i.e., starting from the tree).
It is noteworthy that our current design only supports the individual inspection of specific request characteristics and does not enable the simultaneous analysis of multiple request characteristics in tandem. This supplementary capability may become crucial for more elaborate performance analyses. Thus, as a long-term goal, we intend to broaden our approach by enabling the analysis of combined sets of multiple request characteristics and their interrelation with end-to-end performance.
Regarding the evaluation of our proposed approach, we plan to continue using TrainTicket as the reference case study, given its relevance in the software engineering community. To further enhance the assessment of our proposed approach, we will consider a variety of workload scenarios, including some that involve synthetically injected performance bugs (Traini and Cortellessa, 2021). Furthermore, we aim to experiment with real-world distributed traces from large microservices systems to facilitate the transition of our approach to practice. For instance, we plan to utilize the dataset of traces recently shared by Alibaba (Luo et al., 2021).
5. Related work
Previous research on visualization approaches for distributed systems has traditionally focused on the analysis of individual requests or comparison between two requests. ShiViz (Beschastnikh et al., 2020) visualizes a logged distributed execution as a time-space diagram. The study of Sambasivan et al. (Sambasivan et al., 2013) compares three well-known visualization approaches in the context of presenting the results of one automated performance root cause analysis approach (Sambasivan et al., 2011). TraVista (Anand et al., 2020) builds on the prevalent approach of visualizing a single request by augmenting this view with information to aid the user in contextualizing the performance of one request with respect to other requests. Jaeger (Technologies, 2023) enables the comparison of structural aspects of two requests (Farro, 2018). Commercial APM tools (Ahmed et al., 2016), such as Dynatrace (Inc., 2023b), AppDynamics (Inc., 2023a), or Instana (IBM, 2023), provide features that support aggregated analysis of end-to-end requests. For instance, Dynatrace’s Service Flow feature (Inc., 2023c) shows aggregate workflows of end-to-end requests along with their associated characteristics. As far as we know, current APM tools do not provide dedicated visualizations to analyze the relationship between requests’ characteristics and end-to-end performance behavior. Other studies have proposed automated methods to aid the detection of patterns in requests’ characteristics associated with anomalous end-to-end performance behavior (Traini and Cortellessa, 2021; Bansal et al., 2020; Cortellessa and Traini, 2020; Krushevskaja and Sandler, 2013).
Recent research by Davidson and Mace (Davidson and Mace, 2022) emphasizes the importance of visualization in the context of systems research and encourages further effort into this topic. Davidson et al. (Davidson et al., 2023) performed a qualitative interview study to identify shortcomings of existing distributed tracing tools, and expose various open research problems that involve several domains (including visualization research).
6. Conclusion
In this paper, we presented a novel visualization approach for microservices performance analysis. Our approach addresses the limitations of current distributed tracing visualizations by enabling aggregate performance analysis of multiple end-to-end requests. Further studies are required to evaluate the effectiveness of our proposal and gather feedback from practitioners.
Acknowledgements.
This work is supported by “Territori Aperti” (a project funded by Fondo Territori Lavoro e Conoscenza CGIL, CSIL and UIL), and by European Union - NextGenerationEU - National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) - Project: “SoBigData.it - Strengthening the Italian RI for Social Mining and Big Data Analytics” - Prot. IR0000013 - Avviso n. 3264 del 28/12/2021. We would like to thank Antinisca Di Marco and Giovanni Stilo for the useful suggestions and comments that were helpful in improving the paper.
References
- Ahmed et al. (2016). Tarek M. Ahmed, Cor-Paul Bezemer, Tse-Hsun Chen, Ahmed E. Hassan, and Weiyi Shang. 2016. Studying the Effectiveness of Application Performance Management (APM) Tools for Detecting Performance Regressions for Web Applications: An Experience Report. In Proceedings of the 13th International Conference on Mining Software Repositories (Austin, Texas) (MSR ’16). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/2901739.2901774
- Anand et al. (2020). Vaastav Anand, Matheus Stolet, Thomas Davidson, Ivan Beschastnikh, Tamara Munzner, and Jonathan Mace. 2020. Aggregate-Driven Trace Visualizations for Performance Debugging. arXiv:2010.13681 [cs].
- Ardelean et al. (2018). Dan Ardelean, Amer Diwan, and Chandra Erdman. 2018. Performance Analysis of Cloud Applications. In Proceedings of the 15th USENIX Conference on Networked Systems Design and Implementation (Renton, WA, USA) (NSDI ’18). USENIX Association, USA, 405–417.
- Avritzer et al. (2019). Alberto Avritzer, Daniel Menasché, Vilc Rufino, Barbara Russo, Andrea Janes, Vincenzo Ferme, André van Hoorn, and Henning Schulz. 2019. PPTAM: Production and Performance Testing Based Application Monitoring. In Companion of the 2019 ACM/SPEC International Conference on Performance Engineering (Mumbai, India) (ICPE ’19). Association for Computing Machinery, New York, NY, USA, 39–40. https://doi.org/10.1145/3302541.3311961
- Bansal et al. (2020). Chetan Bansal, Sundararajan Renganathan, Ashima Asudani, Olivier Midy, and Mathru Janakiraman. 2020. DeCaf: Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice (Seoul, South Korea) (ICSE-SEIP ’20). Association for Computing Machinery, New York, NY, USA, 201–210. https://doi.org/10.1145/3377813.3381353
- Beschastnikh et al. (2020). Ivan Beschastnikh, Perry Liu, Albert Xing, Patty Wang, Yuriy Brun, and Michael D. Ernst. 2020. Visualizing Distributed System Executions. ACM Trans. Softw. Eng. Methodol. 29, 2 (March 2020). Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3375633
- Bostock (2023). Mike Bostock. 2023. D3: Data-Driven Documents. https://d3js.org Accessed 2023-01-21.
