Evaluating Asynchronous Semantics in Trace-Discovered Resilience Models: A Case Study on the OpenTelemetry Demo
Anatoly A. Krasnovsky

TL;DR
This paper presents a trace-based resilience model for microservices that adds asynchronous semantics for Kafka edges, and evaluates how those semantics affect availability predictions on the OpenTelemetry Demo ("Astronomy Shop").
Contribution
It introduces a method to derive connectivity models directly from OpenTelemetry traces, including asynchronous Kafka edges, and assesses their effect on availability estimates.
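As a rough illustration of deriving a connectivity model from traces, the sketch below turns parent/child span pairs into service-to-service edges and marks Kafka consumer spans as asynchronous. It is not the paper's implementation: the `Span` record, the `discover_edges` helper, and the sample service names are hypothetical, and it assumes consumer spans are linked to producers through a parent ID (real traces may use span links instead).

```python
from collections import namedtuple

# Hypothetical, simplified span record; real OpenTelemetry spans carry more fields.
Span = namedtuple("Span", "span_id parent_id service kind messaging_system")

def discover_edges(spans):
    """Return {(caller, callee): 'sync' | 'async'} from parent/child span pairs."""
    by_id = {s.span_id: s for s in spans}
    edges = {}
    for child in spans:
        parent = by_id.get(child.parent_id)
        if parent is None or parent.service == child.service:
            continue  # root span or intra-service span: no cross-service edge
        # Assumption: a Kafka CONSUMER span marks a non-blocking (async) edge.
        is_async = child.kind == "CONSUMER" and child.messaging_system == "kafka"
        edges[(parent.service, child.service)] = "async" if is_async else "sync"
    return edges

# Illustrative spans only, not data from the paper.
spans = [
    Span("a", None, "frontend", "SERVER", None),
    Span("b", "a", "frontend", "CLIENT", None),
    Span("c", "b", "checkout", "SERVER", None),
    Span("d", "c", "checkout", "PRODUCER", "kafka"),
    Span("e", "d", "accounting", "CONSUMER", "kafka"),
]
edges = discover_edges(spans)
```

Keeping the edge type on the graph lets a later simulation decide whether a failed callee should block the caller.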
Findings
Asynchronous semantics for Kafka edges have minimal impact on availability predictions.
The model accurately reproduces the observed degradation curve under failures.
Explicit asynchronous modeling is unnecessary for immediate HTTP availability in this case.
Abstract
While distributed tracing and chaos engineering are becoming standard for microservices, resilience models remain largely manual and bespoke. We revisit a trace-discovered connectivity model that derives a service dependency graph from traces and uses Monte Carlo simulation to estimate endpoint availability under fail-stop service failures. Compared to earlier work, we (i) derive the graph directly from raw OpenTelemetry traces, (ii) attach endpoint-specific success predicates, and (iii) add a simple asynchronous semantics that treats Kafka edges as non-blocking for immediate HTTP success. We apply this model to the OpenTelemetry Demo ("Astronomy Shop") using a GitHub Actions workflow that discovers the graph, runs simulations, and executes chaos experiments that randomly kill microservices in a Docker Compose deployment. Across the studied failure fractions, the model reproduces the…
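The Monte Carlo step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names, the toy graph, and the success predicate (an endpoint succeeds only if its entry service and every service reachable over synchronous edges are alive) are all hypothetical, and any service, including the entry service, may be killed.

```python
import random

def request_succeeds(service, edges, alive, seen=None):
    """True if `service` is alive and all of its blocking (sync) dependencies succeed."""
    seen = set() if seen is None else seen
    if service not in alive:
        return False
    if service in seen:
        return True  # already checked on this walk (also guards against cycles)
    seen.add(service)
    for (src, dst), kind in edges.items():
        # Async (e.g. Kafka) edges are skipped: they do not block the immediate response.
        if src == service and kind == "sync":
            if not request_succeeds(dst, edges, alive, seen):
                return False
    return True

def estimate_availability(entry, services, edges, fail_fraction, trials=10000, seed=0):
    """Estimate the share of trials in which a request to `entry` succeeds."""
    rng = random.Random(seed)
    k = round(fail_fraction * len(services))  # number of services killed per trial
    ok = 0
    for _ in range(trials):
        killed = set(rng.sample(services, k))  # fail-stop: killed services stay down
        alive = set(services) - killed
        ok += request_succeeds(entry, edges, alive)
    return ok / trials

# Toy graph for illustration only.
services = ["frontend", "checkout", "accounting"]
edges = {("frontend", "checkout"): "sync", ("checkout", "accounting"): "async"}
availability = estimate_availability("frontend", services, edges, fail_fraction=1 / 3)
```

Relabeling the Kafka edge as `"sync"` and rerunning gives the all-blocking baseline, so comparing the two runs shows how much the asynchronous semantics changes the estimate for a given graph.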
