JUCS - Journal of Universal Computer Science 29(11): 1274-1297, doi: 10.3897/jucs.112513
Distributed Tracing for Troubleshooting of Native Cloud Applications via Rule-Induction Systems
Arnak Poghosyan§, Ashot Harutyunyan|, Naira Grigoryan, Clement Pang
‡ VMware, Palo Alto, United States of America; § Institute of Mathematics of NAS RA, Yerevan, Armenia; | Institute for Informatics and Automation Problems of NAS RA, Yerevan, Armenia; ¶ Yerevan State University, Yerevan, Armenia
Open Access
Abstract
Diagnosing IT issues is challenging in large-scale distributed cloud environments because of the complex and non-deterministic interrelations between system components. Modern monitoring tools rely on AI-empowered data analytics for detection, root cause analysis, and rapid resolution of performance degradation. However, the successful adoption of AI solutions is anchored on trust. System administrators will not blindly follow recommendations unless the solutions are sufficiently interpretable. Explainable AI is gaining popularity because it improves confidence and trust in intelligent solutions. For many industrial applications, explainable models with moderate accuracy are preferable to highly precise black-box ones. This paper shows the benefits of rule-induction classification methods, particularly RIPPER, for the root cause analysis of performance degradations. RIPPER expresses the causes of problems as a set of rules that system administrators can use in remediation processes. Native cloud applications are based on the microservices architecture to exploit the benefits of distributed computing. Such applications can be monitored via distributed tracing, which follows the passage of requests through different microservices. We discuss applying rule-learning approaches to the trace traffic passing through a malfunctioning microservice in order to explain the problem. Experiments on datasets from cloud environments demonstrated the applicability of such approaches and revealed their benefits.
Keywords
cloud-native applications, application troubleshooting, distributed tracing, RED metrics, root cause analysis, explainable AI, rule-induction systems, RIPPER