RuleKit: A Comprehensive Suite for Rule-Based Learning

Adam Gudyś; Marek Sikora; Łukasz Wróbel

arXiv:1908.01031·cs.LG·January 28, 2020

TL;DR

RuleKit is a versatile, open-source software suite that facilitates rule-based learning for various predictive tasks, combining interpretability with flexible experimental options and user-guided induction.

Contribution

It introduces a comprehensive, user-friendly tool for rule learning applicable to classification, regression, and survival analysis, with flexible schemes and multiple interfaces.

Findings

01. Supports classification, regression, and survival analysis

02. Enables hypothesis verification through user-guided induction

03. Available as Java API, R package, and RapidMiner plugin

Abstract

Rule-based models are often used for data analysis as they combine interpretability with predictive power. We present RuleKit, a versatile tool for rule learning. Based on a sequential covering induction algorithm, it is suitable for classification, regression, and survival problems. User-guided induction makes it easier to verify hypotheses concerning data dependencies which are expected or of interest. The powerful and flexible experimental environment allows straightforward investigation of different induction schemes. The analysis can be performed in batch mode, through a RapidMiner plug-in, or through an R package. A documented Java API is also provided for convenience. The software is publicly available on GitHub under the GNU AGPL-3.0 license.

Full text

Adam Gudyś

Marek Sikora

Łukasz Wróbel

Institute of Informatics, Silesian University of Technology, 44-100 Gliwice, Poland

Keywords: rule learning, classification, regression, survival analysis, user-guided induction, knowledge discovery

1 Introduction

Thanks to the combination of predictive and descriptive capabilities, rules have been applied in machine learning (especially in knowledge discovery) for decades. Among the many rule induction strategies, sequential covering is one of the most popular (Fürnkranz et al., 2012). It consists of iteratively adding rules, each explaining a part of the training set, until all the examples are covered. This approach leads to different models than those obtained by extracting rules from trees induced with a divide-and-conquer strategy (Breiman et al., 1984). In previous research we confirmed the effectiveness of our variant of the sequential covering strategy on dozens of data sets in classification, regression, and survival analysis (Wróbel et al., 2016, 2017). We also showed the usefulness of user-guided induction, which allows the user's preferences or domain knowledge to be introduced into the learning process (Sikora et al., 2019), a feature particularly valuable in medical applications.
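The sequential covering loop described above can be illustrated with a minimal sketch. It is not RuleKit's algorithm: the `Condition` and `Rule` classes, the equality-only conditions, the precision-based greedy growing, and the toy data are all assumptions made for illustration, and pruning is omitted entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Hypothetical elementary condition: attribute == value."""
    attribute: str
    value: object

    def covers(self, example):
        return example[self.attribute] == self.value


@dataclass
class Rule:
    """Hypothetical rule: IF all conditions hold THEN label."""
    conditions: list
    label: object

    def covers(self, example):
        return all(c.covers(example) for c in self.conditions)


def precision(rule, examples, target):
    """Fraction of covered examples whose class matches the rule's label."""
    covered = [e for e in examples if rule.covers(e)]
    if not covered:
        return 0.0
    return sum(e[target] == rule.label for e in covered) / len(covered)


def grow_rule(examples, uncovered, target, label):
    """Greedily add the condition that most improves precision."""
    rule = Rule([], label)
    attributes = [a for a in examples[0] if a != target]
    while precision(rule, examples, target) < 1.0:
        best, best_q = None, precision(rule, examples, target)
        # Candidate conditions are drawn from uncovered positives the rule still covers,
        # so the grown rule always covers at least one of them.
        seeds = [e for e in uncovered if e[target] == label and rule.covers(e)]
        for e in seeds:
            for a in attributes:
                cond = Condition(a, e[a])
                if cond in rule.conditions:
                    continue
                q = precision(Rule(rule.conditions + [cond], label), examples, target)
                if q > best_q:
                    best, best_q = cond, q
        if best is None:  # no condition improves the rule any further
            break
        rule.conditions.append(best)
    return rule


def sequential_covering(examples, target):
    """Add rules one by one until every training example is covered."""
    rules, uncovered = [], list(examples)
    while uncovered:
        label = uncovered[0][target]
        rule = grow_rule(examples, uncovered, target, label)
        rules.append(rule)
        uncovered = [e for e in uncovered if not rule.covers(e)]
    return rules


# Toy data (invented for this sketch).
toy_examples = [
    {"outlook": "sunny", "windy": False, "play": "yes"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "no"},
]
toy_rules = sequential_covering(toy_examples, target="play")
```

On this toy set the loop stops once the list of uncovered examples is empty, which is the stopping criterion the paragraph describes. A tree-based learner would instead partition the whole set recursively.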

In spite of their numerous advantages, relatively few sequential covering rule induction algorithms are available as ready-to-use software. Examples include CN2 (Clark and Niblett, 1989) included in the Orange suite (Demšar et al., 2013), AQ (Michalski, 1969) implemented in Rseslib 3 (Wojna and Latkowski, 2019), and RIPPER (Cohen, 1995) and M5Rules (Holmes et al., 1999) contained in Weka (Witten et al., 2016).

We present RuleKit, a comprehensive suite for training and evaluating rule-based data models. Equipped with multiple useful features such as user-guided induction, it is the first tool suitable for classification, regression, and survival analysis problems. It additionally stands out from its competitors in convenience: besides the batch experimental environment, it can be integrated with RapidMiner and R.

2 RuleKit Features

The following features make RuleKit a powerful data analysis tool:

(i) Ability to address different problems: classification, regression, and survival analysis.

(ii) Various ways to run the analysis: batch mode, RapidMiner plug-in, R package.

(iii) Multiplicity of algorithm parameters. For instance, over 40 rule quality measures are available, and users can additionally define their own formulas.

(iv) Integrated experimental environment: the software facilitates automated investigation of various algorithm configurations over multiple data sets. Different experimental schemes (train-test, cross-validation) are supported, and tens of performance metrics are provided for model assessment.

(v) User-guided induction: the possibility to specify the initial set of rules and the preferred and forbidden conditions/attributes, together with the multiplicity of options and modes, allows the model to be tailored to the user's requirements. This may be useful, e.g., in verifying hypotheses concerning data dependencies which are expected or of interest.

(vi) Computational scalability: independent steps of the induction algorithms (e.g., the evaluation of different conditions) are distributed over multiple threads, allowing RuleKit to take advantage of multi-core CPUs as well as high-performance clusters. Bit-level parallelism is also employed for maximum computational performance.

(vii) Portability: the suite is distributed as a Java application, thus it can be run on the majority of operating systems, including Windows, Linux, and OS X.

(viii) Extensibility: the software together with the source code is publicly available on GitHub under the GNU AGPL-3.0 licence: https://github.com/adaa-polsl/RuleKit. The documented API allows straightforward integration of the library with other projects and/or extending its functionality.
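To make the notion of a rule quality measure in item (iii) concrete, the sketch below computes two well-known measures from a rule's confusion matrix and shows how a user-defined formula could be plugged in. The text above does not list which measures RuleKit provides, so the choice of precision and C2, the function names, and the `(p, n, P, N)` calling convention are assumptions of this illustration. Here `p` and `n` are the positive and negative examples covered by the rule, and `P` and `N` are the totals in the data set.

```python
def precision(p, n, P, N):
    """Share of covered examples that are positive."""
    return p / (p + n)


def c2(p, n, P, N):
    """C2 measure: Coleman's measure weighted by the rule's recall."""
    coleman = (N * p - P * n) / (N * (p + n))
    return coleman * (P + p) / (2 * P)


def best_rule(candidates, measure, P, N):
    """Pick the candidate (name, p, n) with the highest quality under `measure`."""
    return max(candidates, key=lambda c: measure(c[1], c[2], P, N))


# Hypothetical candidates: (name, covered positives, covered negatives).
candidates = [("specific", 10, 0), ("general", 30, 10)]

# A user-defined formula is just another callable with the same signature;
# this one is plain recall, written inline.
user_measure = lambda p, n, P, N: p / P

picked_by_precision = best_rule(candidates, precision, P=50, N=50)
picked_by_c2 = best_rule(candidates, c2, P=50, N=50)
picked_by_user = best_rule(candidates, user_measure, P=50, N=50)
```

Different measures trade accuracy against coverage differently. On these invented numbers precision prefers the specific rule, while recall prefers the general one. The choice of measure is therefore a meaningful algorithm parameter.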

3 Case Studies

Batch mode. This example demonstrates running a RuleKit batch analysis on the deals classification data set (predicting whether a person making a purchase will be a future customer). The batch mode is run with the java -jar RuleKit experiments.xml command, where the XML file describes the parameter sets and data sets to be investigated (Figure 1a).

As a result of the training, a text report is produced (Figure 1b). It contains a list of generated rules (with corresponding confusion matrices and statistical significance), information about example coverage, model characteristics (number of rules/conditions, average rule precision/coverage, etc.), and performance metrics calculated on the training set (accuracy, error, etc.). Depending on the problem, the significance of rules is established with different tests (Fisher's exact, $\chi^{2}$, or log-rank). The training may be followed by applying the model on unlabelled data. In this stage, a comma-separated table with values of performance metrics evaluated on the test set is produced.
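As an illustration of how a rule's significance can be derived from its confusion matrix, the following sketch computes a one-sided Fisher's exact test using only the standard library. Whether RuleKit uses a one-sided or two-sided variant is not stated in the text, so the one-sided form, the function name, and the `(p, n, P, N)` arguments (covered positives, covered negatives, all positives, all negatives) are assumptions.

```python
from math import comb


def fisher_exact_greater(p, n, P, N):
    """P-value for covering at least p positives among p + n covered examples.

    Under the null hypothesis the covered examples are a random draw from
    the P + N training examples, so the count of covered positives follows
    a hypergeometric distribution.
    """
    covered = p + n
    total = comb(P + N, covered)
    upper = min(P, covered)
    # math.comb returns 0 when more negatives are requested than exist.
    tail = sum(comb(P, k) * comb(N, covered - k) for k in range(p, upper + 1))
    return tail / total


# A rule covering all 3 positives and none of the 3 negatives.
p_value = fisher_exact_greater(p=3, n=0, P=3, N=3)
```

A small p-value indicates that the class distribution among the covered examples is unlikely to arise by chance. With only six examples, even a perfect rule reaches just 1/20.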

RapidMiner plug-in. An alternative way of performing an experiment is to integrate RuleKit with RapidMiner. The plug-in provides the user with two operators: RuleKit Generator and RuleKit Performance. The former is a RapidMiner learner that induces various types of rule models. The latter extends the standard RapidMiner Performance operator and allows calculation of performance metrics as well as gathering of model characteristics. In Figure 2 we present an example RapidMiner process which performs regression analysis on the methane data set (predicting methane concentration in a coal mine) and a wizard for specifying the user's knowledge in the guided induction.

R package. As the last case study, we present the application of the RuleKit R package to analyzing factors contributing to patients' survival following bone marrow transplants. The corresponding data set (BMT-Ch) is integrated with the package in the form of a standard R data frame. Training and applying a model are performed by the function learn_rules, which returns a named list containing induced rules, survival function estimates, test set performance metrics, etc. In Figure 3 we provide example R code for training the model and visualizing the corresponding survival function estimates.
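For readers unfamiliar with survival function estimates, the sketch below shows the standard Kaplan-Meier estimator on a handful of observations. The text above does not state which estimator RuleKit uses, so treating the estimates as Kaplan-Meier curves is an assumption, and the function name and toy data are invented for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times:  observed follow-up times
    events: 1 if the event occurred at that time, 0 if the case was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations that share the same time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored cases leave the risk set.
        at_risk -= removed
    return curve


# Toy follow-up data: five patients, two of them censored.
km_curve = kaplan_meier(times=[1, 2, 2, 3, 4], events=[1, 1, 0, 1, 0])
```

Computing one such curve for the examples covered by each rule gives a rule-specific survival estimate that can be plotted and compared across rules.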

4 Conclusions and Future Work

We demonstrated that RuleKit can be successfully applied to training and evaluating rule-based models in classification, regression, and survival tasks. The multiplicity of options and modes, together with the powerful and flexible experimental environment, makes the presented suite a useful tool for data analysis and knowledge discovery. In the future, we plan to extend RuleKit with algorithms for inducing action rules (Hajja et al., 2014) and oblique rules (Sikora and Gudyś, 2013). The applicability of the suite could be additionally enhanced by providing a Python wrapper or a standalone graphical interface.

Acknowledgments

This work was supported by the Polish National Centre for Research and Development (NCBiR) within the Operational Programme Intelligent Development (grant no. POIR.04.01.02-00-0024/17-00); the Rector of Silesian University of Technology (grant no. 02/020/RGJ18/0126); and the Institute of Informatics at Silesian University of Technology within the statutory research project (BKM18/RAU2/556).

Bibliography

1. Breiman et al. (1984). L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall/CRC, Boca Raton, London, New York, Washington, 1984.
2. Clark and Niblett (1989). P. Clark and T. Niblett. The CN2 induction algorithm. Mach. Learn., 3(4):261–283, 1989.
3. Cohen (1995). W. W. Cohen. Fast Effective Rule Induction. In ICML 1995, pages 115–123. Morgan Kaufmann, 1995.
4. Demšar et al. (2013). J. Demšar, T. Curk, A. Erjavec, et al. Orange: Data Mining Toolbox in Python. J. Mach. Learn. Res., 14(1):2349–2353, 2013.
5. Fürnkranz et al. (2012). J. Fürnkranz, D. Gamberger, and N. Lavrač. Foundations of Rule Learning. Springer-Verlag, Berlin, Heidelberg, 2012.
6. Hajja et al. (2014). A. Hajja, Z. W. Ras, and A. Wieczorkowska. Hierarchical object-driven action rules. J. Intell. Inf. Syst., 42(2):207–232, 2014.
7. Holmes et al. (1999). G. Holmes, M. Hall, and E. Frank. Generating Rule Sets from Model Trees. In IJCAI 1991, pages 1–12. Springer, 1999.
8. Michalski (1969). R. S. Michalski. On the quasi-minimal solution of the general covering problem. In FCIP 69, volume A 3, pages 125–128, 1969.