Learning what matters - Sampling interesting patterns
Vladimir Dzyuba, Matthijs van Leeuwen

TL;DR
This paper introduces LetSIP, a novel interactive pattern sampling algorithm that learns user interests through feedback, enabling efficient, personalized data exploration with improved quality and diversity of discovered patterns.
Contribution
It presents a new sampling approach combining weighted sampling and learning to rank, allowing direct adaptation to user interests in pattern mining.
Findings
LetSIP outperforms state-of-the-art methods in quality-diversity trade-offs.
The system enables efficient, user-specific, anytime data exploration.
It effectively learns user preferences through feedback during pattern sampling.
[Table: frequent-pattern datasets used in the experiments: anneal, australian, german, heart, hepatitis, lymph, primary, soybean, vote, zoo.]
[Table: effect of LetSIP's parameters on regret (avg., max.). Parameters and settings: query size (5, 10); features (I, ILF, ILFT); range (0.5, 0.1); query retention (0, 1, 2, 3); cell sampling (Random, Top(1), Top(2), Top(3)).]
[Table: comparison of LetSIP, IPM, and APLe with respect to regret and joint entropy.]
Vladimir Dzyuba: Department of Computer Science, KU Leuven, Belgium
Matthijs van Leeuwen: LIACS, Leiden University, The Netherlands
Abstract
In the field of exploratory data mining, local structure in data can be described by patterns and discovered by mining algorithms. Although many solutions have been proposed to address the redundancy problems in pattern mining, most of them either provide succinct pattern sets or take the interests of the user into account—but not both. Consequently, the analyst has to invest substantial effort in identifying those patterns that are relevant to her specific interests and goals.
To address this problem, we propose a novel approach that combines pattern sampling with interactive data mining. In particular, we introduce the LetSIP algorithm, which builds upon recent advances in 1) weighted sampling in SAT and 2) learning to rank in interactive pattern mining. Specifically, it exploits user feedback to directly learn the parameters of the sampling distribution that represents the user’s interests.
We compare the performance of the proposed algorithm to the state of the art in interactive pattern mining by emulating the interests of a user. The resulting system allows efficient and interleaved learning and sampling, and thus user-specific, anytime data exploration. Finally, LetSIP demonstrates favourable trade-offs concerning both quality–diversity and exploitation–exploration when compared to existing methods.
1 Introduction
† This document is an extended version of a conference publication [14].
Imagine a data analyst who has access to a medical database containing information about patients, diagnoses, and treatments. Her goal is to identify novel connections between patient characteristics and treatment effects. For example, one treatment may be more effective than another for patients of a certain age and occupation, even though the latter is more effective at large. Here, age and occupation are latent factors that explain the difference in treatment effect.
In the field of exploratory data mining, such hypotheses are represented by patterns [1] and discovered by mining algorithms. Informally, a pattern is a statement in a formal language that concisely describes the structure of a subset of the data. Unfortunately, in any realistic database the interesting and/or relevant patterns tend to get lost among a humongous number of patterns.
The solutions that have been proposed to address this so-called pattern explosion, caused by enumerating all patterns satisfying given constraints, can be roughly clustered into four categories: 1) condensed representations [10], 2) pattern set mining [9], 3) pattern sampling [5], and, most recently, 4) interactive pattern mining [20]. As expected, each of these categories has its own strengths and weaknesses, and there is no ultimate solution as of yet.
That is, condensed representations, e.g., closed itemsets, can be lossless but usually still yield large result sets; pattern set mining and pattern sampling can provide succinct pattern sets but do not take the analyst into account; and existing interactive approaches take the user into account but do not adequately address the pattern explosion. Consequently, the analyst has to invest substantial effort in identifying those patterns that are relevant to her specific interests and goals, which often requires extensive data mining expertise.
Aims and contributions. Our overarching aim is to enable analysts, such as the one described in the medical scenario above, to discover small sets of patterns from data that they consider interesting. This translates to the following three specific requirements. First, we require our approach to yield concise and diverse result sets, effectively avoiding the pattern explosion. Second, our method should take the user’s interests into account and ensure that the results are relevant. Third, it should achieve this with limited effort on behalf of the user.
To satisfy these requirements, we propose an approach that combines pattern sampling with interactive data mining techniques. In particular, we introduce the LetSIP algorithm, for Learn to Sample Interesting Patterns, which follows the Mine, Interact, Learn, Repeat framework [13]. It samples a small set of patterns, receives feedback from the user, exploits the feedback to learn new parameters for the sampling distribution, and repeats these steps. As a result, the user may utilize a compact diverse set of interesting patterns at any moment, blurring the boundaries between learning and discovery modes.
We satisfy the first requirement by using a sampling technique that samples high-quality patterns with high probability. While sampling does not guarantee diversity per se, we demonstrate that it gives concise yet diverse results in practice. Moreover, sampling has the advantage that it is anytime, i.e., the result set can grow at the user’s request. LetSIP’s sampling component is based on recent advances in sampling in SAT [12] and their extension to pattern sampling [15].
The second requirement is satisfied by learning what matters to the user, i.e., by interactively learning the distribution patterns are sampled from. This allows the user to steer the sampler towards subjectively interesting regions. We build upon recent work [13, 7] that uses preference learning to learn to rank patterns.
Although user effort can partially be quantified by the total amount of input that needs to be given during the analysis, the third requirement also concerns the time that is needed to find the first interesting results. For this it is of particular interest to study the trade-off between exploitation and exploration. As mentioned, one of the benefits of interactive pattern sampling is that the boundaries between learning and discovery are blurred, meaning that the system keeps learning while it continuously aims to discover potentially interesting patterns.
We evaluate the performance of the proposed algorithm and compare it to the state-of-the-art in interactive pattern mining by emulating the interests of a user. The results confirm that the proposed algorithm has the capacity to learn what matters based on little feedback from the user. More importantly, the LetSIP algorithm demonstrates favourable trade-offs concerning both quality–diversity and exploitation–exploration when compared to existing methods.
2 Interactive pattern mining: Problem definition
Recall the medical analyst example. We assume that after inspecting patterns, she can judge their interestingness, e.g., by comparing two patterns. Then the primary task of interactive pattern mining consists in learning a formal model of her interests. The second task involves using this model to mine novel patterns that are subjectively interesting to the user (according to the learned model).
Formally, let $\mathcal{D}$ denote a dataset, $\mathcal{L}$ a pattern language, $\mathcal{C}$ a (possibly empty) set of constraints on patterns, and $\succ$ the unknown subjective pattern preference relation of the current user over $\mathcal{L}$, i.e., $p_1 \succ p_2$ implies that the user considers pattern $p_1$ subjectively more interesting than pattern $p_2$.
Problem 1 (Learning)
Given $\mathcal{D}$, $\mathcal{L}$, and $\mathcal{C}$, dynamically collect feedback $U$ with respect to patterns in $\mathcal{L}$ and use $U$ to learn a (subjective) pattern interestingness function $h: \mathcal{L} \to \mathbb{R}$ such that $h(p_1) > h(p_2) \Leftrightarrow p_1 \succ p_2$.
The mining task should account for the potential diversity of user’s interests. For example, the analyst may (unwittingly) be interested in several unrelated treatments with disparate latent factors. An algorithm should be able to identify and mine patterns that are representative of these diverse hypotheses.
Problem 2 (Mining)
Given $\mathcal{D}$, $\mathcal{L}$, $\mathcal{C}$, and $h$, mine a set of patterns $\mathcal{P}_h$ that maximizes a combination of interestingness $h$ and diversity of patterns.
The interestingness of $\mathcal{P}$ can be quantified by the average quality of its members, i.e., $\frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} h(p)$. Diversity measures quantify how different patterns in a set are from each other. Joint entropy $H_J$ is a common diversity measure [24] (see Section 4 for the definition).
3 Related work
In this paper, we focus on two classes of related work aimed at alleviating the pattern explosion, namely
- pattern sampling and
- interactive pattern mining.
Pattern sampling. The first pattern samplers were based on Markov Chain Monte Carlo (MCMC) random walks over the pattern lattice [5, 17, 4]. Their main advantage is that they support “black box” distributions, i.e., they do not require any prior knowledge about the target distribution, a property essential for interactive exploration. However, they often converge slowly to the desired target distribution and require the selection of the “right” proposal distributions.
Samplers that are based on alternative approaches include direct two-step samplers and XOR samplers. Two-step samplers [6, 8], while provably accurate and efficient, only support a limited number of distributions and thus cannot be easily extended to interactive settings. Flexics [15] is a recently proposed pattern sampler based on the latest advances in weighted constrained sampling in SAT [12]. It supports black-box target distributions, provides guarantees with respect to sampling accuracy and efficiency, and has been shown to be competitive with the state-of-the-art methods described above.
Interactive pattern mining. Most recent approaches to interactive pattern mining are based on learning to rank patterns. They first appeared in Xin et al. [25] and Rueping [22] and were independently extended by Boley et al. [7] and Dzyuba et al. [13]. The central idea behind these algorithms is to alternate between mining and learning. Priime [3] focuses on advanced feature construction for interactive mining of structured data, e.g., sequences or graphs.
To the best of our knowledge, IPM [2] is the only existing approach to interactive itemset sampling. It uses binary feedback (“likes” and “dislikes”) to update weights of individual items. Itemsets are sampled proportional to the product of weights of constituent items. Thus, the model of user interests in IPM is fairly restricted; moreover, it potentially suffers from convergence issues typical for MCMC. We empirically compare LetSIP with IPM in Section 6.
4 Preliminaries
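Pattern mining and sampling. We focus on itemset mining, i.e., pattern mining for binary data. Let $\mathcal{I} = \{1 \ldots M\}$ denote a set of items. Then, a dataset $\mathcal{D}$ is a bag of transactions over $\mathcal{I}$, where each transaction $t$ is a subset of $\mathcal{I}$, i.e., $t \subseteq \mathcal{I}$; $\mathcal{T} = \{1 \ldots N\}$ is a set of transaction indices. The pattern language $\mathcal{L}$ also consists of sets of items, i.e., $\mathcal{L} = 2^{\mathcal{I}}$. An itemset $p$ occurs in a transaction $t$, iff $p \subseteq t$. The frequency of $p$ is the proportion of transactions in which it occurs, i.e., $freq(p) = |\{t \in \mathcal{D} \mid p \subseteq t\}| / N$. In labeled datasets, each transaction $t$ has a label from $\{-, +\}$; $freq^{-,+}$ are defined accordingly.
Given an (arbitrarily ordered) pattern set $\mathcal{P} = (p_1, \ldots, p_k)$ of size $k$, its diversity can be measured using joint entropy $H_J$, which essentially quantifies the overlap of sets of transactions, in which the patterns in $\mathcal{P}$ occur. Let $[\cdot]$ denote the Iverson bracket, $b = (b_1, \ldots, b_k) \in \{0, 1\}^k$ a binary $k$-tuple, and $P_{\mathcal{P}}(b) = \frac{1}{N} |\{t \in \mathcal{D} \mid \forall i \in \{1, \ldots, k\}: [p_i \subseteq t] = b_i\}|$ the fraction of transactions in $\mathcal{D}$ covered only by patterns in $\mathcal{P}$ that correspond to non-zero elements of $b$ (e.g., if $k = 3$ and $b = (1, 0, 1)$, we only count the transactions covered by the 1st and the 3rd pattern and not covered by the 2nd pattern). Joint entropy is defined as $H_J(\mathcal{P}) = -\sum_{b \in \{0, 1\}^k} P_{\mathcal{P}}(b) \log_2 P_{\mathcal{P}}(b)$. $H_J$ is measured in bits and bounded from above by $k$. The higher the joint entropy, the more diverse are the patterns in $\mathcal{P}$ in terms of their occurrences in $\mathcal{D}$.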
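The definitions of frequency and joint entropy above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the toy dataset and the function names `frequency` and `joint_entropy` are assumptions made for the example.

```python
from collections import Counter
from math import log2

def frequency(pattern, dataset):
    """Proportion of transactions that contain every item of `pattern`."""
    return sum(pattern <= t for t in dataset) / len(dataset)

def joint_entropy(patterns, dataset):
    """Joint entropy (in bits) of the occurrence indicators of `patterns`."""
    # Each transaction is mapped to a binary tuple b with b_i = [p_i is a subset of t].
    signatures = Counter(
        tuple(int(p <= t) for p in patterns) for t in dataset
    )
    n = len(dataset)
    # Tuples that never occur contribute nothing to the sum, so they are skipped.
    return -sum((c / n) * log2(c / n) for c in signatures.values())

# Hypothetical toy dataset: four transactions over the items a, b, c.
dataset = [frozenset("ab"), frozenset("abc"), frozenset("c"), frozenset("bc")]
patterns = [frozenset("a"), frozenset("c")]

print(frequency(frozenset("b"), dataset))  # 0.75
print(joint_entropy(patterns, dataset))    # 1.5
```

With two patterns the joint entropy cannot exceed 2 bits; the value of 1.5 reflects that the two patterns partially overlap in the transactions they cover.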
The choice of constraints and a quality measure allows a user to express her analysis requirements. The most common constraint is minimal frequency $freq(p) \geq \theta$. In contrast to hard constraints, quality measures are used to describe soft preferences that make it possible to rank patterns; see Section 6 for examples.
While common mining algorithms return the top-$k$ patterns w.r.t. a measure $\varphi: \mathcal{L} \to \mathbb{R}^+$, pattern sampling is a randomized procedure that ‘mines’ a pattern with probability proportional to its quality, i.e., $P_\varphi(p \text{ is sampled}) = \varphi(p) / Z_\varphi$, if $p \in \mathcal{L}$ satisfies $\mathcal{C}$, and $0$ otherwise, where $Z_\varphi$ is the (unknown) normalization constant. This is an instance of weighted constrained sampling.
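To make the target distribution concrete, the following sketch samples itemsets with probability proportional to a quality measure by brute-force enumeration. It is only feasible for tiny datasets and is not how Flexics works; it merely illustrates what "proportional to quality, subject to constraints" means. The function names and the toy data are assumptions.

```python
import random
from itertools import combinations

def frequency(pattern, dataset):
    """Proportion of transactions that contain every item of `pattern`."""
    return sum(pattern <= t for t in dataset) / len(dataset)

def frequent_itemsets(dataset, theta):
    """Enumerate all non-empty itemsets with frequency >= theta (brute force)."""
    items = sorted(set().union(*dataset))
    return [
        frozenset(c)
        for r in range(1, len(items) + 1)
        for c in combinations(items, r)
        if frequency(frozenset(c), dataset) >= theta
    ]

def sample_patterns(dataset, theta, quality, k, rng):
    """Draw k patterns i.i.d. with probability proportional to quality(p)."""
    candidates = frequent_itemsets(dataset, theta)
    # random.choices normalises the weights internally, so the
    # normalization constant never has to be computed explicitly.
    weights = [quality(p) for p in candidates]
    return rng.choices(candidates, weights=weights, k=k)

# Hypothetical toy dataset; quality is taken to be the frequency itself.
dataset = [frozenset("ab"), frozenset("abc"), frozenset("c"), frozenset("bc")]
rng = random.Random(0)
sample = sample_patterns(dataset, 0.5, lambda p: frequency(p, dataset), 5, rng)
print([sorted(p) for p in sample])
```

Patterns that violate the frequency constraint never enter the candidate list, which corresponds to the zero-probability case in the definition.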
Weighted constrained sampling. This problem has been extensively studied in the context of sampling solutions of a SAT problem [21]. WeightGen [12] is a recent algorithm for approximate weighted sampling in SAT. The core idea consists of partitioning the solution space into a number of “cells” and sampling a solution from a random cell. Partitioning with desired properties is obtained via augmenting the SAT problem with uniformly random XOR constraints (XORs).
To sample a solution, WeightGen dynamically estimates the number of XORs required to obtain a suitable cell, generates random XORs, stores the solutions of the augmented problem (i.e., a random cell), and returns a perfect weighted sample from the cell. Owing to the properties of partitioning with uniformly random XORs, WeightGen provides theoretical performance guarantees regarding quality of samples and efficiency of the sampling procedure.
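The cell-based idea can be illustrated with a toy sketch. It filters an explicitly enumerated solution list with random XOR constraints and then draws a weighted sample from the surviving cell. Unlike WeightGen, it takes the number of XORs as a fixed input, uses no solver oracle, and offers no guarantees; all names are hypothetical.

```python
import random
from itertools import product

def random_xor(n_vars, rng):
    """A uniformly random XOR constraint: (variable subset, parity bit)."""
    subset = [v for v in range(n_vars) if rng.random() < 0.5]
    return subset, rng.randrange(2)

def satisfies(bits, xor):
    """True if the XOR of the selected bits equals the required parity."""
    subset, parity = xor
    return sum(bits[v] for v in subset) % 2 == parity

def sample_from_random_cell(solutions, weight, n_xors, rng):
    """Pick a random cell via n_xors XORs, then sample within it by weight."""
    n_vars = len(solutions[0])
    xors = [random_xor(n_vars, rng) for _ in range(n_xors)]
    cell = [s for s in solutions if all(satisfies(s, x) for x in xors)]
    if not cell:
        # An empty cell carries no sample; a caller would retry with new XORs.
        return None, cell
    chosen = rng.choices(cell, weights=[weight(s) for s in cell], k=1)[0]
    return chosen, cell

# Solutions: non-empty itemsets over 4 items, encoded as bit tuples.
solutions = [bits for bits in product((0, 1), repeat=4) if any(bits)]
rng = random.Random(7)
# The weight of a solution is its number of items (an arbitrary toy choice).
sample, cell = sample_from_random_cell(solutions, weight=sum, n_xors=2, rng=rng)
print(len(cell), sample)
```

Each XOR constraint removes about half of the solutions in expectation, so two constraints leave roughly a quarter of the 15 solutions in the cell.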
For implementation purposes, WeightGen only requires an efficient oracle that enumerates solutions. Moreover, it treats the target sampling distribution as a black box: it requires neither a compact description thereof, nor the knowledge of the normalization constant. Both features are crucial in pattern sampling settings. Flexics [15], a recently proposed pattern sampler based on WeightGen, has been shown to be accurate and efficient. See Appendix A for a more detailed description of these algorithms.
Preference learning. The problem of learning ranking functions is known as object ranking [19]. A common solving technique involves minimizing pairwise loss, e.g., the number of discordant pairs. For example, user feedback $U = \{p_1 \succ p_3 \succ p_2, p_4 \succ p_2, \ldots\}$ is seen as $\{(p_1 \succ p_3), (p_1 \succ p_2), (p_3 \succ p_2), (p_4 \succ p_2), \ldots\}$. Given feature representations of objects $\vec{p}_i$, object ranking is equivalent to positive-only classification of difference vectors, i.e., a ranked pair example $p_i \succ p_j$ corresponds to a classification example $(\vec{p}_i - \vec{p}_j, +)$. All pairs comprise a training dataset for a scoring classifier. Then, the predicted ranking of any set of objects can be obtained by sorting these objects by classifier score descending. For example, this formulation is adopted by SvmRank [18].
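The reduction from rankings to classification examples can be sketched as follows. The pattern names and feature values are made up for illustration, and the helper names are assumptions.

```python
from itertools import combinations

def ranking_to_pairs(ranking):
    """Expand a total order (best first) into all implied (better, worse) pairs."""
    return list(combinations(ranking, 2))

def difference_examples(ranking, features):
    """One positive classification example per ranked pair: vec(p_i) - vec(p_j)."""
    return [
        [a - b for a, b in zip(features[better], features[worse])]
        for better, worse in ranking_to_pairs(ranking)
    ]

# Hypothetical feature vectors for three patterns.
features = {"p1": [1.0, 0.25], "p2": [0.0, 1.0], "p3": [1.0, 0.5]}

print(ranking_to_pairs(["p1", "p3", "p2"]))
# [('p1', 'p3'), ('p1', 'p2'), ('p3', 'p2')]
print(difference_examples(["p1", "p3", "p2"], features))
# [[0.0, -0.25], [1.0, -0.75], [1.0, -0.5]]
```

A ranking of $k$ patterns yields $k(k-1)/2$ pairs, so even small queries produce a useful number of training examples.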
5 Algorithm
Key questions concerning instantiations of the Mine, Interact, Learn, Repeat framework include
- the feedback format,
- learning quality measures from feedback,
- mining with learned measures, and crucially,
- selecting the patterns to show to the user.

As pattern sampling has been shown to be effective in mining and learning, we present LetSIP, a sampling-based instantiation of the framework which employs Flexics. The sequel describes the mining and learning components of LetSIP. Algorithm 1 shows its pseudocode.
Mining patterns by sampling. Recall that the main goal is to discover patterns that are subjectively interesting to a particular user. We use parameterised logistic functions to measure the interestingness/quality of a given pattern $p$:
$$\varphi_{logistic}(p; w, A) = A + \frac{1 - A}{1 + e^{-w \cdot \vec{p}}}$$
where $\vec{p}$ is the vector of pattern features for $p$, $w$ are feature weights, and $A$ is a parameter that controls the range of the interestingness measure, i.e. $\varphi_{logistic} \in (A, 1)$. Examples of pattern features include $Length(p) = |p| / |\mathcal{I}|$, $Frequency(p) = freq(p)$, $Items(i, p) = [i \in p]$; and $Transactions(t, p) = [p \subseteq t]$, where $[\cdot]$ denotes the Iverson bracket. Weights reflect feature contributions to pattern interestingness, e.g., a user might be interested in combinations of particular items or disinterested in particular transactions. The set of features would typically be chosen by the mining system designer rather than by the user herself. We empirically evaluate several feature combinations in Section 6.
Specifying feature weights manually is tedious and opaque, if at all possible. Below we present an algorithm that learns the weights based on easy-to-provide feedback with respect to patterns. This motivates our choice of logistic functions: they enable efficient learning. Furthermore, their bounded range $(A, 1)$ yields distributions that allow efficient sampling directly proportional to $\varphi_{logistic}$ with Flexics. Parameter $A$ essentially controls the tilt of the distribution [15].
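A small sketch of such a quality measure, assuming it has the logistic form $A + (1 - A) / (1 + e^{-w \cdot \vec{p}})$ and using item, length, and frequency features. The feature layout, function names, and toy data are assumptions made for illustration.

```python
from math import exp

def pattern_features(pattern, items, dataset):
    """Feature vector: one Items(i, p) indicator per item, then Length, Frequency."""
    item_feats = [1.0 if i in pattern else 0.0 for i in items]
    length = len(pattern) / len(items)
    freq = sum(pattern <= t for t in dataset) / len(dataset)
    return item_feats + [length, freq]

def phi_logistic(x, w, A):
    """A + (1 - A) / (1 + exp(-w.x)); values lie strictly between A and 1."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return A + (1.0 - A) / (1.0 + exp(-score))

# Hypothetical toy data.
items = ["a", "b", "c"]
dataset = [frozenset("ab"), frozenset("abc"), frozenset("c"), frozenset("bc")]
x = pattern_features(frozenset("ab"), items, dataset)

print(x[:3], x[4])                            # [1.0, 1.0, 0.0] 0.5
print(phi_logistic(x, [0.0] * len(x), 0.5))   # 0.75
# A user who likes item 'a' and dislikes item 'c' (weights are made up):
print(phi_logistic(x, [2.0, 0.0, -2.0, 0.0, 0.0], 0.5) > 0.75)  # True
```

With all-zero weights every pattern receives the midpoint of the range, so sampling is uniform over the patterns that satisfy the constraints until feedback moves the weights.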
User interaction & learning from feedback. Following previous research [13], we use ordered feedback, where a user is asked to provide a total order over a (small) number of patterns according to their subjective interestingness; see Figure 6 for an example. We assume that there exists an unknown, user-specific target ranking $R^*$, i.e., a total order over $\mathcal{L}$. The inductive bias is that there exists $w^*$ such that $p \succ q \Rightarrow \varphi_{logistic}(p; w^*, A) > \varphi_{logistic}(q; w^*, A)$. We apply the reduction of object ranking to binary classification of difference vectors (see Section 4). Following Boley et al. [7], we use Stochastic Coordinate Descent (SCD) [23] for minimizing L1-regularized logistic loss. However, unlike Boley et al., we directly use the learned functions for sampling.
SCD is an anytime convex optimization algorithm, which makes it suitable for the interactive setting. Its runtime scales linearly with the number of training pairs and the dimensionality of feature vectors. It has two parameters:
- the number of weight updates (per iteration of LetSIP) and
- the regularization parameter $\lambda$.

However, direct learning of $\varphi_{logistic}$ is infeasible, as it results in a non-convex loss function. We therefore use SCD to optimize the standard logistic loss, which is convex, and use the learned weights $w$ in $\varphi_{logistic}$.
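The learning step can be sketched as a simplified coordinate-wise update for L1-regularised logistic loss on difference vectors. This is an illustration under stated assumptions, not necessarily the exact update rule of [23]: it assumes all feature differences lie in $[-1, 1]$, so that 1/4 bounds the second derivative of the loss along each coordinate, and it applies a gradient step followed by soft-thresholding. All names are hypothetical.

```python
import random
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def scd_l1_logistic(X, lam, n_updates, rng, w=None):
    """Coordinate descent on mean log(1 + exp(-w.x)) + lam * ||w||_1.

    X holds difference vectors (all labelled positive), entries in [-1, 1].
    Passing the previous weights as `w` gives a warm start between queries.
    """
    d = len(X[0])
    w = list(w) if w is not None else [0.0] * d
    beta = 0.25  # upper bound on the per-coordinate curvature of the loss
    for _ in range(n_updates):
        j = rng.randrange(d)  # pick one coordinate uniformly at random
        # Partial derivative of the average logistic loss w.r.t. w_j.
        g = sum(-x[j] * sigmoid(-dot(w, x)) for x in X) / len(X)
        # Gradient step on coordinate j, then soft-thresholding for the L1 term.
        z = w[j] - g / beta
        shrunk = max(abs(z) - lam / beta, 0.0)
        w[j] = shrunk if z > 0 else -shrunk
    return w

# Two hypothetical difference vectors: the user prefers feature 0 over feature 1.
X = [[1.0, 0.0], [0.5, -0.5]]
w = scd_l1_logistic(X, lam=0.01, n_updates=200, rng=random.Random(0))
print(all(dot(w, x) > 0 for x in X))  # True: both ranked pairs are ordered correctly
```

Each update touches every training pair once and a single coordinate, which matches the linear scaling in the number of pairs noted above; the loop can be stopped after any number of updates.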
Selecting patterns to show to the user. An interactive system seeks to ensure faster learning of accurate models by targeted selection of patterns to show to the user; this is known as active learning or query selection. Randomized methods have been successfully applied to this task [13]. Furthermore, in large pattern spaces the probability that two redundant patterns are sampled in one (small) batch is typically low. Therefore, a sampler, which produces independent samples, typically ensures diversity within batches and thus sufficient exploration. We directly show $k$ patterns sampled by Flexics proportional to $\varphi_{logistic}$ to the user, for which she has to provide a total order as feedback.
We propose two modifications to Flexics, which aim at emphasising exploitation, i.e., biasing sampling towards higher-quality patterns. First, we employ alternative cell sampling strategies. Normally Flexics draws a perfect weighted random sample, once it obtains a suitable cell. We denote this strategy as Random. We propose an alternative strategy Top($m$), which picks the $m$ highest-quality patterns from a cell (Line 15 in Algorithm 1). We hypothesize that, owing to the properties of random XOR constraints, patterns in a cell as well as in consecutive cells are expected to be sufficiently diverse and thus the modified cell sampling does not disrupt exploration.
Rigorous analysis of (unweighted) uniform sampling by Chakraborty et al. shows that re-using samples from a cell still ensures broad coverage of the solution space, i.e., diversity of samples [11]. Although as a downside, consecutive samples are not i.i.d., the effects are bounded in theory and inconsequential in practice. We use these results to take license to modify the theoretically motivated cell sampling procedure. Although we do not present a similar theoretical analysis of our modifications, we evaluate them empirically.
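The difference between the two cell sampling strategies can be sketched as follows. The cell is given as a plain list of patterns and the quality function is arbitrary; the function name and the string labels for the strategies are assumptions, not identifiers from Flexics.

```python
import random

def select_from_cell(cell, quality, strategy, m, rng):
    """Pick patterns from one cell: a weighted random draw or the m best."""
    if strategy == "random":
        # Random: one sample with probability proportional to quality.
        return rng.choices(cell, weights=[quality(p) for p in cell], k=1)
    if strategy == "top":
        # Top(m): the m highest-quality patterns of the cell, best first.
        return sorted(cell, key=quality, reverse=True)[:m]
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical cell; quality is simply the pattern length here.
cell = [frozenset("a"), frozenset("abc"), frozenset("ab")]
rng = random.Random(3)

top2 = select_from_cell(cell, len, "top", 2, rng)
print([sorted(p) for p in top2])  # [['a', 'b', 'c'], ['a', 'b']]
print(select_from_cell(cell, len, "random", 1, rng)[0] in cell)  # True
```

Top($m$) is deterministic given the cell, so all randomness, and hence all exploration, comes from the random XOR constraints that define the cell.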
Second, we propose to retain the top $l$ patterns from the previous query and only sample $k - l$ new patterns (Lines 9–10). This should help users to relate the queries to each other and possibly exploit the structure in the pattern space.
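Query retention can be sketched as below, assuming a query of size `k` of which `l` patterns are carried over. The helper `sample_new` stands in for the sampler, and the removal of duplicates and the bound on sampling rounds are assumptions of this sketch rather than details stated in the text.

```python
def next_query(previous_ranking, sample_new, k, l, max_rounds=100):
    """Keep the l best patterns of the previous query and add k - l new ones.

    previous_ranking: the previous query as ordered by the user, best first.
    sample_new(n):    hypothetical callable returning n freshly sampled patterns.
    """
    query = list(previous_ranking[:l])
    for _ in range(max_rounds):
        if len(query) == k:
            break
        # Ask only for as many patterns as are still missing.
        for p in sample_new(k - len(query)):
            if p not in query:  # skip patterns that are already in the query
                query.append(p)
    return query

# Deterministic stand-in for the sampler, for illustration only.
pool = iter(["p2", "p7", "p8", "p9"])
query = next_query(
    ["p2", "p5", "p1"], lambda n: [next(pool) for _ in range(n)], k=3, l=1
)
print(query)  # ['p2', 'p7', 'p8']
```

Setting `l = 0` recovers the plain behaviour in which every query consists of freshly sampled patterns only.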
6 Experiments
The experimental evaluation focuses on 1) the accuracy of the learned user models and 2) the effectiveness of learning and sampling. Evaluating interactive algorithms is challenging, for domain experts are scarce and it is hard to gather enough experimental data to draw reliable conclusions. In order to perform extensive evaluation, we emulate users using (hidden) interest models, which the algorithm is supposed to learn from ordered feedback only.
References
- [1] Aggarwal, C.C., Han, J. (eds.): Frequent Pattern Mining. Springer (2014)
- [2] Bhuiyan, M., Hasan, M.A.: Interactive knowledge discovery from hidden data through sampling of frequent patterns. Statistical Analysis and Data Mining: The ASA Data Science Journal 9(4), 205–229 (2016)
- [3] Bhuiyan, M., Hasan, M.A.: PRIIME: A generic framework for interactive personalized interesting pattern discovery. In: Proceedings of IEEE Big Data. pp. 606–615 (2016)
- [4] Boley, M., Gärtner, T., Grosskreutz, H.: Formal concept sampling for counting and threshold-free local pattern mining. In: Proceedings of SDM. pp. 177–188 (2010)
- [5] Boley, M., Grosskreutz, H.: Approximating the number of frequent sets in dense data. Knowledge and Information Systems 21(1), 65–89 (2009)
- [6] Boley, M., Lucchese, C., Paurat, D., Gärtner, T.: Direct local pattern sampling by efficient two-step random procedures. In: Proceedings of KDD. pp. 582–590 (2011)
- [7] Boley, M., Mampaey, M., Kang, B., Tokmakov, P., Wrobel, S.: One Click Mining – interactive local pattern discovery through implicit preference and performance learning. In: Workshop Proceedings of KDD. pp. 28–36 (2013)
- [8] Boley, M., Moens, S., Gärtner, T.: Linear space direct pattern sampling using coupling from the past. In: Proceedings of KDD. pp. 69–77 (2012)
