Association rule mining and itemset-correlation based variants
Niels Mündler

Department of Informatics
Technische Universität München
Email: [email protected]
Abstract
Association rules express implication-formed relations among attributes in databases of itemsets. The apriori algorithm, the basis for most association rule mining algorithms, is presented. It works by pruning away rules that need not be evaluated, based on the user-specified minimum support and confidence. Additionally, variations of the algorithm are presented that enable it to handle quantitative attributes and to extract rules about generalizations of items while preserving the downward closure property that enables pruning. Intertransformation of the extensions is proposed for special cases.
Index Terms:
Data Mining, Quantitative, Generalized, Association Rule Mining
I Introduction
First introduced by Agrawal et al. in [1] as an extension for existing databases, association rules provide a means for discovering, in a large database of items that appear together, implications of the form "if $x_1, \dots, x_n$ are in the set, then $y_1, \dots, y_m$ are also in the set", associated with a measure for the probability that this implication holds. A first application domain emerged in the area of shopping, where digitalization made large amounts of such data available. Through the extraction of association rules, insight into consumer behaviour was to be gained.
The database contains a set of transactions, each of which contains all of the items bought by a customer at once. An association rule $\{\text{aubergine}, \text{charcoal}\} \Rightarrow \{\text{beer}\}$ means that when customers bought aubergines and charcoal, they also often bought beer. Buying beer, though, does not have to imply that either aubergines or charcoal are bought, for example when it is drunk with weisswurst for breakfast. Thus not all association rules are symmetrical. This rule is said to have a support of 10% if aubergine, charcoal and beer were contained in 10% of all transactions. The percentage of transactions that also contained beer among those that contained aubergine and charcoal is called confidence. Usually the user specifies a minimum support and a minimum confidence for extracted rules.
It can easily be seen that for a database with $n$ different items the number of possible association rules is exponential in $n$. Hence, based on the minimum support and confidence, sensible pruning mechanisms have to be used such that not many more rules are evaluated than are included in the result set. In the pioneering works of Agrawal et al. [1, 2], algorithms that perform well on large datasets are proposed, among them the apriori algorithm, which is explained in detail in Section II-D. In addition, common variations of the apriori algorithm are presented that make it possible to work on databases with quantitative data and with generalizations of the items. All of the presented variations preserve the downward closure property of the itemsets that are to be generated, making it possible to use the main pruning strategy of the apriori algorithm.
For related work, a very broad overview of data mining in databases in general is given by Chen et al. in [3], though with little focus on association rules.
II Association Rules
II-A Motivation
Consider the database of a supermarket. The management of the supermarket might be interested in which items often appear together in the shopping baskets of their customers. This information can then be used for strategic decisions. For example, the market might learn that when more aubergines are provided to the customers, more charcoal should be provided too. Or, if all rules of the form $X \Rightarrow \{\text{beer}\}$ were known, the sale of beer could be boosted by placing it near the items in $X$ or by reducing the price of the items in $X$. Of course, the management is only interested in the behavior of a significant number of customers and in implications that hold for a large proportion of the transactions where the left side is satisfied. In the following sections, a solution to this problem is described that was introduced by Agrawal and Srikant in [2].
II-B Formal definition
The definition is based on the definition introduced in [1]. For a set of attributes $I$, an association rule is a rule of the form $X \Rightarrow Y$ where $X, Y \subseteq I$ and $X \cap Y = \emptyset$. $X$ is called the antecedent and $Y$ the consequent of the rule, and the elements of those sets are called items. Sets of $k$ items are also called $k$-itemsets. An association rule $X \Rightarrow Y$ is said to be contained in a transaction or itemset $T$ if $X \cup Y \subseteq T$. Similarly, an itemset $X$ is contained in $T$ if $X \subseteq T$. The database or dataset $D$ is the set of all collected transactions. A rule or itemset has support $s$ if it is contained in $s\%$ of the transactions in the database. This can be used as a sign of statistical significance. Also, a rule has confidence $c$ if for $c\%$ of the transactions $T$ with $X \subseteq T$ also $Y \subseteq T$ holds, which means that the rule is contained in $c\%$ of the transactions that do contain the antecedent. It can be regarded as equivalent to $P(Y \mid X)$, the likelihood of $Y$ also "occurring" when $X$ is given, based on the database $D$.
Usually there is a user-defined minimum confidence and minimum support, such that all extracted association rules have a support of at least the minimum support and a confidence of at least the minimum confidence.
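The two measures can be made concrete with a minimal Python sketch. The toy database `DB` and the helper names `support` and `confidence` are assumptions introduced here for illustration only; they are not taken from [1].

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], db: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in db if items <= t) / len(db)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               db: List[Transaction]) -> float:
    """supp(X ∪ Y) / supp(X), the empirical estimate of P(Y | X)."""
    x = frozenset(antecedent)
    y = frozenset(consequent)
    return support(x | y, db) / support(x, db)


# Hypothetical toy database (an assumption, not data from the paper).
DB: List[Transaction] = [
    frozenset({"aubergine", "charcoal", "beer"}),
    frozenset({"aubergine", "charcoal"}),
    frozenset({"beer", "weisswurst"}),
    frozenset({"beer"}),
    frozenset({"aubergine", "charcoal", "beer"}),
]

rule_conf = confidence({"aubergine", "charcoal"}, {"beer"}, DB)
reverse_conf = confidence({"beer"}, {"aubergine", "charcoal"}, DB)
```

On this toy data the rule and its reverse have the same support but different confidence, which illustrates why association rules are not symmetrical.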
An itemset that has at least the minimally specified support is called a frequent itemset. An arbitrary total order on the attributes in the database is introduced, and all itemsets and transactions are regarded as tuples ordered with respect to this order.
II-C Problem decomposition
In the process of extracting all association rules that have minimum support and minimum confidence, an algorithm must
- Generate frequent itemsets
- Evaluate all association rules $X \Rightarrow Y$ where $X \cup Y$ is a frequent itemset, and keep those that satisfy minimum confidence and support
It suffices to generate frequent itemsets because all of the corresponding association rules have the same support and we are only interested in association rules which have at least minimum support. The apriori algorithm presents an efficient method for the generation of frequent itemsets by only considering combinations of smaller frequent itemsets. It is described in detail in Section II-D. A method for the efficient generation of association rules from the frequent itemsets is described in Section II-E.
II-D The Apriori Algorithm
The approach is based on the observation that every subset of an itemset $X$ has at least the same support as $X$. This can be seen easily, as every subset of the itemset is also contained in each transaction that contains $X$. It follows that if any itemset $X$ is not frequent, all larger sets that contain $X$ are also not frequent. Thus, for generating candidate frequent itemsets of size $k$ it suffices to consider candidate itemsets of size $k$ that are unions of frequent itemsets of size $k-1$. For each of the candidates, the actual support in the database is checked by scanning the whole database. After each scan, the actual frequent itemsets are used for the next iteration. The overall procedure is visualized in the accompanying figure, and the algorithm is shown in the apriori algorithm listing.
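The level-wise procedure can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the listing from [2]: the function name `apriori_frequent_itemsets` and the toy database are hypothetical, candidates are formed as plain unions of frequent itemsets, and support is counted per candidate rather than in one scan per level.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

Itemset = FrozenSet[str]


def apriori_frequent_itemsets(db: List[Itemset],
                              min_support: float) -> Dict[Itemset, float]:
    """Return all frequent itemsets with their support (level-wise search)."""
    n = len(db)

    def supp(candidate: Itemset) -> float:
        # Simplification: one pass over the database per candidate.
        return sum(1 for t in db if candidate <= t) / n

    singles = {frozenset([item]) for t in db for item in t}
    current = {c: supp(c) for c in singles}
    current = {c: s for c, s in current.items() if s >= min_support}

    frequent: Dict[Itemset, float] = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Candidates of size k are unions of two frequent (k-1)-itemsets.
        unions = {a | b for a in current for b in current if len(a | b) == k}
        # Downward closure: every (k-1)-subset must itself be frequent.
        candidates = {c for c in unions
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {c: supp(c) for c in candidates}
        current = {c: s for c, s in current.items() if s >= min_support}
    return frequent


# Hypothetical toy database with single-letter items.
TOY_DB = [frozenset("abc"), frozenset("abd"), frozenset("ac"),
          frozenset("bc"), frozenset("abc")]
result = apriori_frequent_itemsets(TOY_DB, min_support=0.5)
```

With a minimum support of 0.5, the item `d` is dropped in the first level, so no candidate containing `d` is ever counted.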
II-D1 Candidate generation
In order not to generate any itemset multiple times, two $(k-1)$-itemsets are only combined into a $k$-itemset if their first $k-2$ items are equal. This results in one unique way to construct a set from smaller sets. For example, ABCD will only be constructed from ABC and ABD, as all other combinations of 3-itemsets already differ in the first or second item. Additionally, this ensures that the result has at most size $k$. Hence, in the join phase of the candidate generation algorithm, candidate itemsets of size $k$ are generated by a join of the frequent itemsets of size $k-1$ on the condition of being equal in the first $k-2$ items and not being equal in the last item.
Assume that all frequent sets of size $k-1$ were already generated and let $C$ be a newly generated candidate of size $k$. Due to the above observation, if any subset $S \subset C$ of size $k-1$ is not among the already generated sets, $S$ has to be non-frequent. Then, $C$ is non-frequent too. Thus, in the prune step of the candidate generation algorithm it is checked whether all $(k-1)$-subsets of a newly generated itemset were already generated.
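The join and prune steps can be sketched as follows. Itemsets are represented as sorted tuples so that "the first $k-2$ items" is well defined; the function name `apriori_gen` and the example sets are assumptions made for this illustration.

```python
from itertools import combinations
from typing import Set, Tuple

Itemset = Tuple[str, ...]  # items stored in sorted order


def apriori_gen(frequent_prev: Set[Itemset]) -> Set[Itemset]:
    """Build candidate k-itemsets from the frequent (k-1)-itemsets."""
    prev = sorted(frequent_prev)
    candidates: Set[Itemset] = set()
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            # Join step: equal in all but the last item. Because `prev` is
            # sorted, a[-1] < b[-1] holds here, so each candidate is built
            # exactly once and stays sorted.
            if a[:-1] != b[:-1]:
                continue
            cand = a + (b[-1],)
            # Prune step: every (k-1)-subset must already be frequent.
            if all(s in frequent_prev
                   for s in combinations(cand, len(cand) - 1)):
                candidates.add(cand)
    return candidates


# Hypothetical frequent 3-itemsets.
L3 = {("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
      ("A", "C", "E"), ("B", "C", "D")}
C4 = apriori_gen(L3)
```

Here the join produces ABCD and ACDE; ACDE is pruned because its subset ADE is not among the frequent 3-itemsets, so only ABCD remains as a candidate.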
II-D2 Subset determination
Finally, it should be ensured that the comparison of frequent itemset candidates and transactions in the database is evaluated efficiently. For this, the candidates are stored in a hash tree where each node refers to either a set of candidate itemsets (leaf) or another node (inner node). At an inner node of depth $d$, the branch to follow is determined by the hash of the $d$th item in the candidate itemset. By recursively descending the hash tree for every suffix of a transaction (remainder) $T$, a set of candidate itemsets is reached, for each of which it is checked whether it is contained in $T$. If so, it is added to the answer set. If an itemset is contained in $T$, its first item is contained in $T$ too. By hashing on every suffix, every item in $T$ is tried as first item once, so no contained candidate can be missed. After each descent, only the remaining items need to be considered.
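A simplified version of this lookup is sketched below. It deviates from the hash tree in [2] in one respect that should be read as an assumption: the tree has a fixed depth and always hashes on the first `depth` items, instead of splitting leaves only when they overflow. The names `bucket`, `build_hash_tree` and `contained_candidates` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Itemset = Tuple[str, ...]  # items stored in sorted order


def bucket(item: str, n_buckets: int = 3) -> int:
    """Deterministic toy hash function (an assumption for this sketch)."""
    return sum(ord(ch) for ch in item) % n_buckets


def build_hash_tree(candidates: Iterable[Itemset], depth: int) -> Dict:
    """Nested dicts; level d is keyed by the hash of the d-th item."""
    root: Dict = {}
    for cand in candidates:
        node = root
        for d in range(depth - 1):
            node = node.setdefault(bucket(cand[d]), {})
        node.setdefault(bucket(cand[depth - 1]), []).append(cand)
    return root


def contained_candidates(tree: Dict, transaction: Iterable[str],
                         depth: int) -> Set[Itemset]:
    """All stored candidates that are subsets of the transaction."""
    t = tuple(sorted(transaction))
    t_set = set(t)
    found: Set[Itemset] = set()

    def descend(node: Dict, start: int, d: int) -> None:
        # Try every remaining item of the transaction as the d-th item.
        for i in range(start, len(t)):
            child = node.get(bucket(t[i]))
            if child is None:
                continue
            if d == depth - 1:
                # Leaf reached: verify containment explicitly.
                found.update(c for c in child if t_set.issuperset(c))
            else:
                descend(child, i + 1, d + 1)

    descend(tree, 0, 0)
    return found


# Hypothetical candidate 2-itemsets and one transaction.
CANDIDATES: List[Itemset] = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
tree = build_hash_tree(CANDIDATES, depth=2)
matches = contained_candidates(tree, {"A", "C", "D"}, depth=2)
```

Only the leaves reachable through hashes of the transaction's own items are inspected, which is what saves work compared with testing every candidate against every transaction.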
II-E Discovering Rules from frequent itemsets
As the confidence can be seen as equivalent to $P(Y \mid X)$, $\mathrm{conf}(X \Rightarrow Y)$ is computed by dividing $\mathrm{supp}(X \cup Y)$ by $\mathrm{supp}(X)$. When the support of each itemset is stored in the itemset generation process, this computation can be done quickly. Still, the number of association rules that can be extracted from each frequent itemset may be quite large.
Naively, to discover all rules holding in a frequent itemset $l$, for every non-empty subset $a \subset l$ it would have to be evaluated whether the rule $a \Rightarrow (l \setminus a)$ has minimum confidence. If this is done for all frequent itemsets, rules that cover only a part of $l$ are also checked, as every subset of $l$ is also a frequent itemset.
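A lot of confidence tests can be pruned. First,
$$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}.$$
Using the similarity to probability, it follows for a frequent itemset $l$ and a consequent $c \subset l$ that
$$\mathrm{conf}\big((l \setminus c) \Rightarrow c\big) = \frac{\mathrm{supp}(l)}{\mathrm{supp}(l \setminus c)}.$$
If a subset $\tilde{c} \subset c$ is inserted instead of $c$, it can be seen that $\mathrm{supp}(l \setminus \tilde{c})$ does not increase, as $(l \setminus \tilde{c}) \supset (l \setminus c)$. Thus the confidence of the rule does not decrease. Thus if $(l \setminus c) \Rightarrow c$ does hold, all rules $(l \setminus \tilde{c}) \Rightarrow \tilde{c}$ must also hold. As in Section II-D, the combination of rules with sufficient confidence can now be used to generate candidate rules with larger consequents.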
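The growth of consequents can be sketched as follows: starting from 1-item consequents, only consequents whose rule had sufficient confidence are combined into larger ones. The function name `generate_rules` and the support values are assumptions chosen for illustration; the supports are consistent with a five-transaction toy database.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]
Rule = Tuple[Itemset, Itemset, float]  # antecedent, consequent, confidence


def generate_rules(itemset: Itemset, supp: Dict[Itemset, float],
                   min_conf: float) -> List[Rule]:
    """Rules (l \\ c) => c from one frequent itemset l, growing c stepwise."""
    l = frozenset(itemset)
    rules: List[Rule] = []
    consequents = [frozenset([i]) for i in l]
    m = 1
    while consequents and m < len(l):
        kept = set()
        for c in consequents:
            conf = supp[l] / supp[l - c]
            if conf >= min_conf:
                rules.append((l - c, c, conf))
                kept.add(c)
        m += 1
        # A consequent of size m is a candidate only if all of its
        # (m-1)-subsets were consequents of accepted rules.
        consequents = list({
            a | b for a in kept for b in kept
            if len(a | b) == m
            and all(frozenset(s) in kept for s in combinations(a | b, m - 1))
        })
    return rules


# Hypothetical supports of {a, b, c} and all of its non-empty subsets.
SUPP = {
    frozenset("a"): 0.8, frozenset("b"): 0.6, frozenset("c"): 0.6,
    frozenset("ab"): 0.6, frozenset("ac"): 0.4, frozenset("bc"): 0.4,
    frozenset("abc"): 0.4,
}
rules = generate_rules(frozenset("abc"), SUPP, min_conf=0.6)
```

With these values all three single-consequent rules are accepted, and of the three 2-item consequents only the one whose antecedent is `a` fails the confidence threshold.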
II-F Example
The apriori algorithm is briefly demonstrated on the example transaction database, whose attributes include Edam. Assume the user requests all association rules with a minimum support of 30% and a minimum confidence of 60%. For the initialization, the set $L_1$ of frequent 1-itemsets is generated. Only one transaction involves Edam. Its support is therefore below the specified minimum support, and $\{\text{Edam}\}$ is not included in the set $L_1$. From this set, the new set $C_2$ of candidate itemsets of size 2 is generated. As there are only sets of one item so far and no excluded items that could accidentally have been joined in, $C_2$ is simply the cross product of the above set with itself. Next, the whole database is scanned to compute the actual support of the generated candidates. It turns out that two of the combinations are too rare, but all remaining candidates satisfy the support condition.
The second iteration follows, where $k = 3$ and sets of size 3 are generated from the set $L_2$ of frequent 2-itemsets. For this, two frequent 2-itemsets that agree in their first item can be joined to form a 3-itemset, while two 2-itemsets whose first elements already differ are not joined (in this example, the latter case also arises because one of the two is not a frequent itemset). In the newly generated set we can still check for every element whether any of its subsets is non-frequent, in which case we can prune it. This is the case here, as one 2-itemset was not accepted in the previous iteration; the candidate containing it is pruned from the candidate set. After checking all valid combinations and ensuring the subset closure, we retrieve a single candidate of size 3. After a single scan of the database, we can confirm that it has sufficient support, and it is accepted as a frequent itemset. The overall frequent itemsets are now all of the determined frequent itemsets of all lengths.
The next step is the generation of association rules from the set of frequent itemsets. The procedure is shown for the frequent 3-itemset found above. First, the three single-consequent rules are generated and their confidence is computed. By coincidence, all of the rules are accepted. The new set of 2-item consequents is generated from the accepted consequents, being (compare itemset generation) all pairs of items from the itemset. By computing the confidence for each rule, two of these rules are retained and one is deleted. In the next iteration the procedure stops, as a consequent of size 3 would leave the antecedent empty.
II-G Interestingness Measures
As can be seen in the above example, even from small itemsets, large numbers of association rules can be extracted. Meanwhile, there may be quite a few redundant rules among them: knowing that certain rules hold, it might not be surprising that a further, closely related rule holds as well. Based on this, several interestingness measures have been proposed for pruning association rules from the set of generated rules. The need for filtering association rules becomes even more obvious when considering the case presented by Brin et al. in [4]: it is easy to construct cases where, due to a large overall support of an item, an association rule is generated even though the correlation is negative. This is demonstrated by the coffee and tea example, where the rule $\{\text{tea}\} \Rightarrow \{\text{coffee}\}$ is generated with a confidence that is quite high. Yet, when considering that the probability of any customer drinking coffee is even higher, it can be seen that this actually means a negative correlation between coffee and tea.
Interestingness measures can be based on the expected value of an extracted rule, and thus on redundancy or surprisingness, as well as on utility or actionability. Further methods are possible; for the above example, a chi-squared measure is proposed in [4]. A detailed overview is presented by Geng and Hamilton in [5].
Another, more theoretical approach is presented by Pasquier et al. in [6]: from the reduced set of closed itemsets (the closure of an itemset being the maximal superset that has the same support), a reduced set of association rules is generated. All original association rules can be derived from it, and it already serves as a human-understandable set of less redundant rules.
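How a rule with high confidence can still describe a negative correlation is sketched below. The four counts are illustrative assumptions and should not be read as the figures used in the paper; `rule_statistics` is a hypothetical helper. Lift (confidence divided by the overall probability of the consequent) and the chi-squared statistic of the 2x2 contingency table are computed next to support and confidence.

```python
from typing import Dict


def rule_statistics(both: int, only_x: int, only_y: int,
                    neither: int) -> Dict[str, float]:
    """Statistics of the rule X => Y from a 2x2 contingency table."""
    n = both + only_x + only_y + neither
    supp = both / n
    conf = both / (both + only_x)
    p_y = (both + only_y) / n
    lift = conf / p_y  # below 1 indicates a negative correlation

    observed = [[both, only_x], [only_y, neither]]
    row = [both + only_x, only_y + neither]  # X present, X absent
    col = [both + only_y, only_x + neither]  # Y present, Y absent
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return {"support": supp, "confidence": conf, "p_consequent": p_y,
            "lift": lift, "chi2": chi2}


# Assumed counts per 100 customers: tea and coffee, tea only,
# coffee only, neither.
stats = rule_statistics(both=20, only_x=5, only_y=70, neither=5)
```

With these assumed counts the rule tea ⇒ coffee reaches 80% confidence, yet 90% of all customers drink coffee, so the lift is below 1.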
III Quantitative Association Rules
Association rules consider only whether a product was bought or not. Quantitative attributes like amount or price are not considered at all. Still, there might be relations involving them, for example a rule whose antecedent is a large quantity of beer. This would express that large amounts of beer imply a grilling party. Alternatively, one can imagine the number of seconds customers spend in front of the shelf to be incorporated in the database.
The dataset of, e.g., shopping transactions is now extended to include not only whether a specific product was bought but also an associated quantity, e.g. the amount of the product bought. Rules taking into account these quantities, and especially all of their subranges, can be mined using the generic boolean association rule algorithm. For this, ranges of quantities are introduced in place of every quantitative attribute, and for each item and each generated range we store whether the quantity of the item lay inside the range or not. If the dataset contains, for example, transactions including 1, 2 or 5 litres of beer, and this quantity was stored in the attribute beer before, we introduce one boolean attribute for each interval: $\text{beer}_{[1,1]}$, $\text{beer}_{[1,2]}$, $\text{beer}_{[1,5]}$, $\text{beer}_{[2,2]}$, $\text{beer}_{[2,5]}$ and $\text{beer}_{[5,5]}$. If 2 litres of beer were bought, $\text{beer}_{[1,2]}$, $\text{beer}_{[1,5]}$, $\text{beer}_{[2,2]}$ and $\text{beer}_{[2,5]}$ are now items in the transaction. It can easily be seen that if all subranges are included, a quadratic number of ranges is generated. Even when restricting to the ranges that include the actual value, there are on average $O(n)$ ranges that include a specific value, where $n$ is the number of distinct values [7].
If too few subranges are included, it might happen that intervals that satisfy minimum support and confidence are excluded. When restricted to equally sized intervals, choosing narrow intervals could make the support for each interval too low. In contrast, if the intervals are too wide, the confidence might be reduced [7]. Lastly, if an association rule containing a subrange has minimum support, all ranges containing that subrange have minimum support as well, drastically increasing computation time. Thus it should be carefully decided which ranges to include.
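The transformation into boolean range items can be sketched as follows. The item naming scheme `attribute[lo,hi]` and the helper names `all_ranges` and `range_items` are assumptions introduced here for illustration.

```python
from typing import Iterable, List, Set, Tuple


def all_ranges(distinct_values: Iterable[float]) -> List[Tuple[float, float]]:
    """Every range [lo, hi] whose borders are observed values."""
    vals = sorted(set(distinct_values))
    return [(lo, hi) for i, lo in enumerate(vals) for hi in vals[i:]]


def range_items(attribute: str, value: float,
                distinct_values: Iterable[float]) -> Set[str]:
    """Boolean items for all ranges that contain the given quantity."""
    return {f"{attribute}[{lo},{hi}]"
            for lo, hi in all_ranges(distinct_values)
            if lo <= value <= hi}


# Quantities of beer observed in the database: 1, 2 or 5 litres.
ranges = all_ranges([1, 2, 5])
items_for_two_litres = range_items("beer", 2, [1, 2, 5])
```

For $n$ distinct values this yields $n(n+1)/2$ ranges in total, which is the quadratic growth mentioned above; a purchase of 2 litres maps to four of the six range items.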
III-A Formal Definition
In addition to the definitions in Section II-B, we define for each itemset a function assigning each item in the set its quantity. The quantity interval of attribute is defined as
III-B Proposed Algorithm
The algorithm proposed by Srikant and Agrawal in [7] introduces a user-defined maximum support and decomposes the transformation as follows:
- Determine the number of partitions for each quantity interval (see Section III-C).
- Map the values in each quantitative interval to consecutive integers such that the order of the values is preserved.
- Find the support for each value of the quantitative attributes and combine adjacent values that satisfy minimum support, as long as they do not exceed the specified maximum support.
- Transform the itemsets into boolean itemsets by replacing all quantitative attributes with the determined ranges.
After this procedure, the standard algorithm from Section II-D is applied to generate boolean association rules. In order to remove redundant rules regarding subintervals, interestingness measures can again be introduced.
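The mapping and combination steps above can be sketched as follows. This is one possible reading of the procedure, stated as an assumption rather than as the implementation from [7]: single values are kept whenever they reach minimum support, and a range of several adjacent values is kept only while its combined support stays at or below the maximum support. The function name `frequent_ranges` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def frequent_ranges(values: List[float], min_support: float,
                    max_support: float
                    ) -> Tuple[Dict[float, int], Dict[Tuple[int, int], float]]:
    """Map values to consecutive integers and combine adjacent values."""
    n = len(values)
    distinct = sorted(set(values))
    # Order-preserving mapping of the observed values to 0, 1, 2, ...
    code = {v: i for i, v in enumerate(distinct)}
    counts = Counter(code[v] for v in values)

    ranges: Dict[Tuple[int, int], float] = {}
    for lo in range(len(distinct)):
        total = 0
        for hi in range(lo, len(distinct)):
            total += counts[hi]
            s = total / n
            if hi > lo and s > max_support:
                break  # extending further only increases the support
            if s >= min_support:
                ranges[(lo, hi)] = s
    return code, ranges


# Hypothetical quantities of one attribute over ten transactions.
QUANTITIES = [1, 1, 2, 2, 2, 5, 5, 5, 5, 10]
code, ranges = frequent_ranges(QUANTITIES, min_support=0.25, max_support=0.6)
```

The resulting integer ranges can then be turned into boolean items, after which the boolean algorithm applies unchanged.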
III-C Optimal Interval Partitioning
In order to measure the optimality of the interval partitioning, so-called "$K$-partial-completeness" is introduced in [7]. The intuitive idea is that for each rule $R$ that would be obtained when considering all of the ranges over the involved quantitative attributes, the generalized rule $\hat{R}$ obtained by only considering the partitioned intervals should be as "close" to $R$ as defined by $K$. "Closeness" is defined by $\hat{R}$ having at most $K$ times the support of the rule $R$. Essentially, every rule obtained by the partitioning should contain as few other quantities in the dataset as possible.
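As Srikant et al. [7] have shown, the number of required partitions is
$$\text{number of intervals} = \frac{2n}{m\,(K-1)},$$
where $n$ is the number of quantitative attributes, $m$ is the minimum support (as a fraction) and $K$ is the partial completeness level.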
Assuming that each partition equally splits the support, a partitioning in intervals of equal size is generated.
This assumption need not always hold, as illustrated by the figure comparing equi-width and cluster-based partitioning, which is why the intervals can instead be generated in a data-sensitive way by diverse clustering approaches. As many of these approaches are based on continuous values, they are described together in the following subsection.
III-D Continuous Intervals
Considering continuous instead of discrete and finite quantitative attributes, there is an infinite number of interval borders that can be chosen. As an alternative to the equal-size approach, one can consider the available data when partitioning the interval. A first approach is an equal-frequency approach, where every partition contains the same number of data points. Advanced techniques apply a clustering of the feature interval, trying to group along values with high frequency. Examples of such procedures can be seen in [8, 9].
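The difference between equal-size and equal-frequency partitioning can be sketched in plain Python. Both helper functions and the sample values are assumptions made for illustration; clustering-based approaches such as those in [8, 9] are not covered by this sketch.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]


def equal_width_bins(values: Sequence[float], k: int) -> List[Interval]:
    """Split [min, max] into k intervals of the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]


def equal_frequency_bins(values: Sequence[float], k: int) -> List[Interval]:
    """Split the sorted data into k chunks with (almost) equal counts."""
    data = sorted(values)
    n = len(data)
    bins = []
    for i in range(k):
        chunk = data[i * n // k:(i + 1) * n // k]
        bins.append((chunk[0], chunk[-1]))
    return bins


# Hypothetical skewed quantities.
VALUES = [1, 2, 3, 4, 5, 6, 7, 20, 40, 100]
width_bins = equal_width_bins(VALUES, 2)
frequency_bins = equal_frequency_bins(VALUES, 2)
in_first_width_bin = sum(1 for v in VALUES if v < width_bins[0][1])
```

On the skewed sample, the equal-width split puts nine of ten values into the first interval, while the equal-frequency split gives each interval five values and thus the same support.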
III-E Fuzzy Association Rules
The previously introduced concept of transforming ranges into items can lead to problems at the border of an interval. If, for example, an interval border lies at 3.0 litres of beer, is it so much more likely that a customer buys charcoal when buying 3.0 litres of beer instead of 2.9 litres? A way to circumvent this is to make the importance or representativeness of values inside an interval decrease with their proximity to the border of the interval. This is generally achieved by introducing fuzzy sets. Fuzzy value sets can overlap and have non-binary membership values. A detailed introduction to fuzzy association rules by Helm can be found in [10]. As can be seen by comparing the work of Tan [9] and of Thomas and Raju [11], different clustering techniques are still a topic of high importance in mining both fuzzy rules and quantitative rules.
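The effect of a fuzzy border can be sketched with a trapezoidal membership function, a common choice that is assumed here for illustration; the set "large amount of beer" and its corner points are hypothetical and not taken from [10].

```python
def crisp_membership(x: float, lower: float, upper: float) -> float:
    """Classical interval: a value is either inside (1.0) or outside (0.0)."""
    return 1.0 if lower <= x <= upper else 0.0


def trapezoid_membership(x: float, a: float, b: float,
                         c: float, d: float) -> float:
    """Rises linearly from a to b, is 1 between b and c, falls from c to d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical fuzzy set "large amount of beer" (in litres).
def large_beer(x: float) -> float:
    return trapezoid_membership(x, 2.5, 3.5, 10.0, 11.0)


crisp_gap = crisp_membership(3.0, 3.0, 10.0) - crisp_membership(2.9, 3.0, 10.0)
fuzzy_gap = large_beer(3.0) - large_beer(2.9)
```

With a crisp border at 3.0 litres the membership jumps from 0 to 1 between 2.9 and 3.0 litres, whereas the fuzzy set changes only by about 0.1 over the same step.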
IV Generalizing Association Rules
IV-A Motivation
Consider again the database of the supermarket. The manager of the supermarket might be interested in how to arrange the items in the market such that all products from categories that are usually bought together can be found on nearby shelves.
Until now, a system can detect associations between specific products. For the shelf problem, one would need a rule over the generalizations of products. For example, instead of a rule about specific products, the corresponding rule about their categories might also hold. At the same time, while rules about specific items might already hold, the generalization to vegetables might not hold, as the items are often bought together for barbecue but do not make up the major part of vegetable-related transactions. We will see that, as proposed by Srikant and Agrawal in [12], the apriori algorithm can again be used. The procedure is described in the following subsections.
In addition to the dataset there now are taxonomies on the attributes of the database. Instead of a forest of trees, these taxonomies are combined into a directed acyclic graph.
IV-B Formal definition
In addition to the definitions in Section II-B, a directed acyclic graph $G$ with all the items of the dataset as leaves is given, the taxonomy graph. An item $x$ is called a specialization of $\hat{x}$, and $\hat{x}$ a generalization of $x$, if there is an edge in $G$ from $\hat{x}$ to $x$.
IV-C Basic Algorithm
The support for the generalization of an attribute is not necessarily the sum of the supports of its specializations. This has the simple reason that one transaction can contain several specializations of the same item and could already be seen in one of the motivating examples. Hence a modification of the known apriori algorithm is necessary.
The most basic approach to this problem is to extend every transaction $T$ to a transaction $T'$, containing all the items of $T$ and all of their ancestors. For each item in the transaction, all of the ancestors are added. Then, with the algorithm from Section II-D, association rules between these items can be extracted. This algorithm works, but is quite inefficient.
Some simple optimizations proposed in [12] can directly be included. First, when comparing transactions with candidate itemsets, it is sufficient to include only those ancestors that are elements of some candidate itemset. The generalizations of each item can be precomputed from $G$ at the beginning to save time.
A more sophisticated optimization is pruning all itemsets that contain both an item $x$ and one of its generalizations $\hat{x}$. The intuition is that, if a rule already contains an item, adding its generalization trivially does not reduce its support. More specifically, it does not add any meaningful information, so we can prune itemsets of that form (details can be found in [12]).
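The transaction extension and the ancestor-based pruning test can be sketched as follows. The taxonomy is given as a mapping from each item to its direct generalizations; this representation, the helper names and the example taxonomy are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

Taxonomy = Dict[str, List[str]]  # item -> direct generalizations (DAG)


def ancestors(item: str, parents: Taxonomy) -> Set[str]:
    """All generalizations reachable from `item` in the taxonomy graph."""
    seen: Set[str] = set()
    stack = list(parents.get(item, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen


def extend_transaction(transaction: Iterable[str],
                       parents: Taxonomy) -> FrozenSet[str]:
    """The transaction plus the ancestors of all of its items."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item, parents)
    return frozenset(extended)


def contains_item_and_ancestor(itemset: Iterable[str],
                               parents: Taxonomy) -> bool:
    """True if the itemset holds an item together with a generalization."""
    items = set(itemset)
    return any(ancestors(i, parents) & items for i in items)


# Hypothetical taxonomy and database.
PARENTS: Taxonomy = {"aubergine": ["vegetable"], "tomato": ["vegetable"],
                     "vegetable": ["food"], "beer": ["beverage"]}
DB = [frozenset({"aubergine", "tomato"}), frozenset({"beer"})]
EXTENDED_DB = [extend_transaction(t, PARENTS) for t in DB]
supp_vegetable = sum(1 for t in EXTENDED_DB if "vegetable" in t) / len(DB)
```

In this toy database the support of `vegetable` is 0.5, although the supports of `aubergine` and `tomato` add up to 1.0, because one transaction contains both specializations.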
IV-D Similarities to quantitative association rules
When comparing quantitative and generalized association rules, it might seem sensible to conduct quantitative association rule mining by transforming the quantitative attributes to items in a taxonomy. In the quantitative approach, an "optimal" partition of the overall interval is searched for, such that only the partition intervals have to be considered. Contrasting this with the generalization approach, a multi-level subinterval approach emerges by introducing each superinterval as a generalization of its subintervals in the taxonomy tree. This is of course technically equivalent to considering each possible superinterval for a value of an itemset. As already noted by Srikant et al. [7], each value lies in $O(n)$ subintervals when there are $n$ distinct values for the attribute. For few values of the quantitative attribute this may be useful, as it avoids loss of information. Otherwise, the use of efficient pruning techniques becomes even more important. Applying clustering to quantitative values may still be useful as a preprocessing step for drastically reducing the size of the generated taxonomy tree.
Similarly, in a tree-formed taxonomy, each leaf can be mapped to a number from left to right. All different leaf-level specializations are thus only regarded as a quantity of the top-level generalization. With this change, instead of considering items of every level, only groups of items over the hierarchy are considered in the form of an interval. Yet, due to the free choice of interval borders, any lowest common ancestor can end up being considered, regardless of its level in the taxonomy. Clustering may then be used to reveal which ancestors are most worth considering. It has to be noted that this is not applicable for multiple joined taxonomies that result in a non-tree-formed DAG, which could make creating non-overlapping and monotone intervals impossible. The usefulness of this approach has to be evaluated for each application individually, as the handling of quantitative rules entails a loss of information. On the other hand, this approach is able to handle large taxonomies (as described above, only intervals are considered instead of a number of generalizations in the original quantitative approach).
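The numbering of leaves from left to right can be sketched as follows. Each inner node of a tree-shaped taxonomy then corresponds to the interval of leaf numbers below it. The function name `leaf_intervals` and the example taxonomy are assumptions made for illustration, and the sketch only applies to trees, not to general DAGs.

```python
from typing import Dict, List, Tuple

Tree = Dict[str, List[str]]  # node -> ordered list of children


def leaf_intervals(tree: Tree, root: str) -> Dict[str, Tuple[int, int]]:
    """Number the leaves left to right; map each node to its leaf interval."""
    intervals: Dict[str, Tuple[int, int]] = {}
    counter = 0

    def visit(node: str) -> None:
        nonlocal counter
        children = tree.get(node, [])
        if not children:
            # A leaf covers exactly its own number.
            intervals[node] = (counter, counter)
            counter += 1
            return
        for child in children:
            visit(child)
        # An inner node covers everything from its first to its last leaf.
        intervals[node] = (intervals[children[0]][0],
                           intervals[children[-1]][1])

    visit(root)
    return intervals


# Hypothetical tree-shaped taxonomy.
TAXONOMY: Tree = {"food": ["vegetable", "cheese"],
                  "vegetable": ["aubergine", "tomato"],
                  "cheese": ["edam"]}
intervals = leaf_intervals(TAXONOMY, "food")
```

Every generalization becomes a contiguous interval, so an interval found by the quantitative approach can be mapped back to the lowest common ancestor of the leaves it covers.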
In both cases, advanced interestingness rules applied to each problem can be transferred to either variation.
V Use cases
Apart from providing information for market experts, association rules can also be used in recommender systems for new users, by recommending items that were frequently bought by others with similar shopping baskets [13], or for finding people who influence each other in social networks by finding associations between comments on posts [14].
Because of the set structure of association rules, they are not easily suited to order-dependent rule mining. Nor are they suitable for tight associations and predictions of new data points, as in interpolation. For such datasets, relations between value ranges could be discovered using quantitative rules, but then categories instead of values are associated.
VI Summary and Outlook
The apriori algorithm iteratively generates itemsets that increase in size step by step. In the process, it prunes the evaluation of many association rules by exploiting the downward closure property of support and confidence. It can be seen that the rules extracted should be evaluated by domain experts and at least checked against actual correlation before further usage. When applicable they can be used in many different domains, especially market analysis.
For quantitative attributes, the standard algorithm can be extended and efficiently improved by transforming ranges of values to single attributes. In generalized association rules, all itemsets are extended by the ancestors of all contained items. For non-diverse quantitative attributes or very large taxonomies it might even be suitable to convert either extension into the other.
In the future, drawbacks and benefits of this interconversion may be evaluated. The comparison can also be extended to include objective-oriented utility-based association rules as introduced by Shen et al. in [15]. Also, the extensions to allow for fuzzy sets and continuous intervals pose further challenges regarding sensible set operations and efficient pruning, and provide interesting aspects to be evaluated in an overview.
References
- [1] R. Agrawal, T. Imielinski, and A. N. Swami, “Mining association rules between sets of items in large databases,” in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, May 26-28, 1993, P. Buneman and S. Jajodia, Eds. ACM Press, 1993, pp. 207–216. [Online]. Available: https://doi.org/10.1145/170035.170072
- [2] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in VLDB’94, Proceedings of 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago de Chile, Chile, J. B. Bocca, M. Jarke, and C. Zaniolo, Eds. Morgan Kaufmann, 1994, pp. 487–499. [Online]. Available: http://www.vldb.org/conf/1994/P487.PDF
- [3] M. Chen, J. Han, and P. S. Yu, “Data mining: An overview from a database perspective,” IEEE Trans. Knowl. Data Eng., vol. 8, no. 6, pp. 866–883, 1996. [Online]. Available: https://doi.org/10.1109/69.553155
- [4] S. Brin, R. Motwani, and C. Silverstein, “Beyond market baskets: Generalizing association rules to correlations,” in SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA, J. Peckham, Ed. ACM Press, 1997, pp. 265–276. [Online]. Available: https://doi.org/10.1145/253260.253327
- [5] L. Geng and H. J. Hamilton, “Interestingness measures for data mining: A survey,” ACM Comput. Surv., vol. 38, no. 3, p. 9, 2006. [Online]. Available: https://doi.org/10.1145/1132960.1132963
- [6] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Discovering frequent closed itemsets for association rules,” in Database Theory - ICDT ’99, 7th International Conference, Jerusalem, Israel, January 10-12, 1999, Proceedings, ser. Lecture Notes in Computer Science, C. Beeri and P. Buneman, Eds., vol. 1540. Springer, 1999, pp. 398–416. [Online]. Available: https://doi.org/10.1007/3-540-49257-7_25
- [7] R. Srikant and R. Agrawal, “Mining quantitative association rules in large relational tables,” in Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, June 4-6, 1996, H. V. Jagadish and I. S. Mumick, Eds. ACM Press, 1996, pp. 1–12. [Online]. Available: https://doi.org/10.1145/233269.233311
- [8] M. Moreno García, S. Segrera, V. Batista, and M. Jose, “Improving the quality of association rules by preprocessing numerical data,” May 2019.
