Knowledge Refinement via Rule Selection
Phokion G. Kolaitis, Lucian Popa, and Kun Qian

TL;DR
This paper investigates the computational complexity of selecting optimal rule subsets for knowledge refinement, focusing on minimizing errors in data transformation and entity resolution tasks, and explores bi-objective optimization challenges.
Contribution
It provides a systematic complexity-theoretic analysis of rule selection problems, establishing hardness results and exploring bi-objective optimization complexities.
Findings
Decision problems are computationally hard (NP-hard, DP-complete).
Approximation bounds are established for the minimization problem.
Bi-objective optimization testing is DP-complete.
| Problem | False Positive Errors | False Positive + False Negative Errors |
|---|---|---|
| Rule-Select(a,r) | NP-complete | NP-complete |
| Exact Rule-Select(a,r) | DP-complete | DP-complete |
| Min Rule-Select(a,r), approximation upper bound | | |
| Min Rule-Select(a,r), approximation lower bound | …, for every … | …, for every … |
| Pareto Opt Solution(a,r) | coNP-complete | coNP-complete |
| Pareto Front Membership(a,r) | DP-complete | DP-complete |
| Bi-level Opt Solution(a,r) | coNP-complete | coNP-complete |
| Bi-level Opt Value(a,r) | DP-complete | DP-complete |
Knowledge Refinement via Rule Selection
Phokion G. Kolaitis (1,2), Lucian Popa (2), Kun Qian (2)
(1) UC Santa Cruz, (2) IBM Research – Almaden
Abstract
In several different applications, including data transformation and entity resolution, rules are used to capture aspects of knowledge about the application at hand. Often, a large set of such rules is generated automatically or semi-automatically, and the challenge is to refine the encapsulated knowledge by selecting a subset of rules based on the expected operational behavior of the rules on available data. In this paper, we carry out a systematic complexity-theoretic investigation of the following rule selection problem: given a set of rules specified by Horn formulas, and a pair of an input database and an output database, find a subset of the rules that minimizes the total error, that is, the number of false positive and false negative errors arising from the selected rules. We first establish computational hardness results for the decision problems underlying this minimization problem, as well as upper and lower bounds for its approximability. We then investigate a bi-objective optimization version of the rule selection problem in which both the total error and the size of the selected rules are taken into account. We show that testing for membership in the Pareto front of this bi-objective optimization problem is DP-complete. Finally, we show that a similar DP-completeness result holds for a bi-level optimization version of the rule selection problem, where one minimizes first the total error and then the size.
Rules, typically expressed as Horn formulas, are ubiquitous in several different areas of computer science and artificial intelligence. For example, rules are the basic construct of (function-free) logic programs. In data integration (?) and data exchange (?), rules are known as GAV (global-as-view) constraints and are used to specify data transformations between a local (or source) schema and a global (or target) schema. In data mining, rules have many uses, including the specification of contextual preferences (?; ?). In entity resolution, rules have been used to specify blocking functions (?) and entity resolution algorithms (?).
Often, a large set of rules is generated automatically or semi-automatically, and the challenge is to refine the encapsulated knowledge by selecting a subset of rules based on the expected operational behavior of the rules on available data. Rule selection arises naturally in all aforementioned contexts and, in fact, in most contexts involving reasoning about data. Here, we present an example motivated by a real-life application in which we are building a knowledge base of experts in the medical domain, based on public data, and where entity resolution is one of the crucial first steps.
In entity resolution, the aim is to identify references of the same real-world entity across multiple records or datasets. Consider the scenario depicted in Figure 1, where the aim is to identify occurrences of the same author across research publications from PubMed (https://www.ncbi.nlm.nih.gov/pubmed).
This entity resolution task can be modeled using a source schema that includes a relation Author and a link schema that consists of a relation SameAuthor. Sample facts (records) over the source and the link relations are given in Figure 2. As in the frameworks of Markov Logic Networks (?), Dedupalog (?), and declarative entity linking (?), explicit link relations are used to represent entity resolution inferences. In particular, the SameAuthor fact in Figure 2 represents that an inference was made to establish that the author in position 1 of publication with pmid 19132421 is the same as the author in position 1 of publication with pmid 19135934.
For a given entity resolution task, there is typically a large set of candidate rules that may apply on the input data to form matches among the entities. For our concrete scenario, Figure 3 gives a sample of candidate matching rules. These rules involve the alignment of relevant attributes (e.g., lastname with lastname, affiliation with affiliation) and the subsequent application of similarity predicates, filters, and thresholds. The challenge is to find a subset of the candidate rules with the “right” combinations of predicates and thresholds that will lead to high precision and recall with respect to a given set of ground truth data. As an example, both Rule 1 and Rule 2 in Figure 3 generate a SameAuthor link between two author occurrences on two different publications, provided that the last names and first names are identical and provided that there is a sufficient number of common coauthors on the two publications. However, Rule 1, which checks for at least two common coauthors, may turn out to be too imprecise (i.e., may yield too many false positives), while Rule 2, which checks for at least three common coauthors, may result in fewer errors. Different rules may use different predicates in their premises. For example, Rules 5-8 exploit the Jaccard similarity of affiliation, but have different similarity thresholds. Only one of them (Rule 8, with similarity threshold of 50%) may achieve high enough precision.
Thus, the problem becomes how to select a set of rules that achieve high precision (i.e., minimize the number of false positives) and high recall (i.e., minimize the number of false negatives) with respect to a given set of ground truth data; furthermore, one would also like to select a compact (in terms of size) such set of rules.
Similar rule selection problems have been studied in several different contexts. In data exchange, (?) have investigated the mapping selection problem: given a set of rules expressing data transformations between a source and a target schema, and a pair of a source database and a target database, find a subset of the rules that minimizes the sum of the false positive errors, the false negative errors, and the sizes of the selected rules. (?) have investigated the view selection problem: given a materialized view, a database, and a collection of sets of rules on that database, find a set of rules in the collection that is as “close” to the view and as compact as possible. In data mining, (?; ?) investigated the problems of selecting contextual preference rules and association rules; these problems can be cast as variants of the rule selection problem considered here.
Summary of Results
We formalize the rule selection problem with rules specified by Horn formulas of first-order logic; the relation symbols in the premises of the Horn formulas come from a premise schema, while those in the conclusions come from a conclusion schema that is disjoint from the premise schema. This formalization captures rule selection problems in a variety of contexts.
An input to the rule selection problem consists of a finite set $\mathcal{C}$ of rules and a pair $(I,J)$ of a premise database $I$ and a conclusion database $J$ that represents ground truth. When a subset $\mathcal{C}'$ of $\mathcal{C}$ is evaluated on the premise instance $I$, it produces a conclusion instance $\mathrm{Eval}(\mathcal{C}',I)$. The set $\mathrm{Eval}(\mathcal{C}',I)\setminus J$ is the set of false positive errors, while the set $J\setminus\mathrm{Eval}(\mathcal{C}',I)$ is the set of false negative errors. We study the optimization problem Min Rule-Select$_{\mathrm{FP+FN}}$ in which, given $\mathcal{C}$ and $(I,J)$ as above, the goal is to find a subset $\mathcal{C}'$ of $\mathcal{C}$ so that the number of false positive and false negative errors is minimized. We also study the optimization problem Min Rule-Select$_{\mathrm{FP}}$ in which the goal is to find a subset $\mathcal{C}'$ of $\mathcal{C}$ so that the number of false positive errors is minimized and there are no false negative errors (this is meaningful when $J\subseteq\mathrm{Eval}(\mathcal{C},I)$).
To gauge the difficulty of these two optimization problems, we first examine their underlying decision problems. We show that the decision problems involving a bound on the error are NP-hard. We also show that the exact decision problems asking if the error is equal to a given value are DP-hard; in particular, they are both NP-hard and coNP-hard (thus, unlikely to be in NP $\cup$ coNP). In view of these hardness results, we focus on the approximation properties of the two rule selection optimization problems. We show that, in a precise sense, the false-positive variant Min Rule-Select$_{\mathrm{FP}}$ has the same approximation properties as the Red-Blue Set Cover problem, while Min Rule-Select$_{\mathrm{FP+FN}}$, which counts both kinds of errors, has the same approximation properties as the Positive-Negative Partial Set Cover problem. These results yield both polynomial-time approximation algorithms and lower bounds for the approximability of our problems.
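To make the connection with Red-Blue Set Cover more concrete, here is a minimal sketch that recasts a rule selection instance as a set cover instance. It assumes that the output of each rule has already been computed and is given as a Python set of facts; the dictionary `outputs`, the ground truth set `J`, and both function names are hypothetical and introduced only for illustration. Facts in the ground truth play the role of blue elements (all must be covered), and derivable facts outside the ground truth play the role of red elements (each covered one is a false positive). This illustrates the correspondence only and should not be read as the formal reduction used in the paper.

```python
from typing import Dict, Hashable, Iterable, Optional, Set, Tuple

def to_red_blue_instance(
    outputs: Dict[str, Set[Hashable]], J: Set[Hashable]
) -> Tuple[Set[Hashable], Set[Hashable], Dict[str, Set[Hashable]]]:
    """Return (red, blue, sets): blue = ground truth facts,
    red = derivable facts that are not in the ground truth."""
    derivable = set().union(*outputs.values())
    blue = set(J)
    red = derivable - blue
    sets = {name: set(facts) for name, facts in outputs.items()}
    return red, blue, sets

def red_cost(
    chosen: Iterable[str],
    sets: Dict[str, Set[Hashable]],
    red: Set[Hashable],
    blue: Set[Hashable],
) -> Optional[int]:
    """Number of red elements covered, or None if some blue element is uncovered."""
    covered = set().union(*(sets[name] for name in chosen))
    if not blue <= covered:
        return None  # a false negative remains, so the choice is infeasible
    return len(covered & red)

# Hypothetical instance: three rules given by their precomputed outputs.
outputs = {"r1": {"a", "b", "x"}, "r2": {"b", "c"}, "r3": {"c", "y", "z"}}
J = {"a", "b", "c"}
red, blue, sets = to_red_blue_instance(outputs, J)
cost_r1_r2 = red_cost(["r1", "r2"], sets, red, blue)
cost_r2 = red_cost(["r2"], sets, red, blue)
```

In this toy instance, choosing `r1` and `r2` covers every blue element and one red element, which matches a rule subset with no false negatives and one false positive.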
The preceding results focus on the minimization of the error produced by the selected rules. What if one wants to also take the size of the selected rules into account? Since error and size are qualitatively incomparable quantities, it is not meaningful to add them or even to take a linear combination of the two. Instead, we consider pairs of values of error and size that are Pareto optimal, that is, neither of these values can be decreased without increasing the other value at the same time. The Pareto front of an instance is the set of all Pareto optimal pairs. Even though the study of Pareto optimality has been a central theme of multi-objective optimization for decades, it appears that no such study has been carried out for rule selection problems in any of the contexts discussed earlier. Here, we initiate such a study and show that the following problem is DP-hard: given a set of rules, a pair of a premise database and a conclusion database, and a pair of integers, does this pair of integers belong to the Pareto front of the instance? We show this DP-hardness result for both Min Rule-Select$_{\mathrm{FP+FN}}$ and Min Rule-Select$_{\mathrm{FP}}$.
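As a small illustration of the notion of Pareto front, the sketch below filters a list of achievable (error, size) pairs down to the Pareto optimal ones, with both quantities minimized. The function name and the sample pairs are hypothetical; in an actual instance the achievable pairs come from the subsets of the given rules, and enumerating them may take exponential time.

```python
from typing import Iterable, List, Tuple

def pareto_front(pairs: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the Pareto optimal (error, size) pairs, sorted by error."""
    unique = set(pairs)
    front = []
    for error, size in unique:
        # A pair is dominated if another achievable pair is at least as good
        # in both coordinates and strictly better in at least one.
        dominated = any(
            e2 <= error and s2 <= size and (e2 < error or s2 < size)
            for e2, s2 in unique
        )
        if not dominated:
            front.append((error, size))
    return sorted(front)

# Hypothetical achievable (error, size) pairs for some instance.
candidates = [(0, 7), (1, 4), (1, 5), (3, 2), (4, 2), (6, 0)]
front = pareto_front(candidates)
```

Here `(1, 5)` is dominated by `(1, 4)` and `(4, 2)` is dominated by `(3, 2)`, so neither belongs to the front.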
Finally, we investigate a bi-level optimization version of rule selection, where one minimizes first the error and then the size. We show that the following problem is DP-hard: given a set of rules, a pair of a premise database and a conclusion database, and a pair of integers, is the first integer the minimum possible error and the second integer the minimum size of a subset of rules attaining that minimum error? This DP-hardness result holds for the bi-level optimization versions of both Min Rule-Select$_{\mathrm{FP+FN}}$ and Min Rule-Select$_{\mathrm{FP}}$.
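The bi-level objective can be read as a lexicographic minimization over achievable (error, size) pairs. The following sketch, with a hypothetical function name and hypothetical sample pairs, computes the bi-level optimum from an explicit list of such pairs; it says nothing about how to obtain the pairs efficiently.

```python
from typing import Iterable, Tuple

def bilevel_optimum(pairs: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Minimize the error first, then the size among minimum-error pairs."""
    pairs = list(pairs)
    min_error = min(error for error, _ in pairs)
    min_size = min(size for error, size in pairs if error == min_error)
    return (min_error, min_size)

# Hypothetical achievable (error, size) pairs for some instance.
candidates = [(1, 4), (1, 5), (3, 2), (4, 2), (6, 0)]
optimum = bilevel_optimum(candidates)
```

The bi-level optimum is always Pareto optimal: no achievable pair has a smaller error, and among pairs with that error none has a smaller size.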
The main results of this paper are summarized in a table that can be found in the section The Complexity of Error Minimization.
Related Work
We already mentioned that (?) studied the mapping selection problem in the context of data exchange. In addition to considering rules specified by Horn formulas (GAV constraints in data exchange), they also considered richer rules in which the conclusion involves existential quantification over a conjunction of atoms (GLAV - global and local as view - constraints in data exchange). They established that the mapping selection problem is NP-hard even for GAV constraints, but did not explore approximation algorithms for this optimization problem; instead, they designed an algorithm that uses probabilistic soft logic (?) to solve a relaxation of the mapping selection problem and then carried out a detailed experimental evaluation of this approach. Its many technical merits notwithstanding, the work of (?) suffers from a serious drawback, namely, the objective function of the mapping selection problem is defined to be the sum of the size of the rules and the error (the number of the false positives and the false negatives). As stated earlier, however, size and error are qualitatively different quantities, thus it is simply not meaningful to add them, the same way it is not meaningful to add dollars and miles if one is interested in a hotel room near the White House and is trying to minimize the cost of the room and the distance from the White House. This is why, to avoid this pitfall here, we first focus on error minimization alone and then study the Pareto optimality of pairs of size and error values.
In the rule selection problem, the aim is to select a set of rules from a larger set of candidate rules based on some given data. There is a large body of work on the problem of deriving a set of rules in data exchange and data integration from just one or more given data examples. There are several different approaches to this problem, including casting it as an optimization problem (?; ?), as a “fitting” problem (?; ?), as an interactive derivation problem (?), or as a learning problem (?; ?). Clearly, this is a related but rather different problem because, in contrast to the rule selection problem, no candidate rules are part of the input.
Basic Concepts and Algorithmic Problems
Schemas and instances. A schema R is a set of relation symbols, each with a specified arity indicating the number of its arguments. An R-instance is a set of relations whose arities match those of the corresponding relation symbols. An R-instance $I$ can be identified with the set of all facts $P(a_1,\ldots,a_k)$ such that $P$ is a relation symbol in R and $(a_1,\ldots,a_k)$ is a tuple in the relation of $I$ interpreting the relation symbol $P$.
Rules. Let S and T be two disjoint relational schemas. In the rest of the paper, we will refer to S as the premise schema and to T as the conclusion schema. A rule over S and T is a Horn formula of first-order logic of the form
$$\forall \mathbf{x}\,(\phi(\mathbf{x}) \rightarrow P(\mathbf{x})),$$
where the premise $\phi(\mathbf{x})$ is a conjunction of atoms over S and the conclusion $P(\mathbf{x})$ is a single atom over T with variables among those in $\mathbf{x}$. For example, the rule
$$\forall x\,\forall y\,\forall z\,(E(x,z)\wedge E(z,y)\rightarrow F(x,y))$$
asserts that $F$ contains all pairs of nodes connected via an $E$-path of length 2. For simplicity, we will be dropping the universal quantifiers, so that the preceding rule about paths of length 2 will be written as $E(x,z)\wedge E(z,y)\rightarrow F(x,y)$.
The atoms in the premises may contain constants or they may be built-in predicates, such as jaccardSim. However, none of the lower-bound complexity results established here uses such atoms, while the upper-bound complexity results hold true even in the presence of such atoms, provided the built-in predicates are polynomial-time computable.
The size of a rule $\rho$, denoted by $|\rho|$, is the number of atoms in the premise of $\rho$. The size of a collection $\mathcal{C}$ of rules, denoted by $|\mathcal{C}|$, is the sum of the sizes of the rules in $\mathcal{C}$.
Data example. A data example is a pair $(I,J)$, where $I$ is an instance over the premise schema S and $J$ is an instance over the conclusion schema T.
Rule evaluation. Given a rule $\rho$ and an instance $I$, we write $\mathrm{Eval}(\rho,I)$ to denote the result of evaluating the premise of $\rho$ on $I$ and then populating the conclusion of $\rho$ accordingly. For example, if $\rho$ is the rule $E(x,z)\wedge E(z,y)\rightarrow F(x,y)$ and $I$ is a graph, then $\mathrm{Eval}(\rho,I)$ is the set consisting of all pairs of nodes of $I$ connected via a path of length 2. If $\mathcal{C}$ is a set of rules, then $\mathrm{Eval}(\mathcal{C},I)$ is the set of facts $\bigcup_{\rho\in\mathcal{C}}\mathrm{Eval}(\rho,I)$. In data exchange, computing $\mathrm{Eval}(\mathcal{C},I)$ amounts to running the chase procedure (?).
In general, given a collection $\mathcal{C}$ of rules and an instance $I$, computing $\mathrm{Eval}(\mathcal{C},I)$ is an exponential-time task; the source of the exponentiality is the maximum number of atoms in the premises of the rules in $\mathcal{C}$ and the maximum arity of the relation symbols in the conclusions of the rules. If, however, both these quantities are bounded by constants, then $\mathrm{Eval}(\mathcal{C},I)$ is computable in polynomial time, according to the following fact (e.g., see (?)).
Proposition 1.
Let $a$ and $r$ be two fixed positive integers. Then the following problem is solvable in polynomial time: given an instance $I$ and a collection $\mathcal{C}$ of rules such that the maximum number of atoms in the premises of rules in $\mathcal{C}$ is at most $a$ and the maximum arity of the relation symbols in the conclusions of these rules is at most $r$, compute $\mathrm{Eval}(\mathcal{C},I)$.
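Rule evaluation can be sketched with a naive nested-loop join. The code below is an illustrative assumption about the representation, not the procedure used in the paper: an instance is a set of facts of the form `(relation, tuple_of_constants)`, a rule is a pair `(premise_atoms, conclusion_atom)` whose atoms contain only variables, and the names `eval_rule` and `eval_rules` are hypothetical. Each premise atom is matched against every fact, so a rule with at most $a$ premise atoms is evaluated with at most on the order of $|I|^a$ matching attempts, which is polynomial when $a$ is fixed.

```python
from typing import Dict, Iterable, List, Set, Tuple

Fact = Tuple[str, Tuple[object, ...]]   # (relation name, constants)
Atom = Tuple[str, Tuple[str, ...]]      # (relation name, variable names)
Rule = Tuple[List[Atom], Atom]          # (premise atoms, conclusion atom)

def eval_rule(rule: Rule, instance: Set[Fact]) -> Set[Fact]:
    """Evaluate the premise on the instance and populate the conclusion."""
    premise, (head_rel, head_vars) = rule
    results: Set[Fact] = set()

    def extend(i: int, binding: Dict[str, object]) -> None:
        if i == len(premise):
            # Every premise atom is matched: emit the conclusion fact.
            results.add((head_rel, tuple(binding[v] for v in head_vars)))
            return
        rel, variables = premise[i]
        for fact_rel, values in instance:
            if fact_rel != rel or len(values) != len(variables):
                continue
            new_binding = dict(binding)
            # setdefault binds a fresh variable or returns its earlier value.
            if all(new_binding.setdefault(v, c) == c
                   for v, c in zip(variables, values)):
                extend(i + 1, new_binding)

    extend(0, {})
    return results

def eval_rules(rules: Iterable[Rule], instance: Set[Fact]) -> Set[Fact]:
    """The output of a set of rules is the union of the outputs of its rules."""
    derived: Set[Fact] = set()
    for rule in rules:
        derived |= eval_rule(rule, instance)
    return derived

# The path rule E(x,z) AND E(z,y) -> F(x,y) on the path 1 -> 2 -> 3 -> 4.
path2: Rule = ([("E", ("x", "z")), ("E", ("z", "y"))], ("F", ("x", "y")))
graph: Set[Fact] = {("E", (1, 2)), ("E", (2, 3)), ("E", (3, 4))}
derived = eval_rules([path2], graph)
```

On this graph the rule derives the facts `F(1,3)` and `F(2,4)`. The sketch handles neither constants in atoms nor built-in predicates such as jaccardSim.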
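False positive errors and false negative errors. Given a collection $\mathcal{C}$ of rules and a data example $(I,J)$, a false positive error is a fact in $\mathrm{Eval}(\mathcal{C},I)$ that is not in $J$, while a false negative error is a fact in $J$ that is not in $\mathrm{Eval}(\mathcal{C},I)$. We write $\mathrm{FP}(\mathcal{C},(I,J))$ and $\mathrm{FN}(\mathcal{C},(I,J))$ (or, simply, FP and FN) for the sets of false positive and false negative errors of $\mathcal{C}$ with respect to $(I,J)$, that is,
$$\mathrm{FP}(\mathcal{C},(I,J)) = \mathrm{Eval}(\mathcal{C},I)\setminus J, \qquad \mathrm{FN}(\mathcal{C},(I,J)) = J\setminus\mathrm{Eval}(\mathcal{C},I).$$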
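In terms of sets of facts, both kinds of errors are plain set differences. The sketch below assumes that the output of the selected rules has already been computed; the function names and the sample facts are hypothetical.

```python
from typing import Hashable, Set

def false_positives(derived: Set[Hashable], J: Set[Hashable]) -> Set[Hashable]:
    """Facts produced by the rules that are not in the ground truth J."""
    return derived - J

def false_negatives(derived: Set[Hashable], J: Set[Hashable]) -> Set[Hashable]:
    """Facts in the ground truth J that the rules fail to produce."""
    return J - derived

# Hypothetical output of some rules and hypothetical ground truth.
derived = {("F", (1, 3)), ("F", (2, 4))}
J = {("F", (1, 3)), ("F", (1, 4))}
fp = false_positives(derived, J)
fn = false_negatives(derived, J)
total_error = len(fp) + len(fn)
```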
We will focus on the following two optimization problems concerning the minimization of the number of errors.
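Definition 1.
[Min Rule-Select$_{\mathrm{FP+FN}}$]
Input: A set $\mathcal{C}$ of rules and a data example $(I,J)$.
Goal: Find a subset $\mathcal{C}'\subseteq\mathcal{C}$ such that the sum of the number of false positive errors and the number of false negative errors of $\mathcal{C}'$ with respect to $(I,J)$ is minimized.
A feasible solution of a given instance $\mathcal{C}$, $(I,J)$ of Min Rule-Select$_{\mathrm{FP+FN}}(a,r)$ is a subset $\mathcal{C}'$ of $\mathcal{C}$. The error of $\mathcal{C}'$ with respect to $(I,J)$ is the sum of the number of false positive errors and the number of false negative errors, that is, $|\mathrm{Eval}(\mathcal{C}',I)\setminus J| + |J\setminus\mathrm{Eval}(\mathcal{C}',I)|$.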
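For very small inputs, this minimization problem can be solved by exhaustive search over all subsets of rules. The sketch below is a brute-force baseline written only to make the objective concrete; it runs in time exponential in the number of rules and is not an algorithm from the paper. It assumes that each rule is represented by its precomputed output on the premise instance (the hypothetical dictionary `outputs`), and the function name is likewise hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable, Set, Tuple

def min_rule_select_fp_fn(
    outputs: Dict[str, Set[Hashable]], J: Set[Hashable]
) -> Tuple[Tuple[str, ...], int]:
    """Return (subset of rule names, error) minimizing |FP| + |FN|."""
    names = sorted(outputs)
    # The empty subset derives nothing, so every fact of J is a false negative.
    best_subset: Tuple[str, ...] = ()
    best_error = len(J)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            derived = set().union(*(outputs[name] for name in subset))
            error = len(derived - J) + len(J - derived)
            if error < best_error:
                best_subset, best_error = subset, error
    return best_subset, best_error

# Hypothetical instance: three rules given by their precomputed outputs.
outputs = {"r1": {"a", "b", "x"}, "r2": {"b", "c"}, "r3": {"c", "y", "z"}}
J = {"a", "b", "c"}
best = min_rule_select_fp_fn(outputs, J)
```

In this toy instance the minimum error is 1. It is attained both by `r2` alone (one false negative) and by `r1` together with `r2` (one false positive); the search returns the first optimum it encounters, which is `r2` alone.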
Definition 2.
[Min Rule-Select$_{\mathrm{FP}}$]
Input: A set $\mathcal{C}$ of rules and a data example $(I,J)$ such that $J\subseteq\mathrm{Eval}(\mathcal{C},I)$.
Goal: Find a subset $\mathcal{C}'\subseteq\mathcal{C}$ such that the number of false negative errors is zero and the number of false positive errors of $\mathcal{C}'$ with respect to $(I,J)$ is minimized.
A feasible solution of a given instance $\mathcal{C}$, $(I,J)$ of Min Rule-Select$_{\mathrm{FP}}(a,r)$ is a subset $\mathcal{C}'$ of $\mathcal{C}$ such that $J\subseteq\mathrm{Eval}(\mathcal{C}',I)$. Feasible solutions always exist because $\mathcal{C}$ is one. The error of $\mathcal{C}'$ with respect to $(I,J)$ is the number of false positive errors, that is, $|\mathrm{Eval}(\mathcal{C}',I)\setminus J|$.
We do not consider Min Rule-Select$_{\mathrm{FN}}$, i.e., the optimization problem that aims to minimize the number of false negative errors. The reason is that there is a trivial solution to this problem, namely, we can select all the rules (if the number of false positive errors is not required to be zero) or select all the rules that produce no false positive errors (if the number of false positive errors is required to be zero).
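The feasibility requirement changes the search space: only subsets whose output contains all of the ground truth are candidates. The following brute-force sketch makes this explicit under the same assumptions as before, namely that each rule is given by its precomputed output in a hypothetical dictionary `outputs`. It is an exponential-time baseline for tiny inputs, not an algorithm from the paper.

```python
from itertools import combinations
from typing import Dict, Hashable, Set, Tuple

def min_rule_select_fp(
    outputs: Dict[str, Set[Hashable]], J: Set[Hashable]
) -> Tuple[Tuple[str, ...], int]:
    """Return (subset of rule names, number of false positives), minimizing
    false positives over subsets that produce no false negatives."""
    names = sorted(outputs)
    all_derived = set().union(*outputs.values())
    if not J <= all_derived:
        raise ValueError("J must be contained in the output of all the rules")
    # The full set of rules is always feasible, so start from it.
    best_subset: Tuple[str, ...] = tuple(names)
    best_error = len(all_derived - J)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            derived = set().union(*(outputs[name] for name in subset))
            if J <= derived and len(derived - J) < best_error:
                best_subset, best_error = subset, len(derived - J)
    return best_subset, best_error

# Hypothetical instance: three rules given by their precomputed outputs.
outputs = {"r1": {"a", "b", "x"}, "r2": {"b", "c"}, "r3": {"c", "y", "z"}}
J = {"a", "b", "c"}
best = min_rule_select_fp(outputs, J)
```

Here no single rule covers all of `J`, and the best feasible subset consists of `r1` and `r2`, with one false positive.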
To gauge the difficulty of solving an optimization problem, one often studies two decision problems that underlie the optimization problem at hand: a decision problem about bounds on the optimum value and a decision problem about the exact optimum value. We introduce these two problems for each of the optimization problems in Definitions 1 and 2.
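Definition 3.
Given a set $\mathcal{C}$ of rules, a data example $(I,J)$, and an integer $k$,
- Rule-Select$_{\mathrm{FP+FN}}$ asks: is there a subset $\mathcal{C}'$ of $\mathcal{C}$ such that the error of $\mathcal{C}'$ with respect to $(I,J)$ is at most $k$?
- Exact Rule-Select$_{\mathrm{FP+FN}}$ asks: is the optimum value of Min Rule-Select$_{\mathrm{FP+FN}}$ on $\mathcal{C}$ and $(I,J)$ equal to $k$?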
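Definition 4.
Given a set $\mathcal{C}$ of rules, a data example $(I,J)$ such that $J\subseteq\mathrm{Eval}(\mathcal{C},I)$, and an integer $k$,
- Rule-Select$_{\mathrm{FP}}$ asks: is there a subset $\mathcal{C}'\subseteq\mathcal{C}$ such that the number of false negative errors of $\mathcal{C}'$ with respect to $(I,J)$ is zero and the number of false positive errors of $\mathcal{C}'$ with respect to $(I,J)$ is at most $k$?
- Exact Rule-Select$_{\mathrm{FP}}$ asks: is the optimum value of Min Rule-Select$_{\mathrm{FP}}$ on $\mathcal{C}$ and $(I,J)$ equal to $k$?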
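The two kinds of decision problems are related in a simple way: the optimum equals a given bound exactly when some subset has error at most that bound and no subset has error at most the bound minus one. The first condition is an existential question and the second is the negation of one, which is the shape of problems in DP (intersections of an NP language with a coNP language). The brute-force sketch below illustrates this relationship for the variant that counts both false positives and false negatives; the function names and the dictionary `outputs` of precomputed rule outputs are hypothetical, and the code is exponential in the number of rules.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterator, Set

def subset_errors(
    outputs: Dict[str, Set[Hashable]], J: Set[Hashable]
) -> Iterator[int]:
    """Yield |FP| + |FN| for every subset of the rules."""
    names = sorted(outputs)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            derived = set().union(*(outputs[name] for name in subset))
            yield len(derived - J) + len(J - derived)

def rule_select(outputs: Dict[str, Set[Hashable]], J: Set[Hashable], k: int) -> bool:
    """Is there a subset of rules with error at most k?"""
    return any(error <= k for error in subset_errors(outputs, J))

def exact_rule_select(outputs: Dict[str, Set[Hashable]], J: Set[Hashable], k: int) -> bool:
    """Is the optimum error exactly k? (error <= k is achievable, error <= k-1 is not)"""
    return rule_select(outputs, J, k) and not rule_select(outputs, J, k - 1)

# Hypothetical instance: three rules given by their precomputed outputs.
outputs = {"r1": {"a", "b", "x"}, "r2": {"b", "c"}, "r3": {"c", "y", "z"}}
J = {"a", "b", "c"}
bounded = rule_select(outputs, J, 1)
exact = exact_rule_select(outputs, J, 1)
```

The same pattern applies to the false-positive variant once the search is restricted to subsets with no false negatives.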
The Complexity of Error Minimization
We will investigate the computational complexity of the decision problems introduced in Definitions 3 and 4 by considering parameterized versions of these problems, where the parameters are the maximum number of atoms in the premises of the rules and the maximum arity of the relation symbols in the conclusions of the rules.
Definition 5.
Let $a$ and $r$ be two fixed positive integers.
- Rule-Select$_{\mathrm{FP+FN}}(a,r)$ and Rule-Select$_{\mathrm{FP}}(a,r)$ are the restrictions of Rule-Select$_{\mathrm{FP+FN}}$ and Rule-Select$_{\mathrm{FP}}$, respectively, to inputs in which the maximum number of atoms in the premises of rules in the given set of rules is at most $a$ and the maximum arity of the relation symbols in the conclusions of these rules is at most $r$.
- The problems Exact Rule-Select$_{\mathrm{FP+FN}}(a,r)$, Exact Rule-Select$_{\mathrm{FP}}(a,r)$, Min Rule-Select$_{\mathrm{FP+FN}}(a,r)$, and Min Rule-Select$_{\mathrm{FP}}(a,r)$ are defined in an analogous way.
References
- [Agrawal, Rantzau, and Terzi 2006] Agrawal, R.; Rantzau, R.; and Terzi, E. 2006. Context-sensitive ranking. In Chaudhuri, S.; Hristidis, V.; and Polyzotis, N., eds., Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27-29, 2006, 383–394. ACM.
- [Alexe et al. 2011a] Alexe, B.; ten Cate, B.; Kolaitis, P. G.; and Tan, W. C. 2011a. Characterizing schema mappings via data examples. ACM Trans. Database Syst. 36(4):23:1–23:48.
- [Alexe et al. 2011b] Alexe, B.; ten Cate, B.; Kolaitis, P. G.; and Tan, W. C. 2011b. Designing and refining schema mappings via data examples. In Sellis, T. K.; Miller, R. J.; Kementsietsidis, A.; and Velegrakis, Y., eds., Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011, 133–144. ACM.
- [Arasu, Re, and Suciu 2009] Arasu, A.; Re, C.; and Suciu, D. 2009. Large-Scale Deduplication with Constraints using Dedupalog. In ICDE, 952–963.
- [Arora and Barak 2009] Arora, S., and Barak, B. 2009. Computational Complexity - A Modern Approach. Cambridge University Press.
- [Bach et al. 2017] Bach, S. H.; Broecheler, M.; Huang, B.; and Getoor, L. 2017. Hinge-loss Markov random fields and probabilistic soft logic. Journal of Machine Learning Research 18:109:1–109:67.
- [Bilenko, Kamath, and Mooney 2006] Bilenko, M.; Kamath, B.; and Mooney, R. J. 2006. Adaptive blocking: Learning to scale up record linkage. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), 18-22 December 2006, Hong Kong, China, 87–96. IEEE Computer Society.
- [Bonifati et al. 2017] Bonifati, A.; Comignani, U.; Coquery, E.; and Thion, R. 2017. Interactive mapping specification with exemplar tuples. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, 667–682. New York, NY, USA: ACM.
