Learning Fair Naive Bayes Classifiers by Discovering and Eliminating Discrimination Patterns

YooJung Choi; Golnoosh Farnadi; Behrouz Babaki; Guy Van den Broeck

arXiv:1906.03843·cs.LG·May 11, 2020

TL;DR

This paper introduces a method to ensure fairness in naive Bayes classifiers by discovering and removing discrimination patterns, allowing for fairer decision-making even with partial feature observations.

Contribution

It proposes a novel approach to identify and eliminate discrimination patterns in naive Bayes classifiers, enhancing fairness without extensive feature observation.

Findings

1. Successfully removes exponentially many discrimination patterns
2. Achieves fairness with minimal additional constraints
3. Demonstrates effectiveness on real-world datasets

Tables

Table 1: Data statistics (number of training instances, sensitive features $S$, non-sensitive features $N$, and potential patterns) and the proportion of patterns explored during the search, using the Divergence (Div.) and Discrimination (Disc.) scores as rankings. The last six columns give the proportion of the search space explored.

Dataset	Size	$S$	$N$	# Patterns	$k$	Div. $\delta = 0.01$	Div. $\delta = 0.05$	Div. $\delta = 0.10$	Disc. $\delta = 0.01$	Disc. $\delta = 0.05$	Disc. $\delta = 0.10$
COMPAS	48,834	4	3	15K	1	6.387e-01	5.634e-01	3.874e-01	8.188e-03	8.188e-03	8.188e-03
					10	7.139e-01	5.996e-01	4.200e-01	3.464e-02	3.464e-02	3.464e-02
					100	8.222e-01	6.605e-01	4.335e-01	9.914e-02	9.914e-02	9.914e-02
Adult	32,561	4	9	11M	1	3.052e-06	7.260e-06	1.248e-05	2.451e-04	2.451e-04	2.451e-04
					10	7.030e-06	1.154e-05	1.809e-05	2.467e-04	2.467e-04	2.467e-04
					100	1.458e-05	1.969e-05	2.509e-05	2.600e-04	2.600e-04	2.597e-04
German	1,000	4	16	23B	1	5.075e-07	2.731e-06	2.374e-06	7.450e-08	7.450e-08	7.450e-08
					10	9.312e-07	3.398e-06	2.753e-06	1.592e-06	1.592e-06	1.592e-06
					100	1.454e-06	4.495e-06	3.407e-06	5.897e-06	5.897e-06	5.897e-06

Table 2: Log-likelihood of models learned without fairness constraints, with the $\delta$-fair learner ($\delta=0.1$), and by making sensitive variables independent from the decision variable.

Dataset	Unconstrained	$\delta$-fair	Independent
COMPAS	-207,055	-207,395	-208,639
Adult	-226,375	-228,763	-232,180
German	-12,630	-12,635	-12,649

Table 3: Number of remaining patterns with $\delta=0.1$ in naive Bayes models trained on discrimination-free data, where $\lambda$ determines the trade-off between fairness and accuracy in the data repair step (?).

Dataset	$\lambda = 0.5$	$\lambda = 0.9$	$\lambda = 0.95$	$\lambda = 0.99$
COMPAS	2,504	2,471	2,470	3,069
Adult	>1e6	661	652	605
German	>1e6	3	2	0

Table 4: Accuracy of our $\delta$-fair models compared with the two-naive-Bayes method (2NB) and a naive Bayes model trained on repaired, discrimination-free data.

Dataset	Unconstrained	2NB	Repaired	$\delta$-fair
COMPAS	0.880	0.875	0.878	0.879
Adult	0.811	0.759	0.325	0.827
German	0.690	0.679	0.688	0.696

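Equations

$$\Delta_{P,d}(\mathbf{x},\mathbf{y})\triangleq P(d\,|\,\mathbf{x}\mathbf{y})-P(d\,|\,\mathbf{y}).$$

$$\min_{l\leq\gamma\leq u}\widetilde{\Delta}\left(P(\mathbf{x}\mathbf{x}_{l}^{\prime}\,|\,d),P(\mathbf{x}\mathbf{x}_{l}^{\prime}\,|\,\overline{d}),\gamma\right)\leq\Delta_{P,d}(\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime})\leq\max_{l\leq\gamma\leq u}\widetilde{\Delta}\left(P(\mathbf{x}\mathbf{x}_{u}^{\prime}\,|\,d),P(\mathbf{x}\mathbf{x}_{u}^{\prime}\,|\,\overline{d}),\gamma\right),$$

$$\operatorname{Div}_{P,d,\delta}(\mathbf{x},\mathbf{y})\triangleq\min_{Q}\operatorname{KL}\left(P\;\middle\|\;Q\right)\quad\text{s.t.}\quad\left\lvert\Delta_{Q,d}(\mathbf{x},\mathbf{y})\right\rvert\leq\delta,\quad P(d\mathbf{z})=Q(d\mathbf{z}),\ \forall\,d\mathbf{z}\not\models\mathbf{x}\mathbf{y}$$

$$\operatorname{Div}_{P,d,\delta}(\mathbf{x},\mathbf{y})=P(d\mathbf{x}\mathbf{y})\log\left(\frac{P(d\mathbf{x}\mathbf{y})}{P(d\mathbf{x}\mathbf{y})+r}\right)+P(\overline{d}\mathbf{x}\mathbf{y})\log\left(\frac{P(\overline{d}\mathbf{x}\mathbf{y})}{P(\overline{d}\mathbf{x}\mathbf{y})-r}\right),$$

$$\text{minimize}\ f_{0}(x),\quad\text{s.t.}\quad f_{i}(x)\leq 1,\quad g_{j}(x)=1\quad\forall\,i,j$$

$$r_{\mathbf{x}}=\frac{\prod_{x}\theta_{x\,|\,\bar{d}}}{\prod_{x}\theta_{x\,|\,d}},\qquad r_{\mathbf{y}}=\frac{\theta_{\bar{d}\,|\,}\prod_{y}\theta_{y\,|\,\bar{d}}}{\theta_{d\,|\,}\prod_{y}\theta_{y\,|\,d}}$$

$$\left(\frac{1-\delta}{\delta}\right)r_{\mathbf{x}}r_{\mathbf{y}}-\left(\frac{1+\delta}{\delta}\right)r_{\mathbf{y}}-r_{\mathbf{x}}r_{\mathbf{y}}^{2}\leq 1,$$

$$-\left(\frac{1+\delta}{\delta}\right)r_{\mathbf{x}}r_{\mathbf{y}}+\left(\frac{1-\delta}{\delta}\right)r_{\mathbf{y}}-r_{\mathbf{x}}r_{\mathbf{y}}^{2}\leq 1.$$

$$\begin{aligned}\Delta_{P,d}(\mathbf{x},\mathbf{y})&=P(d\,|\,\mathbf{x}\mathbf{y})-P(d\,|\,\mathbf{y})\\&=\frac{P(\mathbf{x}\,|\,d)P(d\mathbf{y})}{P(\mathbf{x}\,|\,d)P(d\mathbf{y})+P(\mathbf{x}\,|\,\overline{d})P(\overline{d}\mathbf{y})}-P(d\,|\,\mathbf{y})\\&=\frac{P(\mathbf{x}\,|\,d)P(d\,|\,\mathbf{y})}{P(\mathbf{x}\,|\,d)P(d\,|\,\mathbf{y})+P(\mathbf{x}\,|\,\overline{d})P(\overline{d}\,|\,\mathbf{y})}-P(d\,|\,\mathbf{y})\\&=\widetilde{\Delta}\left(P(\mathbf{x}\,|\,d),P(\mathbf{x}\,|\,\overline{d}),P(d\,|\,\mathbf{y})\right)\end{aligned}$$

$$\begin{aligned}\min_{l\leq\gamma\leq u}\widetilde{\Delta}\left(P(\mathbf{x}\,|\,d),P(\mathbf{x}\,|\,\overline{d}),\gamma\right)&\leq\widetilde{\Delta}\left(P(\mathbf{x}\,|\,d),P(\mathbf{x}\,|\,\overline{d}),P(d\,|\,\mathbf{y}\mathbf{y}^{\prime})\right)=\Delta_{P,d}(\mathbf{x},\mathbf{y}\mathbf{y}^{\prime})\\&\leq\max_{l\leq\gamma\leq u}\widetilde{\Delta}\left(P(\mathbf{x}\,|\,d),P(\mathbf{x}\,|\,\overline{d}),\gamma\right).\end{aligned}$$

$$\begin{aligned}\min_{l\leq\gamma\leq u}\widetilde{\Delta}\left(P(\mathbf{x}\mathbf{x}_{l}^{\prime}\,|\,d),P(\mathbf{x}\mathbf{x}_{l}^{\prime}\,|\,\overline{d}),\gamma\right)&\leq\widetilde{\Delta}\left(P(\mathbf{x}\mathbf{x}_{l}^{\prime}\,|\,d),P(\mathbf{x}\mathbf{x}_{l}^{\prime}\,|\,\overline{d}),P(d\,|\,\mathbf{y}\mathbf{y}^{\prime})\right)\\&=\Delta_{P,d}(\mathbf{x}\mathbf{x}_{l}^{\prime},\mathbf{y}\mathbf{y}^{\prime})\\&=P(d\,|\,\mathbf{x}\mathbf{x}_{l}^{\prime}\mathbf{y}\mathbf{y}^{\prime})-P(d\,|\,\mathbf{y}\mathbf{y}^{\prime})\\&\leq P(d\,|\,\mathbf{x}\mathbf{x}^{\prime}\mathbf{y}\mathbf{y}^{\prime})-P(d\,|\,\mathbf{y}\mathbf{y}^{\prime})\\&=\Delta_{P,d}(\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime})\\&\leq\Delta_{P,d}(\mathbf{x}\mathbf{x}_{u}^{\prime},\mathbf{y}\mathbf{y}^{\prime})\\&=\widetilde{\Delta}\left(P(\mathbf{x}\mathbf{x}_{u}^{\prime}\,|\,d),P(\mathbf{x}\mathbf{x}_{u}^{\prime}\,|\,\overline{d}),P(d\,|\,\mathbf{y}\mathbf{y}^{\prime})\right)\\&\leq\max_{l\leq\gamma\leq u}\widetilde{\Delta}\left(P(\mathbf{x}\mathbf{x}_{u}^{\prime}\,|\,d),P(\mathbf{x}\mathbf{x}_{u}^{\prime}\,|\,\overline{d}),\gamma\right).\end{aligned}$$

$$\frac{d}{d\gamma}\widetilde{\Delta}_{\alpha,\beta}(\gamma)=\frac{\alpha\beta}{(\alpha\gamma+\beta(1-\gamma))^{2}}-1,\qquad\frac{d^{2}}{d\gamma^{2}}\widetilde{\Delta}_{\alpha,\beta}(\gamma)=\frac{-2\alpha\beta(\alpha-\beta)}{(\alpha\gamma+\beta(1-\gamma))^{3}}$$

$$\begin{aligned}\widetilde{\Delta}_{\alpha,\beta}(\gamma_{opt})&=\frac{\alpha\left(\frac{\beta-\sqrt{\alpha\beta}}{\beta-\alpha}\right)}{(\alpha-\beta)\left(\frac{\beta-\sqrt{\alpha\beta}}{\beta-\alpha}\right)+\beta}-\frac{\beta-\sqrt{\alpha\beta}}{\beta-\alpha}\\&=\frac{\alpha(\beta-\sqrt{\alpha\beta})}{\sqrt{\alpha\beta}(\beta-\alpha)}-\frac{\beta-\sqrt{\alpha\beta}}{\beta-\alpha}=\frac{2\sqrt{\alpha\beta}-\alpha-\beta}{\beta-\alpha}\end{aligned}$$

$$P(d\,|\,\mathbf{v}\mathbf{w})=\frac{1}{1+\frac{P(\mathbf{v}\,|\,\overline{d})P(\overline{d}\,|\,\mathbf{w})}{P(\mathbf{v}\,|\,d)P(d\,|\,\mathbf{w})}}$$

$$\operatorname*{arg\,max}_{\mathbf{v}}P(d\,|\,\mathbf{v}\mathbf{w})=\operatorname*{arg\,max}_{\mathbf{v}}\frac{P(\mathbf{v}\,|\,d)}{P(\mathbf{v}\,|\,\overline{d})}.$$

$$\begin{aligned}\operatorname{KL}\left(P\;\middle\|\;Q\right)&=\sum_{d\mathbf{z}}P(d\mathbf{z})\log\left(\frac{P(d\mathbf{z})}{Q(d\mathbf{z})}\right)\\&=P(d\mathbf{x}\mathbf{y})\log\left(\frac{P(d\mathbf{x}\mathbf{y})}{Q(d\mathbf{x}\mathbf{y})}\right)+P(\overline{d}\mathbf{x}\mathbf{y})\log\left(\frac{P(\overline{d}\mathbf{x}\mathbf{y})}{Q(\overline{d}\mathbf{x}\mathbf{y})}\right)\end{aligned}$$

$$g_{P,d,\mathbf{x},\mathbf{y}}(r)\triangleq P(d\mathbf{x}\mathbf{y})\log\left(\frac{P(d\mathbf{x}\mathbf{y})}{P(d\mathbf{x}\mathbf{y})+r}\right)+P(\overline{d}\mathbf{x}\mathbf{y})\log\left(\frac{P(\overline{d}\mathbf{x}\mathbf{y})}{P(\overline{d}\mathbf{x}\mathbf{y})-r}\right)$$

$$\begin{aligned}Q(d\,|\,\mathbf{x}\mathbf{y})-Q(d\,|\,\mathbf{y})&=\frac{P(d\mathbf{x}\mathbf{y})+r}{P(\mathbf{x}\mathbf{y})}-\frac{P(d\mathbf{y})+r}{P(\mathbf{y})}\\&=P(d\,|\,\mathbf{x}\mathbf{y})-P(d\,|\,\mathbf{y})+r\left(\frac{1}{P(\mathbf{x}\mathbf{y})}-\frac{1}{P(\mathbf{y})}\right)\\&=\Delta_{P,d}(\mathbf{x},\mathbf{y})+r\left(\frac{1}{P(\mathbf{x}\mathbf{y})}-\frac{1}{P(\mathbf{y})}\right).\end{aligned}$$

$$\min_{r}\ \ldots$$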

Code & Models

Repositories

UCLA-StarAI/LearnFairNB (official)

Full text

Learning Fair Naive Bayes Classifiers by Discovering and Eliminating Discrimination Patterns

YooJung Choi,1 Golnoosh Farnadi,2,3,* Behrouz Babaki,4,* and Guy Van den Broeck1

1University of California, Los Angeles, 2Mila, 3Université de Montréal, 4Polytechnique Montréal

*Equal contribution

Abstract

As machine learning is increasingly used to make real-world decisions, recent research efforts aim to define and ensure fairness in algorithmic decision making. Existing methods often assume a fixed set of observable features to define individuals, but lack a discussion of certain features not being observed at test time. In this paper, we study fairness of naive Bayes classifiers, which allow partial observations. In particular, we introduce the notion of a discrimination pattern, which refers to an individual receiving different classifications depending on whether some sensitive attributes were observed. Then a model is considered fair if it has no such pattern. We propose an algorithm to discover and mine for discrimination patterns in a naive Bayes classifier, and show how to learn maximum-likelihood parameters subject to these fairness constraints. Our approach iteratively discovers and eliminates discrimination patterns until a fair model is learned. An empirical evaluation on three real-world datasets demonstrates that we can remove exponentially many discrimination patterns by only adding a small fraction of them as constraints.

1 Introduction

With the increasing societal impact of machine learning come increasing concerns about the fairness properties of machine learning models and how they affect decision making. For example, concerns about fairness come up in policing (?), recidivism prediction (?), insurance pricing (?), hiring (?), and credit rating (?). The algorithmic fairness literature has proposed various solutions, from limiting the disparate treatment of similar individuals to giving statistical guarantees on how classifiers behave towards different populations. Key approaches include individual fairness (?; ?), statistical parity, disparate impact and group fairness (?; ?; ?), counterfactual fairness (?), preference-based fairness (?), relational fairness (?), and equality of opportunity (?). The goal in these works is usually to assure the fair treatment of individuals or groups that are identified by sensitive attributes.

In this paper, we study fairness properties of probabilistic classifiers that represent joint distributions over the features and decision variable. In particular, Bayesian network classifiers treat the classification or decision-making task as a probabilistic inference problem: given observed features, compute the probability of the decision variable. Such models can naturally handle missing features, by simply marginalizing them out of the distribution when they are not observed at prediction time. Hence, a Bayesian network classifier effectively embeds exponentially many classifiers, one for each subset of observable features. We ask whether such classifiers exhibit patterns of discrimination where similar individuals receive markedly different outcomes purely because they disclosed a sensitive attribute.

The first key contribution of this paper is an algorithm to verify whether a Bayesian classifier is fair, or else to mine the classifier for discrimination patterns. We propose two alternative criteria for identifying the most important discrimination patterns that are present in the classifier. We specialize our pattern miner to efficiently discover discrimination patterns in naive Bayes models using branch-and-bound search. These classifiers are often used in practice because of their simplicity and tractability, and they allow for the development of effective bounds. Our empirical evaluation shows that naive Bayes models indeed exhibit vast numbers of discrimination patterns, and that our pattern mining algorithm is able to find them by traversing only a small fraction of the search space.

The second key contribution of this paper is a parameter learning algorithm for naive Bayes classifiers that ensures that no discrimination patterns exist in the learned distribution. We propose a signomial programming approach to eliminate individual patterns of discrimination during maximum-likelihood learning. Moreover, to efficiently eliminate the exponential number of patterns that could exist in a naive Bayes classifier, we propose a cutting-plane approach that uses our discrimination pattern miner to find and iteratively eliminate discrimination patterns until the entire learned model is fair. Our empirical evaluation shows that this process converges in a small number of iterations, effectively removing millions of discrimination patterns. Moreover, the learned fair models are of high quality, achieving likelihoods that are close to the best likelihoods attained by models with no fairness constraints. Our method also achieves higher accuracy than other methods of learning fair naive Bayes models.

2 Problem Formalization

We use uppercase letters for random variables and lowercase letters for their assignments. Sets of variables and their joint assignments are written in bold. Negation of a binary assignment $x$ is denoted $\bar{x}$ , and $\mathbf{x}\!\models\!\mathbf{y}$ means that $\mathbf{x}$ logically implies $\mathbf{y}$ . Concatenation of sets $\mathbf{XY}$ denotes their union.

Each individual is characterized by an assignment to a set of discrete variables $\mathbf{Z}$ , called attributes or features. Assignment $d$ to a binary decision variable $D$ represents a decision made in favor of the individual (e.g., a loan approval). A set of sensitive attributes $\mathbf{S}\subset\mathbf{Z}$ specifies a group of entities protected often by law, such as gender and race. We now define the notion of a discrimination pattern.

Definition 1.

Let $P$ be a distribution over $D\cup\mathbf{Z}$ . Let $\mathbf{x}$ and $\mathbf{y}$ be joint assignments to $\mathbf{X}\!\subseteq\!\mathbf{S}$ and $\mathbf{Y}\!\subseteq\!\mathbf{Z}\!\setminus\!\mathbf{X}$ , respectively. The degree of discrimination of $\mathbf{x}\mathbf{y}$ is:

$$\Delta_{P,d}(\mathbf{x},\mathbf{y})\triangleq P(d\,|\,\mathbf{x}\mathbf{y})-P(d\,|\,\mathbf{y}).$$

The assignment $\mathbf{y}$ identifies a group of similar individuals, and the degree of discrimination quantifies how disclosing sensitive information $\mathbf{x}$ affects the decision for this group. Note that sensitive attributes missing from $\mathbf{x}$ can still appear in $\mathbf{y}$ . We drop the subscripts $P,d$ when clear from context.
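As a minimal sketch of Definition 1, the degree of discrimination of a naive Bayes model can be computed from its parameters by evaluating two posteriors, one with and one without the sensitive assignment. The function names, the dictionary layout, and the numeric parameters below are illustrative assumptions; they are not the parameters of the network in Figure 1.

```python
from math import prod

def posterior(prior_d, likelihoods, evidence):
    """P(d | evidence) in a naive Bayes model with a binary decision D.

    prior_d     -- P(d)
    likelihoods -- {variable: {value: (P(value | d), P(value | not d))}}
    evidence    -- {variable: observed value}; unobserved variables are left
                   out, which marginalizes them in a naive Bayes model.
    """
    pos = prior_d * prod(likelihoods[v][val][0] for v, val in evidence.items())
    neg = (1 - prior_d) * prod(likelihoods[v][val][1] for v, val in evidence.items())
    return pos / (pos + neg)

def degree_of_discrimination(prior_d, likelihoods, x, y):
    """Delta(x, y) = P(d | x, y) - P(d | y)."""
    return posterior(prior_d, likelihoods, {**x, **y}) - posterior(prior_d, likelihoods, y)

# Hypothetical parameters chosen only for illustration.
PRIOR_D = 0.3
LIKELIHOODS = {
    "X":  {1: (0.8, 0.5), 0: (0.2, 0.5)},   # sensitive attribute
    "Y1": {1: (0.7, 0.1), 0: (0.3, 0.9)},   # non-sensitive attribute
}

# With Y empty, this measures the deviation from statistical parity for X = 1.
delta_x1 = degree_of_discrimination(PRIOR_D, LIKELIHOODS, {"X": 1}, {})
```

Leaving a variable out of `evidence` is all that is needed to marginalize it, which is why the same model yields a score for every subset of observed features.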

Definition 2.

Let $P$ be a distribution over $D\cup\mathbf{Z}$ , and $\delta\in[0,1]$ a threshold. Joint assignments $\mathbf{x}$ and $\mathbf{y}$ form a discrimination pattern w.r.t. $P$ and $\delta$ if: (1) $\mathbf{X}\!\subseteq\!\mathbf{S}$ and $\mathbf{Y}\!\subseteq\!\mathbf{Z}\!\setminus\!\mathbf{X}$ ; and (2) $\left\lvert\Delta_{P,d}(\mathbf{x},\mathbf{y})\right\rvert>\delta$ .

Intuitively, we do not want information about the sensitive attributes to significantly affect the probability of getting a favorable decision. Let us consider two special cases of discrimination patterns. First, if $\mathbf{Y}\!=\!\emptyset$, then a small discrimination score $\left\lvert\Delta(\mathbf{x},\emptyset)\right\rvert$ can be interpreted as an approximation of statistical parity, which is achieved when $P(d\,|\,\mathbf{x})=P(d)$. For example, the naive Bayes network in Figure 1 satisfies approximate parity for $\delta\!=\!0.2$ as $\left\lvert\Delta(x,\emptyset)\right\rvert\!=\!0.086\leq\delta$ and $\left\lvert\Delta(\bar{x},\emptyset)\right\rvert\!=\!0.109\leq\delta$. Second, suppose $\mathbf{X}\!=\!\mathbf{S}$ and $\mathbf{Y}\!=\!\mathbf{Z}\!\setminus\!\mathbf{S}$. Then bounding $\left\lvert\Delta(\mathbf{x},\mathbf{y})\right\rvert$ for all joint states $\mathbf{x}$ and $\mathbf{y}$ is equivalent to enforcing individual fairness assuming two individuals are considered similar if their non-sensitive attributes $\mathbf{y}$ are equal. The network in Figure 1 is also individually fair for $\delta=0.2$ because $\max_{xy_{1}y_{2}}\left\lvert\Delta(x,y_{1}y_{2})\right\rvert\!=\!0.167\leq\delta$. (Footnote 1: The highest discrimination score is observed at $\bar{x}$ and $y_{1}\bar{y_{2}}$, with $\Delta(\bar{x},y_{1}\bar{y_{2}})=-0.167$.) We discuss these connections more in Section 5.

Even though the example network is considered (approximately) fair both at the group level and at the individual level with fully observed features, it may still produce a discrimination pattern. In particular, $\left\lvert\Delta(\bar{x},y_{1})\right\rvert\!=\!0.225>\delta$. That is, a person with $\bar{x}$ and $y_{1}$ observed and the value of $Y_{2}$ undisclosed would receive a much more favorable decision had they not disclosed $X$ as well. Hence, we naturally wish to ensure that there exists no discrimination pattern across all subsets of observable features.

Definition 3.

A distribution $P$ is $\delta$-fair if there exists no discrimination pattern w.r.t. $P$ and $\delta$.

Although our notion of fairness applies to any distribution, finding discrimination patterns can be computationally challenging: computing the degree of discrimination involves probabilistic inference, which is hard in general, and a given distribution may have exponentially many patterns. In this paper, we demonstrate how to discover and eliminate discrimination patterns of a naive Bayes classifier effectively by exploiting its independence assumptions. Concretely, we answer the following questions: (1) Can we certify that a classifier is $\delta$ -fair?; (2) If not, can we find the most important discrimination patterns?; (3) Can we learn a naive Bayes classifier that is entirely $\delta$ -fair?

3 Discovering Discrimination Patterns and Verifying $\delta$ -fairness

This section describes our approach to finding discrimination patterns or checking that there are none.

3.1 Searching for Discrimination Patterns

One may naively enumerate all possible patterns and compute their degrees of discrimination. However, this would be very inefficient as there are exponentially many subsets and assignments to consider. We instead use branch-and-bound search to more efficiently decide if a model is fair.
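The naive baseline mentioned above can be sketched as follows, assuming a small naive Bayes model stored as a dictionary of conditional probabilities. The model, the names, and the data layout are hypothetical; the point is only that every sensitive assignment is paired with every assignment to every subset of the remaining features, which is what makes exhaustive enumeration exponential.

```python
from itertools import combinations, product
from math import prod

def posterior(prior_d, lik, evidence):
    """P(d | evidence); lik[v][value] = (P(value | d), P(value | not d))."""
    pos = prior_d * prod(lik[v][val][0] for v, val in evidence.items())
    neg = (1 - prior_d) * prod(lik[v][val][1] for v, val in evidence.items())
    return pos / (pos + neg)

def assignments(variables, lik):
    """Yield every joint assignment to every subset of `variables`."""
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            for values in product(*(lik[v] for v in subset)):
                yield dict(zip(subset, values))

def enumerate_patterns(prior_d, lik, sensitive, delta):
    """Exhaustively list every (x, y, Delta) with |Delta| > delta."""
    found = []
    for x in assignments(sensitive, lik):
        if not x:  # an empty x always has Delta = 0, so it is never a pattern
            continue
        rest = [v for v in lik if v not in x]
        for y in assignments(rest, lik):
            d = posterior(prior_d, lik, {**x, **y}) - posterior(prior_d, lik, y)
            if abs(d) > delta:
                found.append((x, y, d))
    return found

# Hypothetical model: one sensitive attribute X and two other features.
PRIOR_D = 0.3
LIK = {
    "X":  {1: (0.8, 0.5), 0: (0.2, 0.5)},
    "Y1": {1: (0.7, 0.1), 0: (0.3, 0.9)},
    "Y2": {1: (0.6, 0.4), 0: (0.4, 0.6)},
}
patterns = enumerate_patterns(PRIOR_D, LIK, ["X"], delta=0.1)
```

Even this three-variable toy model has 18 candidate patterns; the branch-and-bound search described next avoids visiting most of them on larger models.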

Algorithm 1 finds discrimination patterns. It recursively adds variable instantiations and checks the discrimination score at each step. If the input distribution is $\delta$ -fair, the algorithm returns no pattern; otherwise, it returns the set of all discriminating patterns. Note that computing $\Delta$ requires probabilistic inference on distribution $P$ . This can be done efficiently for large classes of graphical models (?; ?; ?; ?; ?), and particularly for naive Bayes networks, which will be our main focus.

Furthermore, the algorithm relies on a good upper bound to prune the search tree and avoid enumerating all possible patterns. Here, $\text{UB}(\mathbf{x},\mathbf{y},\mathbf{E})$ bounds the degree of discrimination achievable by observing more features after $\mathbf{x}\mathbf{y}$ while excluding features $\mathbf{E}$ .

Proposition 1.

Let $P$ be a naive Bayes distribution over $D\cup\mathbf{Z}$ , and let $\mathbf{x}$ and $\mathbf{y}$ be joint assignments to $\mathbf{X}\!\subseteq\!\mathbf{S}$ and $\mathbf{Y}\!\subseteq\!\mathbf{Z}\!\setminus\!\mathbf{X}$ . Let $\mathbf{x}_{u}^{\prime}$ (resp. $\mathbf{x}_{l}^{\prime}$ ) be an assignment to $\mathbf{X}^{\prime}\!=\!\mathbf{S}\!\setminus\!\mathbf{X}$ that maximizes (resp. minimizes) $P(d\,|\,\mathbf{x}\mathbf{x}^{\prime})$ . Suppose $l,u\in[0,1]$ such that $l\leq P(d\,|\,\mathbf{y}\mathbf{y}^{\prime})\leq u$ for all possible assignments $\mathbf{y}^{\prime}$ to $\mathbf{Y}^{\prime}\!=\!\mathbf{Z}\!\setminus\!(\mathbf{X}\mathbf{Y})$ . Then the degrees of discrimination for all patterns $\mathbf{x}\mathbf{x}^{\prime}\mathbf{y}\mathbf{y}^{\prime}$ that extend $\mathbf{x}\mathbf{y}$ are bounded as follows:

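$$\min_{l\leq\gamma\leq u}\widetilde{\Delta}\left(P(\mathbf{x}\mathbf{x}_{l}^{\prime}\,|\,d),P(\mathbf{x}\mathbf{x}_{l}^{\prime}\,|\,\overline{d}),\gamma\right)\leq\Delta_{P,d}(\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime})\leq\max_{l\leq\gamma\leq u}\widetilde{\Delta}\left(P(\mathbf{x}\mathbf{x}_{u}^{\prime}\,|\,d),P(\mathbf{x}\mathbf{x}_{u}^{\prime}\,|\,\overline{d}),\gamma\right),$$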

where $\widetilde{\Delta}(\alpha,\beta,\gamma)\triangleq\frac{\alpha\gamma}{\alpha\gamma+\beta(1-\gamma)}-\gamma$ .

Here, $\widetilde{\Delta}:[0,1]^{3}\to[0,1]$ is introduced to relax the discrete problem of minimizing or maximizing the degree of discrimination into a continuous one. In particular, $\widetilde{\Delta}\left(P(\mathbf{x}|d),P(\mathbf{x}|\overline{d}),P(d|\mathbf{y})\right)$ equals the degree of discrimination $\Delta(\mathbf{x},\mathbf{y})$ . This relaxation allows us to compute bounds efficiently, as closed-form solutions. We refer to the Appendix for full proofs and details.
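One way to see why the relaxed bounds admit closed-form solutions: for fixed $\alpha,\beta>0$, the derivative of $\widetilde{\Delta}$ with respect to $\gamma$ is $\alpha\beta/(\alpha\gamma+\beta(1-\gamma))^{2}-1$, which vanishes at the single point $\gamma_{opt}=(\beta-\sqrt{\alpha\beta})/(\beta-\alpha)$. The extrema over $[l,u]$ are therefore attained at $l$, at $u$, or at $\gamma_{opt}$. The sketch below is an illustration under that observation, with hypothetical function names; it assumes strictly positive $\alpha$ and $\beta$.

```python
from math import sqrt

def delta_tilde(alpha, beta, gamma):
    """Relaxed degree of discrimination for alpha = P(x|d), beta = P(x|not d)."""
    return alpha * gamma / (alpha * gamma + beta * (1 - gamma)) - gamma

def delta_tilde_range(alpha, beta, lo, hi):
    """(min, max) of delta_tilde over gamma in [lo, hi], assuming alpha, beta > 0."""
    candidates = [lo, hi]
    if alpha != beta:  # for alpha == beta the function is identically zero
        gamma_opt = (beta - sqrt(alpha * beta)) / (beta - alpha)
        if lo < gamma_opt < hi:
            candidates.append(gamma_opt)
    values = [delta_tilde(alpha, beta, g) for g in candidates]
    return min(values), max(values)

# Illustrative values only.
lower, upper = delta_tilde_range(0.8, 0.2, 0.0, 1.0)
```

Only three function evaluations are needed per bound, regardless of how many features remain unobserved.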

To apply the above proposition, we need to find $\mathbf{x}_{u}^{\prime},\mathbf{x}_{l}^{\prime},l,u$ by maximizing/minimizing $P(d|\mathbf{x}\mathbf{x}^{\prime})$ and $P(d|\mathbf{y}\mathbf{y}^{\prime})$ for a given pattern $\mathbf{x}\mathbf{y}$. Fortunately, this can be done efficiently for naive Bayes classifiers.

Lemma 1.

Given a naive Bayes distribution $P$ over $D\!\cup\!\mathbf{Z}$ , a subset $\mathbf{V}\!=\!\{V_{i}\}_{i=1}^{n}\!\subset\!\mathbf{Z}$ , and an assignment $\mathbf{w}$ to $\mathbf{W}\!\subseteq\!\mathbf{Z}\!\setminus\!\mathbf{V}$ , we have: $\operatorname*{arg\,max}_{\mathbf{v}}P(d|\mathbf{v}\mathbf{w})=\left\{\operatorname*{arg\,max}_{v_{i}}P(v_{i}|d)/P(v_{i}|\overline{d})\right\}_{i=1}^{n}$ .

That is, the joint observation $\mathbf{v}$ that will maximize the probability of the decision can be found by optimizing each variable $V_{i}$ independently; the same holds when minimizing. Hence, we can use Proposition 1 to compute upper bounds on discrimination scores of extended patterns in linear time.
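Lemma 1 can be illustrated with a short sketch that picks, for each variable separately, the value with the largest (or smallest) likelihood ratio, and compares the result with a brute-force search over joint assignments. The model parameters and function names are hypothetical, and the sketch assumes every $P(v_{i}\,|\,\overline{d})$ is nonzero.

```python
from itertools import product
from math import prod

def posterior(prior_d, lik, evidence):
    """P(d | evidence); lik[v][value] = (P(value | d), P(value | not d))."""
    pos = prior_d * prod(lik[v][val][0] for v, val in evidence.items())
    neg = (1 - prior_d) * prod(lik[v][val][1] for v, val in evidence.items())
    return pos / (pos + neg)

def best_assignment(lik, variables, maximize=True):
    """Per-variable optimum of the ratio P(v | d) / P(v | not d), as in Lemma 1."""
    pick = max if maximize else min
    return {
        v: pick(lik[v], key=lambda val: lik[v][val][0] / lik[v][val][1])
        for v in variables
    }

def brute_force_best(prior_d, lik, variables, w):
    """Reference answer: try every joint assignment to `variables`."""
    candidates = (
        dict(zip(variables, vals))
        for vals in product(*(lik[v] for v in variables))
    )
    return max(candidates, key=lambda v: posterior(prior_d, lik, {**v, **w}))

# Hypothetical parameters.
PRIOR_D = 0.3
LIK = {
    "V1": {"a": (0.2, 0.5), "b": (0.5, 0.3), "c": (0.3, 0.2)},
    "V2": {0: (0.6, 0.4), 1: (0.4, 0.6)},
    "W":  {0: (0.9, 0.3), 1: (0.1, 0.7)},
}
fast = best_assignment(LIK, ["V1", "V2"])
slow = brute_force_best(PRIOR_D, LIK, ["V1", "V2"], {"W": 1})
```

The per-variable search touches each value once, whereas the brute-force reference grows with the product of the domain sizes.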

3.2 Searching for Top- $k$ Ranked Patterns

If a distribution is significantly unfair, Algorithm 1 may return exponentially many discrimination patterns. This is not only very expensive but makes it difficult to interpret the discrimination patterns. Instead, we would like to return a smaller set of “interesting” discrimination patterns.

An obvious choice is to return a small number of discrimination patterns with the highest absolute degree of discrimination. Searching for the $k$ most discriminating patterns can be done with a small modification to Algorithm 1. First, the size of list $L$ is limited to $k$ . The conditions in Lines 3–7 are modified to check the current discrimination score and upper bounds against the smallest discrimination score of patterns in $L$ , instead of the threshold $\delta$ .

Nevertheless, ranking patterns by their discrimination score may return patterns of very low probability. For example, the most discriminating pattern of a naive Bayes classifier learned on the COMPAS dataset (Footnote 2: https://github.com/propublica/compas-analysis) has a high discrimination score of 0.42, but only has a 0.02% probability of occurring. (Footnote 3: The corresponding pattern is $\mathbf{x}\!=\!\{\text{White},\text{Married},\text{Female},{>}30\text{ y/o}\}$, $\mathbf{y}\!=\!\{\text{Probation},\text{Pretrial}\}$.) The probability of a discrimination pattern denotes the proportion of the population (according to the distribution) that could be affected unfairly, and thus a pattern with extremely low probability could be of lesser interest. To address this concern, we propose a more sophisticated ranking of the discrimination patterns that also takes into account the probabilities of patterns.

Definition 4.

Let $P$ be a distribution over $D\cup\mathbf{Z}$ . Let $\mathbf{x}$ and $\mathbf{y}$ be joint instantiations to subsets $\mathbf{X}\subseteq\mathbf{S}$ and $\mathbf{Y}\subseteq\mathbf{Z}\setminus\mathbf{X}$ , respectively. The divergence score of $\mathbf{x}\mathbf{y}$ is:

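$$\operatorname{Div}_{P,d,\delta}(\mathbf{x},\mathbf{y})\triangleq\min_{Q}\operatorname{KL}\left(P\;\middle\|\;Q\right)\quad\text{s.t.}\quad\left\lvert\Delta_{Q,d}(\mathbf{x},\mathbf{y})\right\rvert\leq\delta,\quad P(d\mathbf{z})=Q(d\mathbf{z}),\ \forall\,d\mathbf{z}\not\models\mathbf{x}\mathbf{y},\qquad(1)$$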

where $\operatorname{KL}\left(P\;\middle\|\;Q\right)=\sum_{d,\mathbf{z}}P(d\mathbf{z})\log(P(d\mathbf{z})/Q(d\mathbf{z}))$ .

The divergence score assigns to a pattern $\mathbf{x}\mathbf{y}$ the minimum Kullback-Leibler (KL) divergence between current distribution $P$ and a hypothetical distribution $Q$ that is fair on the pattern $\mathbf{x}\mathbf{y}$ and differs from $P$ only on the assignments that satisfy the pattern (namely $d\mathbf{x}\mathbf{y}$ and $\overline{d}\mathbf{x}\mathbf{y}$ ). Informally, the divergence score approximates how much the current distribution $P$ needs to be changed in order for $\mathbf{x}\mathbf{y}$ to no longer be a discrimination pattern. Hence, patterns with higher divergence score will tend to have not only higher discrimination score but also higher probabilities.

For instance, the pattern with the highest divergence score (Footnote 4: $\mathbf{x}=\{\text{Married},{>}30\text{ y/o}\}$, $\mathbf{y}=\{\}$.) on the COMPAS dataset has a discrimination score of 0.19, which is not insignificant, but also has a relatively high probability of 3.33%, more than two orders of magnitude larger than that of the most discriminating pattern. Therefore, such a general pattern could be more interesting for the user studying this classifier.

To find the top-$k$ patterns with the divergence score, we need to be able to compute the score and its upper bound efficiently. The key insights are that the KL divergence is convex and that $Q$, in Equation 1, can freely differ from $P$ only on one probability value (either that of $d\mathbf{x}\mathbf{y}$ or $\overline{d}\mathbf{x}\mathbf{y}$). Then:

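$$\operatorname{Div}_{P,d,\delta}(\mathbf{x},\mathbf{y})=P(d\mathbf{x}\mathbf{y})\log\left(\frac{P(d\mathbf{x}\mathbf{y})}{P(d\mathbf{x}\mathbf{y})+r}\right)+P(\overline{d}\mathbf{x}\mathbf{y})\log\left(\frac{P(\overline{d}\mathbf{x}\mathbf{y})}{P(\overline{d}\mathbf{x}\mathbf{y})-r}\right),$$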

where $r=0$ if $\left\lvert\Delta_{P,d}(\mathbf{x},\mathbf{y})\right\rvert\!\leq\!\delta$ ; $r=\frac{\delta-\Delta_{P,d}(\mathbf{x},\mathbf{y})}{1/P(\mathbf{x}\mathbf{y})-1/P(\mathbf{y})}$ if $\Delta_{P,d}(\mathbf{x},\mathbf{y})\!>\!\delta$ ; and $r=\frac{-\delta-\Delta_{P,d}(\mathbf{x},\mathbf{y})}{1/P(\mathbf{x}\mathbf{y})-1/P(\mathbf{y})}$ if $\Delta_{P,d}(\mathbf{x},\mathbf{y})\!<\!-\delta$ . Intuitively, $r$ represents the minimum necessary change to $P(d\mathbf{x}\mathbf{y})$ for $\mathbf{x}\mathbf{y}$ to be non-discriminating in the new distribution. Note that the smallest divergence score of 0 is attained when the pattern is already fair.
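The closed form can be sketched directly from four joint probabilities. The code below is an illustration with hypothetical function names and made-up probabilities; it assumes $P(\mathbf{x}\mathbf{y})<P(\mathbf{y})$, so that the denominator of $r$ is nonzero, and strictly positive probabilities for $d\mathbf{x}\mathbf{y}$ and $\overline{d}\mathbf{x}\mathbf{y}$.

```python
from math import log

def minimal_shift(disc, p_xy, p_y, delta):
    """The change r to P(d, x, y) that brings |Delta| back to delta (0 if already fair)."""
    if abs(disc) <= delta:
        return 0.0
    target = delta if disc > delta else -delta
    return (target - disc) / (1 / p_xy - 1 / p_y)

def divergence_score(p_dxy, p_xy, p_dy, p_y, delta):
    """Divergence score of the pattern (x, y), computed from joint probabilities of P."""
    disc = p_dxy / p_xy - p_dy / p_y          # Delta = P(d | xy) - P(d | y)
    r = minimal_shift(disc, p_xy, p_y, delta)
    if r == 0.0:
        return 0.0
    p_ndxy = p_xy - p_dxy                      # P(not d, x, y)
    return p_dxy * log(p_dxy / (p_dxy + r)) + p_ndxy * log(p_ndxy / (p_ndxy - r))

# Illustrative numbers: P(d | xy) = 0.6 and P(d | y) = 0.3, so Delta = 0.3.
score = divergence_score(p_dxy=0.12, p_xy=0.2, p_dy=0.15, p_y=0.5, delta=0.1)
r = minimal_shift(0.3, 0.2, 0.5, 0.1)
```

Because $r$ is added to both $P(d\mathbf{x}\mathbf{y})$ and $P(d\mathbf{y})$, shifting by `r` moves the degree of discrimination exactly onto the threshold, which is the smallest change that makes the pattern non-discriminating.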

Lastly, we refer to the Appendix for two upper bounds of the divergence score, which utilize the bound on discrimination score of Proposition 1 and can be computed efficiently using Lemma 1.

3.3 Empirical Evaluation of Discrimination Pattern Miner

In this section, we report the experimental results on the performance of our pattern mining algorithms. All experiments were run on an AMD Opteron 275 processor (2.2GHz) with 4GB of RAM running CentOS 7 Linux. Execution time is limited to 1800 seconds.

Data and pre-processing. We use three datasets: The Adult dataset and German dataset are used for predicting income level and credit risk, respectively, and are obtained from the UCI machine learning repository (Footnote 5: https://archive.ics.uci.edu/ml); the COMPAS dataset is used for predicting recidivism. These datasets have been commonly studied regarding fairness and were shown to exhibit some form of discrimination by several previous works (?; ?; ?; ?). As pre-processing, we removed unique features (e.g. names of individuals) and duplicate features. (Footnote 6: The processed data, code, and Appendix are available at https://github.com/UCLA-StarAI/LearnFairNB.) See Table 1 for a summary.

Q1. Does our pattern miner find discrimination patterns more efficiently than by enumerating all possible patterns? We answer this question by inspecting the fraction of all possible patterns that our pattern miner visits during the search. Table 1 shows the results on three datasets, using two rank heuristics (discrimination and divergence) and three threshold values (0.01, 0.05, and 0.1). The results are reported for mining the top- $k$ patterns when $k$ is 1, 10, and 100. A naive method has to enumerate all possible patterns to discover the discriminating ones, while our algorithm visits only a small fraction of patterns (e.g., one in every several millions on the German dataset).

Q2. Does the divergence score find discrimination patterns with both a high discrimination score and high probability? Figure 2 shows the probability and discrimination score of all patterns in the COMPAS dataset. The top-10 patterns according to three measures (discrimination score, divergence score, and probability) are highlighted in the figure. The observed trade-off between probability and discrimination score indicates that picking the top patterns according to either measure alone will yield low-quality patterns according to the other measure. The divergence score, however, balances the two measures and returns patterns that have high probability and discrimination scores. Also observe that the patterns selected by the divergence score lie on the Pareto front. This in fact always holds by the definition of this heuristic; fixing the probability and increasing the discrimination score will also increase the divergence score, and vice versa.

4 Learning Fair Naive Bayes Classifiers

We now describe our approach to learning the maximum-likelihood parameters of a naive Bayes model from data while eliminating discrimination patterns. A common approach to learning naive Bayes models with certain properties is to formulate it as an optimization problem of certain form, for which efficient solvers are available (?). We formulate the learning subject to fairness constraints as a signomial program, which has the following form:

$$\text{minimize}\ f_{0}(x),\quad\text{s.t.}\quad f_{i}(x)\leq 1,\quad g_{j}(x)=1\quad\forall\,i,j,$$

where each $f_{i}$ is signomial while $g_{j}$ is monomial. A signomial is a function of the form $\sum_{k}c_{k}x_{1}^{a_{1k}}\cdots x_{n}^{a_{nk}}$ defined over real positive variables $x_{1}\ldots x_{n}$ where $c_{k},a_{ik}\in\mathbb{R}$; a monomial is of the form $cx_{1}^{a_{1}}\cdots x_{n}^{a_{n}}$ where $c>0$ and $a_{i}\in\mathbb{R}$. Signomial programs are not globally convex, so solvers can efficiently compute only a locally optimal solution, in contrast to the closely related class of geometric programs, for which the global optimum can be found efficiently (?).

4.1 Parameter Learning with Fairness Constraints

The likelihood of a Bayesian network given data $\mathcal{D}$ is $P_{\theta}(\mathcal{D})\!=\!\prod_{i}\theta_{i}^{n_{i}}$ where $n_{i}$ is the number of examples in $\mathcal{D}$ that satisfy the assignment corresponding to parameter $\theta_{i}$ . To learn the maximum-likelihood parameters, we minimize the inverse of likelihood which is a monomial: $\theta_{\text{ml}}\!=\!\operatorname*{arg\,min}_{\theta}\prod_{i}\theta_{i}^{-n_{i}}$ . The parameters of a naive Bayes network with binary class consist of $\theta_{d\,|\,},\theta_{\bar{d}\,|\,}$ , and $\theta_{z\,|\,d},\theta_{z\,|\,\bar{d}}$ for all $z$ .
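To make the objective concrete, the sketch below evaluates both the monomial $\prod_{i}\theta_{i}^{-n_{i}}$ and the equivalent log-likelihood $\sum_{i}n_{i}\log\theta_{i}$ for a single binary variable. The counts and names are hypothetical, and no smoothing is applied here.

```python
from math import log, prod

def inverse_likelihood(theta, counts):
    """The monomial objective prod_i theta_i^(-n_i)."""
    return prod(theta[key] ** -n for key, n in counts.items())

def log_likelihood(theta, counts):
    """sum_i n_i * log(theta_i); minimizing the monomial maximizes this."""
    return sum(n * log(theta[key]) for key, n in counts.items())

# Hypothetical counts for the decision variable alone.
COUNTS = {"d": 3, "not_d": 7}
TOTAL = sum(COUNTS.values())
# Without extra constraints, relative frequencies are the maximum-likelihood estimate.
THETA_ML = {key: n / TOTAL for key, n in COUNTS.items()}
```

Since the logarithm is monotone, minimizing the inverse likelihood and maximizing the log-likelihood select the same parameters; the monomial form is what allows the objective to be used in a signomial program.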

Next, we show the constraints for our optimization problem. To learn a valid distribution, we need to ensure that probabilities are non-negative and sum to one. The former assumption is inherent to signomial programs. To enforce the latter, for each instantiation $d$ and feature $Z$ , we need that $\sum_{z}\theta_{z\,|\,d}=1$ , or as signomial inequality constraints: $\sum_{z}\theta_{z\,|\,d}\leq 1$ and $2-\sum_{z}\theta_{z\,|\,d}\leq 1$ .

Finally, we derive the constraints to ensure that a given pattern $\mathbf{x}\mathbf{y}$ is non-discriminating.

Proposition 2.

Let $P_{\theta}$ be a naive Bayes distribution over $D\cup\mathbf{Z}$ , and let $\mathbf{x}$ and $\mathbf{y}$ be joint assignments to $\mathbf{X}\subseteq\mathbf{S}$ and $\mathbf{Y}\subseteq\mathbf{Z}\setminus\mathbf{X}$ . Then $\left\lvert\Delta_{P_{\theta},d}(\mathbf{x},\mathbf{y})\right\rvert\leq\delta$ for a threshold $\delta\in[0,1]$ iff the following holds:

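$$r_{\mathbf{x}}=\frac{\prod_{x}\theta_{x\,|\,\bar{d}}}{\prod_{x}\theta_{x\,|\,d}},\qquad r_{\mathbf{y}}=\frac{\theta_{\bar{d}\,|\,}\prod_{y}\theta_{y\,|\,\bar{d}}}{\theta_{d\,|\,}\prod_{y}\theta_{y\,|\,d}},$$

$$\left(\frac{1-\delta}{\delta}\right)r_{\mathbf{x}}r_{\mathbf{y}}-\left(\frac{1+\delta}{\delta}\right)r_{\mathbf{y}}-r_{\mathbf{x}}r_{\mathbf{y}}^{2}\leq 1,$$

$$-\left(\frac{1+\delta}{\delta}\right)r_{\mathbf{x}}r_{\mathbf{y}}+\left(\frac{1-\delta}{\delta}\right)r_{\mathbf{y}}-r_{\mathbf{x}}r_{\mathbf{y}}^{2}\leq 1.$$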

Note that the above equalities and inequalities are valid signomial program constraints. Thus, we can learn the maximum-likelihood parameters of a naive Bayes network while ensuring a certain pattern is fair by solving a signomial program. Furthermore, we can eliminate multiple patterns by adding the constraints in Proposition 2 for each of them. However, learning a model that is entirely fair with this approach will introduce an exponential number of constraints. Not only does this make the optimization more challenging, but listing all patterns may simply be infeasible.
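The constraints of Proposition 2 can be checked numerically. In a naive Bayes model, $P(d\,|\,\mathbf{y})=1/(1+r_{\mathbf{y}})$ and $P(d\,|\,\mathbf{x}\mathbf{y})=1/(1+r_{\mathbf{x}}r_{\mathbf{y}})$, where $r_{\mathbf{x}}$ is the ratio of the products of $\theta_{x\,|\,\bar{d}}$ and $\theta_{x\,|\,d}$, and $r_{\mathbf{y}}$ is the analogous ratio for $\mathbf{y}$ including the prior. Multiplying $\left\lvert\Delta\right\rvert\leq\delta$ through by the positive denominators gives two signomial inequalities in $r_{\mathbf{x}}$ and $r_{\mathbf{y}}$. The sketch below evaluates them and compares the outcome with a direct test of $\left\lvert\Delta\right\rvert\leq\delta$; names and parameter values are hypothetical, and $\delta>0$ is assumed.

```python
from math import prod

def ratios(theta_d, theta_x, theta_y):
    """r_x and r_y from lists of (theta_{v|d}, theta_{v|not d}) for the observed values."""
    r_x = prod(neg for _, neg in theta_x) / prod(pos for pos, _ in theta_x)
    r_y = ((1 - theta_d) * prod(neg for _, neg in theta_y)
           / (theta_d * prod(pos for pos, _ in theta_y)))
    return r_x, r_y

def satisfies_constraints(r_x, r_y, delta):
    """Evaluate the two signomial inequalities in r_x and r_y."""
    a = (1 - delta) / delta
    b = (1 + delta) / delta
    first = a * r_x * r_y - b * r_y - r_x * r_y ** 2
    second = -b * r_x * r_y + a * r_y - r_x * r_y ** 2
    return first <= 1 and second <= 1

def discrimination(r_x, r_y):
    """Delta = P(d | xy) - P(d | y), written in terms of the ratios."""
    return 1 / (1 + r_x * r_y) - 1 / (1 + r_y)

# Hypothetical parameters: one sensitive and one non-sensitive observed value.
r_x, r_y = ratios(0.3, [(0.8, 0.5)], [(0.7, 0.1)])
is_fair = satisfies_constraints(r_x, r_y, delta=0.1)
```

Both inequalities are sums of monomials in the parameters with real coefficients, which is why they fit the signomial program form.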

4.2 Learning $\delta$ -fair Parameters

To address the aforementioned challenge of removing an exponential number of discrimination patterns, we propose an approach based on the cutting plane method. That is, we iterate between parameter learning and constraint extraction, gradually adding fairness constraints to the optimization. The parameter learning component is as described in the previous section, where we add the constraints of Proposition 2 for each discrimination pattern that has been extracted so far. For constraint extraction we use the top- $k$ pattern miner presented in Section 3.2. At each iteration, we learn the maximum-likelihood parameters subject to fairness constraints, and find $k$ more patterns using the updated parameters to add to the set of constraints in the next iteration. This process is repeated until the pattern miner finds no more discrimination patterns.
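The iteration described above can be sketched as a short loop. This is only an outline under stated assumptions: `learn_params` stands in for the constrained signomial-program solver and `mine_top_k` for the top- $k$ pattern miner; both are hypothetical callables supplied by the caller, and `max_iters` is a safeguard that is not part of the description above.

```python
def learn_delta_fair(learn_params, mine_top_k, k=1, max_iters=100):
    """Cutting-plane style loop (illustrative sketch).

    learn_params(constraints) -> parameters fitted subject to the
        fairness constraints of the given patterns (hypothetical solver).
    mine_top_k(params, k) -> up to k discrimination patterns that the
        current parameters still exhibit (hypothetical miner).
    """
    constraints = []                      # patterns extracted so far
    params = learn_params(constraints)    # unconstrained fit first
    for _ in range(max_iters):
        new_patterns = mine_top_k(params, k)
        if not new_patterns:
            # The miner found no remaining pattern: the model is delta-fair.
            return params, constraints
        constraints.extend(p for p in new_patterns if p not in constraints)
        params = learn_params(constraints)
    raise RuntimeError("no delta-fair model found within max_iters")
```

The loop only ever enforces the patterns the miner returns, which is why the number of explicit constraints can stay far below the number of patterns in the unconstrained model.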

In the worst case, our algorithm may add exponentially many fairness constraints whilst solving multiple optimization problems. However, as we will later show empirically, we can learn a $\delta$ -fair model by explicitly enforcing only a small fraction of fairness constraints. The efficacy of our approach depends on strategically extracting patterns that are significant in the overall distribution. Here, we again use a ranking by discrimination or divergence score, which we also evaluate empirically.

4.3 Empirical Evaluation of $\delta$ -fair Learner

We will now evaluate our iterative algorithm for learning $\delta$ -fair naive Bayes models. We use the same datasets and hardware as in Section 3.3. To solve the signomial programs, we use GPkit, which finds local solutions to these problems using a convex optimization solver as its backend (we use Mosek, www.mosek.com, as backend). Throughout our experiments, Laplace smoothing was used to avoid learning zero probabilities.

Q1. Can we learn a $\delta$ -fair model in a small number of iterations while only asserting a small number of fairness constraints? We train a naive Bayes model on the COMPAS dataset subject to $\delta$ -fairness constraints. Fig. 3(a) shows how the iterative method converges to a $\delta$ -fair model, whose likelihood is indicated by the dotted line. Our approach converges to a fair model in a few iterations, including only a small fraction of the fairness constraints. In particular, adding only the most discriminating pattern as a constraint at each iteration learns an entirely $\delta$ -fair model with only three fairness constraints. (There are 2695 discrimination patterns w.r.t. unconstrained naive Bayes on COMPAS and $\delta=0.1$ .) Moreover, Fig. 3(b) shows the number of remaining discrimination patterns after each iteration of learning with $k\!=\!1$ . Note that enforcing a single fairness constraint can eliminate a large number of remaining ones. Eventually, a few constraints subsume all discrimination patterns.

We also evaluated our $\delta$ -fair learner on the other two datasets; see Appendix for plots. We observed that more than a million discrimination patterns that exist in the unconstrained maximum-likelihood models were eliminated using a few dozen to, even in the worst case, a few thousand fairness constraints. Furthermore, stricter fairness requirements (smaller $\delta$ ) tend to require more iterations, as would be expected. An interesting observation is that neither of the rankings consistently dominates the other in terms of the number of iterations to converge.

Q2. How does the quality of naive Bayes models from our fair learner compare to ones that make the sensitive attributes independent of the decision, and to the best model without fairness constraints? A simple method to guarantee that a naive Bayes model is $\delta$ -fair is to make all sensitive variables independent from the target value. An obvious downside is the negative effect on the predictive power of the model. We compare the models learned by our approach with: (1) a maximum-likelihood model with no fairness constraints (unconstrained) and (2) a model in which the sensitive variables are independent of the decision variable, and the remaining parameters are learned using the max-likelihood criterion (independent). These models lie at two opposite ends of the spectrum of the trade-off between fairness and accuracy. The $\delta$ -fair model falls between these extremes, balancing approximate fairness and prediction power.
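One way to picture the independent baseline is to overwrite each sensitive parameter $\theta_{s\,|\,d}$ with the marginal of $s$, so that the class no longer influences the sensitive feature. The sketch below assumes parameters stored as `prior[d]` and `cond[(feature, value, d)]` and derives the marginal from the fitted parameters; the function name and this layout are illustrative assumptions, and the experiments may have estimated the marginal directly from data.

```python
def make_independent(prior, cond, sensitive):
    """Return conditionals in which every sensitive feature is independent
    of the class (hypothetical helper).

    prior     : dict class -> theta_d
    cond      : dict (feature, value, class) -> theta_{value | class}
    sensitive : set of feature names treated as sensitive
    """
    out = dict(cond)
    for (f, z, d) in cond:
        if f in sensitive:
            # Marginal P(z) = sum_c theta_c * theta_{z|c}; identical for every class d.
            out[(f, z, d)] = sum(prior[c] * cond[(f, z, c)] for c in prior)
    return out
```

Because the replaced parameters are identical across classes, observing a sensitive value leaves the posterior of the decision unchanged, so no discrimination pattern can exist in such a model.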

We compare the log-likelihood of these models, shown in Table 2, as it captures the overall quality of a probabilistic classifier which can make predictions with partial observations. The $\delta$ -fair models achieve likelihoods that are much closer to those of the unconstrained models than the independent ones. This shows that it is possible to enforce the fairness constraints without a major reduction in model quality.

Q3. Do discrimination patterns still occur when learning naive Bayes models from fair data? We first use the data repair algorithm proposed by ? (?) to remove discrimination from data, and learn a naive Bayes model from the repaired data. Table 3 shows the number of remaining discrimination patterns in such a model. The results indicate that as long as preserving some degree of accuracy is in the objective, this method leaves many discrimination patterns, whereas our method removes all patterns.

Q4. How does the performance of the $\delta$ -fair naive Bayes classifier compare to existing work?

Table 4 reports the 10-fold CV accuracy of our method ( $\delta$ -fair) compared to a max-likelihood naive Bayes model (unconstrained) and two other methods of learning fair classifiers: the two-naive-Bayes method (2NB) (?), and a naive Bayes model trained on discrimination-free data using the repair algorithm of ? (?) with $\lambda=1$ . Even though the notion of discrimination patterns was proposed for settings in which predictions are made with missing values, our method still outperforms the other fair models in terms of accuracy, a measure better suited for predictions using fully-observed features. Moreover, our method enforces a stronger definition of fairness than the two-naive-Bayes method, which aims to achieve statistical parity, a notion subsumed by discrimination patterns. It is also interesting to observe that our $\delta$ -fair NB models perform even better than the unconstrained NB models on the Adult and German datasets. Hence, removing discrimination patterns does not necessarily impose an extra cost on the prediction task.

5 Related Work

Most prominent definitions of fairness in machine learning can be largely categorized into individual fairness and group fairness. Individual fairness is based on the intuition that similar individuals should be treated similarly. For instance, the Lipschitz condition (?) requires that the statistical distance between the classifier outputs of two individuals is bounded by a task-specific distance between them. As hinted at in Section 2, our proposed notion of $\delta$ -fairness satisfies the Lipschitz condition if two individuals who differ only in the sensitive attributes are considered similar, thus bounding the difference between their outputs by $\delta$ . However, our definition cannot represent more nuanced similarity metrics that consider relationships between feature values.

Group fairness aims at achieving equality among populations differentiated by their sensitive attributes. An example of a group fairness definition is statistical (demographic) parity, which states that a model is fair if the probability of getting a positive decision is equal between two groups defined by the sensitive attribute, i.e. $P(d|s)\!=\!P(d|\bar{s})$ where $d$ is the positive decision and $S$ is the sensitive variable. Approximate measures of statistical parity include the CV-discrimination score (?): $P(d|s)\!-\!P(d|\bar{s})$ ; and disparate impact (or $p$ %-rule) (?; ?): $P(d|\bar{s})/P(d|s)$ . Our definition of $\delta$ -fairness is strictly stronger than requiring a small CV-discrimination score, as a violation of (approximate) statistical parity corresponds to a discrimination pattern with only the sensitive attribute (i.e. empty $\mathbf{y}$ ). Even though the $p$ %-rule was not explicitly discussed in this paper, our notion of discrimination pattern can be extended to require a small relative (instead of absolute) difference for partial feature observations (see Appendix for details). However, as a discrimination pattern conceptually represents an unfair treatment of an individual based on observing some sensitive attributes, using relative difference should be motivated by an application where the level of unfairness depends on the individual’s classification score.
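For concreteness, the two approximate parity measures can be computed from a list of binary decisions and a binary sensitive attribute. This is a small illustrative sketch; the function names and the encoding (1 for $s$ and for a positive decision, 0 for $\bar{s}$ and a negative decision) are assumptions, not part of the paper.

```python
def positive_rate(decisions, sensitive, group):
    """Empirical P(d | S = group) for 0/1 decisions."""
    selected = [d for d, s in zip(decisions, sensitive) if s == group]
    return sum(selected) / len(selected)

def cv_score(decisions, sensitive):
    """CV-discrimination score: P(d | s) - P(d | not s)."""
    return positive_rate(decisions, sensitive, 1) - positive_rate(decisions, sensitive, 0)

def disparate_impact(decisions, sensitive):
    """Disparate impact ratio: P(d | not s) / P(d | s)."""
    return positive_rate(decisions, sensitive, 0) / positive_rate(decisions, sensitive, 1)
```

Both functions look only at the sensitive attribute, which is exactly the special case of a discrimination pattern with empty $\mathbf{y}$.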

Moreover, statistical parity is inadequate in detecting bias for subgroups or individuals. We resolve this issue by eliminating discrimination patterns for all subgroups that can be expressed as assignments to subsets of features. In fact, we satisfy approximate statistical parity for any subgroup defined over the set of sensitive attributes, as any subgroup can be expressed as a union of joint assignments to the sensitive features, each of which has a bounded discrimination score. ? (?) showed that auditing fairness at this arbitrary subgroup level (i.e. detecting fairness gerrymandering) is computationally hard.
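The union argument can be spelled out as a short worked inequality. This is a sketch in notation introduced here for illustration: $G$ denotes a subgroup written as a disjoint union of joint assignments $\mathbf{x}_{1},\dots,\mathbf{x}_{m}$ to the sensitive features, and each $\mathbf{x}_{i}$ is assumed to satisfy $\lvert P(d\,|\,\mathbf{x}_{i})-P(d)\rvert\leq\delta$ , i.e. the absence of a discrimination pattern with empty $\mathbf{y}$ .

```latex
\begin{aligned}
P(d \mid G) &= \sum_{i=1}^{m} P(\mathbf{x}_i \mid G)\, P(d \mid \mathbf{x}_i), \\
\lvert P(d \mid G) - P(d) \rvert
  &\le \sum_{i=1}^{m} P(\mathbf{x}_i \mid G)\,
       \lvert P(d \mid \mathbf{x}_i) - P(d) \rvert
  \le \delta .
\end{aligned}
```

The last step uses $\sum_{i}P(\mathbf{x}_{i}\,|\,G)=1$ . By the triangle inequality, any two such subgroups then differ in their positive-decision probability by at most $2\delta$ .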

Other notions of group fairness include equalized true positive rates (equality of opportunity), false positive rates, or both (equalized odds (?)) among groups defined by the sensitive attributes. These definitions are “oblivious” to features other than the sensitive attribute, and focus on equalizing measures of classifier performance assuming all features are always observed. On the other hand, our method aims to ensure fairness when classifications may be made with missing features. Moreover, our method still applies in decision making scenarios where a true label is not well defined or hard to observe.

Our approach differs from causal approaches to fairness (?; ?; ?) which are more concerned with the causal mechanism of the real world that generated a potentially unfair decision, whereas we study the effect of sensitive information on a known classifier.

There exist several approaches to learning fair naive Bayes models. First, one may modify the data to achieve fairness and use standard algorithms to learn a classifier from the modified data. For instance, ? (?) proposed to change the labels for features near the decision boundary to achieve statistical parity, while the repair algorithm of ? (?) changes the non-sensitive attributes to reduce their correlation with the sensitive attribute. Although these methods have the flexibility of learning different models, we have shown empirically that a model learned from fair data may still exhibit discrimination patterns. On the other hand, ? (?) proposed three different Bayesian network structures modified from a naive Bayes network in order to enforce statistical parity directly during learning. We have shown in the previous section that our method achieves better accuracy than their two-naive-Bayes method (which was found to be the best of the three methods), while ensuring a stricter definition of fairness. Lastly, one may add a regularizer during learning (?; ?), whereas we formulated the problem as constrained optimization, an approach often used to ensure fairness in other models (?; ?).

6 Discussion and Conclusion

In this paper we introduced a novel definition of fair probability distribution in terms of discrimination patterns which considers exponentially many (partial) observations of features. We have also presented algorithms to search for discrimination patterns in naive Bayes networks and to learn a high-quality fair naive Bayes classifier from data. We empirically demonstrated the efficiency of our search algorithm and the ability to eliminate exponentially many discrimination patterns by iteratively removing a small fraction at a time.

We have shown that our approach of fair distribution implies group fairness such as statistical parity. However, ensuring group fairness in general is always with respect to a distribution and is only valid under the assumption that this distribution is truthful. While our approach guarantees some level of group fairness of naive Bayes classifiers, this is only true if the naive Bayes assumption holds. That is, the group fairness guarantees do not extend to using the classifier on an arbitrary population.

There is always a tension between three criteria of a probabilistic model: its fidelity, fairness, and tractability. Our approach aims to strike a balance between them by giving up some likelihood to be tractable (naive Bayes assumption) and more fair. There are certainly other valid approaches: learning a more general graphical model to increase fairness and truthfulness, which would in general make it intractable, or making the model less fair in order to make it more truthful and tractable.

Lastly, real-world algorithmic fairness problems are only solved by domain experts understanding the process that generated the data, its inherent biases, and which modeling assumptions are appropriate. Our algorithm is only a tool to assist such experts in learning fair distributions: it can present discrimination patterns to the domain expert, who can then decide which patterns need to be eliminated.

Acknowledgments

This work is partially supported by NSF grants #IIS-1633857, #CCF-1837129, DARPA XAI grant #N66001-17-2-4032, NEC Research, and gifts from Intel and Facebook Research. Golnoosh Farnadi and Behrouz Babaki are supported by postdoctoral scholarships from IVADO through the Canada First Research Excellence Fund (CFREF) grant.

Appendix A Degree of Discrimination Bound

A.1 Proof of Proposition 1

We first derive how $\widetilde{\Delta}$ represents the degree of discrimination $\Delta$ for some pattern $\mathbf{x}\mathbf{y}$ .

[TABLE]

Clearly, if $l\leq\gamma\leq u$ then $\min_{l\leq\gamma\leq u}\widetilde{\Delta}(\alpha,\beta,\gamma)\leq\widetilde{\Delta}(\alpha,\beta,\gamma)\leq\max_{l\leq\gamma\leq u}\widetilde{\Delta}(\alpha,\beta,\gamma).$ Therefore, if $l\leq P(d\,|\,\mathbf{y}\mathbf{y}^{\prime})\leq u$ , then the following holds for any $\mathbf{x}$ :

[TABLE]

Next, suppose $\mathbf{x}_{u}^{\prime}=\operatorname*{arg\,max}_{\mathbf{x}^{\prime}}P(d\,|\,\mathbf{x}\mathbf{x}^{\prime})$ and $\mathbf{x}_{l}^{\prime}=\operatorname*{arg\,min}_{\mathbf{x}^{\prime}}P(d\,|\,\mathbf{x}\mathbf{x}^{\prime})$ . Then from Lemma 1, we also have that $\mathbf{x}_{u}^{\prime}=\operatorname*{arg\,max}_{\mathbf{x}^{\prime}}P(d\,|\,\mathbf{x}\mathbf{x}^{\prime}\mathbf{y}\mathbf{y}^{\prime})$ and $\mathbf{x}_{l}^{\prime}=\operatorname*{arg\,min}_{\mathbf{x}^{\prime}}P(d\,|\,\mathbf{x}\mathbf{x}^{\prime}\mathbf{y}\mathbf{y}^{\prime})$ for any $\mathbf{y}\mathbf{y}^{\prime}$ . Therefore,

[TABLE]

A.2 Computing the Discrimination Bound

If $\alpha=P(\mathbf{x}\,|\,d)=0$ and $\beta=P(\mathbf{x}\,|\,\overline{d})=0$ , then the probability of $\mathbf{x}$ is zero and thus $P(d\,|\,\mathbf{x}\mathbf{y})$ is ill-defined. Therefore, we will assume that either $\alpha$ or $\beta$ is nonzero.

Let us write $\widetilde{\Delta}_{\alpha,\beta}(\gamma)=\widetilde{\Delta}(\alpha,\beta,\gamma)$ to denote the function restricted to fixed $\alpha$ and $\beta$ . If $\alpha=\beta$ , then $\widetilde{\Delta}_{\alpha,\beta}=0$ . Also, $\widetilde{\Delta}_{0,\beta}(\gamma)=-\gamma$ and $\widetilde{\Delta}_{\alpha,0}(\gamma)=1-\gamma$ . Thus, in the following analysis we assume $\alpha$ and $\beta$ are non-zero and distinct.

If $0<\alpha\leq\beta\leq 1$ , $\widetilde{\Delta}_{\alpha,\beta}$ is negative and convex in $\gamma$ within $0\leq\gamma\leq 1$ . On the other hand, if $0<\beta\leq\alpha\leq 1$ , then $\widetilde{\Delta}_{\alpha,\beta}$ is positive and concave. This can quickly be checked using the following derivatives.

[TABLE]

Furthermore, the sign of the derivative at $\gamma=0$ is different from that at $\gamma=1$ , and thus there must exist a unique optimum in $0\leq\gamma\leq 1$ .

Solving for $\frac{d}{d\gamma}\widetilde{\Delta}_{\alpha,\beta}(\gamma)=0$ , we get $\gamma=\frac{\beta\pm\sqrt{\alpha\beta}}{\beta-\alpha}$ . The solution corresponding to the feasible space $0\leq\gamma\leq 1$ is: $\gamma_{\text{opt}}=\frac{\beta-\sqrt{\alpha\beta}}{\beta-\alpha}.$ The optimal value is derived as the following.

[TABLE]

Next, suppose that the feasible space is restricted to $l\leq\gamma\leq u$ . Then the optimal solution is: $\gamma_{\text{opt}}$ if $l\leq\gamma_{\text{opt}}\leq u$ ; $l$ if $\gamma_{\text{opt}}<l$ ; and $u$ if $\gamma_{\text{opt}}>u$ .
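The clipping rule above is easy to check numerically. The sketch below assumes the form $\widetilde{\Delta}(\alpha,\beta,\gamma)=\frac{\alpha\gamma}{\alpha\gamma+\beta(1-\gamma)}-\gamma$ , which is our reading of the definition (it reproduces the special cases $\widetilde{\Delta}_{0,\beta}(\gamma)=-\gamma$ and $\widetilde{\Delta}_{\alpha,0}(\gamma)=1-\gamma$ and the stationary point $\gamma_{\text{opt}}$ stated above). The function names and the closed form in `optimum_value` are our own derivation under that assumption.

```python
import math

def delta_tilde(alpha, beta, gamma):
    """Assumed form of the degree of discrimination as a function of gamma."""
    return alpha * gamma / (alpha * gamma + beta * (1.0 - gamma)) - gamma

def gamma_opt(alpha, beta):
    """Stationary point in [0, 1]; requires alpha, beta > 0 and alpha != beta."""
    return (beta - math.sqrt(alpha * beta)) / (beta - alpha)

def optimum_value(alpha, beta):
    """Value at gamma_opt, simplified under the assumed form of delta_tilde."""
    sa, sb = math.sqrt(alpha), math.sqrt(beta)
    return (sa - sb) / (sa + sb)

def extremum_on_interval(alpha, beta, l, u):
    """Clip gamma_opt to [l, u] and evaluate delta_tilde there.

    This is the maximum when beta < alpha (concave case) and the
    minimum when alpha < beta (convex case).
    """
    g = min(max(gamma_opt(alpha, beta), l), u)
    return g, delta_tilde(alpha, beta, g)
```

For example, with $\alpha=0.8$ and $\beta=0.2$ the unrestricted optimum lies at $\gamma=1/3$ ; restricting to $0.5\leq\gamma\leq 0.9$ moves the maximizer to the lower end of the interval.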

A.3 Proof of Lemma 1

Now we prove that we can maximize the posterior decision probability by maximizing each variable independently. It suffices to prove that for a single variable $V$ and all evidence $\mathbf{w}$ , $\operatorname*{arg\,max}_{v}P(d\,|\,v\mathbf{w})=\operatorname*{arg\,max}_{v}\frac{P(v\,|\,d)}{P(v\,|\,\overline{d})}$ . We first express $P(d\,|\,v\mathbf{w})$ as the following:

[TABLE]

Then clearly,

[TABLE]
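The claim can be sanity-checked by brute force on a tiny model: pick each value by its likelihood ratio, then compare against exhaustive enumeration of the posterior. This is an illustrative sketch, assuming a binary class, strictly positive parameters, and a hypothetical layout `cond[feature][value] = (P(value | d), P(value | not d))`.

```python
from itertools import product

def posterior(prior_d, cond, assignment):
    """P(d | assignment) in a naive Bayes model with a binary class."""
    pos, neg = prior_d, 1.0 - prior_d
    for f, v in assignment.items():
        p_pos, p_neg = cond[f][v]
        pos *= p_pos
        neg *= p_neg
    return pos / (pos + neg)

def argmax_by_ratio(cond, features):
    """Choose each feature value independently by its likelihood ratio."""
    return {f: max(cond[f], key=lambda v: cond[f][v][0] / cond[f][v][1])
            for f in features}

def argmax_by_enumeration(prior_d, cond, features, evidence):
    """Exhaustively maximize P(d | features, evidence) over joint assignments."""
    candidates = (dict(zip(features, vals))
                  for vals in product(*(cond[f] for f in features)))
    return max(candidates,
               key=lambda a: posterior(prior_d, cond, {**a, **evidence}))
```

The per-variable choice does not depend on the evidence, which is what allows the bound in Proposition 1 to be computed without enumerating extensions.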

Appendix B Divergence Score

B.1 Derivation of Equation 2

We want to find the closed form solution of the optimization problem in Equation 1. Because $P$ and $Q$ differ only in two assignments, we can write the KL divergence as follows:

[TABLE]

Let $r$ be the change in probability of $d\mathbf{x}\mathbf{y}$ . That is, $r=Q(d\mathbf{x}\mathbf{y})-P(d\mathbf{x}\mathbf{y})$ . For $Q$ to be a valid probability distribution, we must have $Q(d\mathbf{x}\mathbf{y})+Q(\overline{d}\mathbf{x}\mathbf{y})=P(\mathbf{x}\mathbf{y})$ . Then we have $Q(d\mathbf{x}\mathbf{y})=P(d\mathbf{x}\mathbf{y})+r$ , and $Q(\overline{d}\mathbf{x}\mathbf{y})=P(\mathbf{x}\mathbf{y})-Q(d\mathbf{x}\mathbf{y})=P(\overline{d}\mathbf{x}\mathbf{y})-r$ . We can then express the KL divergence between $P$ and $Q$ as a function of $P$ and $r$ :

[TABLE]

Moreover, the discrimination score of pattern $\mathbf{x}\mathbf{y}$ w.r.t $Q$ can be expressed using $P$ and $r$ as the following:

[TABLE]

The heuristic $\operatorname{Div}_{P,d,\delta}(\mathbf{x},\mathbf{y})$ is then written using $r$ as follows:

[TABLE]

The objective function $g_{P,d,\mathbf{x},\mathbf{y}}$ is convex in $r$ with its unconstrained global minimum at $r=0$ . Note that this is a feasible point if and only if $\left\lvert\Delta_{P,d}(\mathbf{x},\mathbf{y})\right\rvert\leq\delta$ ; in other words, when the pattern $\mathbf{x}\mathbf{y}$ is already fair. Otherwise, the optimum must be either of the extreme points of the feasible space, whichever is closer to $0$ . The extreme points for the first set of inequalities are:

[TABLE]

If $\Delta_{P,d}(\mathbf{x},\mathbf{y})>\delta$ , then $r_{2}\leq r_{1}<0$ . In such case, $g(r_{2})\geq g(r_{1})$ and $-P(d\mathbf{x}\mathbf{y})\leq r_{1}\leq P(\overline{d}\mathbf{x}\mathbf{y})$ as shown below:

[TABLE]

Similarly, if $\Delta_{P,d}(\mathbf{x},\mathbf{y})<-\delta$ , then $r_{1}\geq r_{2}>0$ . Also, $g(r_{1})\geq g(r_{2})$ and $-P(d\mathbf{x}\mathbf{y})\leq r_{2}\leq P(\overline{d}\mathbf{x}\mathbf{y})$ as shown below:

[TABLE]

Hence, the optimal solution $r^{\star}$ is

[TABLE]

and the divergence score is $\operatorname{Div}_{P,d,\delta}(\mathbf{x},\mathbf{y})=g_{P,d,\mathbf{x},\mathbf{y}}(r^{\star})$ .
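The derivation can be turned into a few lines of code. The sketch below rests on assumptions that should be checked against the displayed equations: the objective is taken to be $\operatorname{KL}(P\,\|\,Q)$ , so that $g(r)=P(d\mathbf{x}\mathbf{y})\log\frac{P(d\mathbf{x}\mathbf{y})}{P(d\mathbf{x}\mathbf{y})+r}+P(\overline{d}\mathbf{x}\mathbf{y})\log\frac{P(\overline{d}\mathbf{x}\mathbf{y})}{P(\overline{d}\mathbf{x}\mathbf{y})-r}$ , and the extreme points are obtained by solving $\Delta_{Q,d}(\mathbf{x},\mathbf{y})=\pm\delta$ with $Q(d\mathbf{x}\mathbf{y})=P(d\mathbf{x}\mathbf{y})+r$ and $Q(d\mathbf{y})=P(d\mathbf{y})+r$ . The function name and argument names are hypothetical.

```python
import math

def divergence_score(p_dxy, p_ndxy, p_dy, p_y, delta):
    """Return (r_star, divergence) for a pattern xy (illustrative sketch).

    p_dxy, p_ndxy : P(d, x, y) and P(not d, x, y), assumed strictly positive
    p_dy, p_y     : P(d, y) and P(y)
    delta         : fairness threshold
    Assumes the objective is KL(P || Q); see the lead-in.
    """
    p_xy = p_dxy + p_ndxy
    disc = p_dxy / p_xy - p_dy / p_y       # Delta_{P,d}(x, y)
    if abs(disc) <= delta:
        return 0.0, 0.0                    # already fair: r = 0 is feasible
    # Delta_Q is linear in r with this slope (positive when P(xy) < P(y)).
    slope = 1.0 / p_xy - 1.0 / p_y
    target = delta if disc > delta else -delta
    r = (target - disc) / slope            # extreme point closest to 0
    g = (p_dxy * math.log(p_dxy / (p_dxy + r))
         + p_ndxy * math.log(p_ndxy / (p_ndxy - r)))
    return r, g
```

When the pattern is discriminating, the returned $r^{\star}$ moves just enough probability mass between $d\mathbf{x}\mathbf{y}$ and $\overline{d}\mathbf{x}\mathbf{y}$ to bring the discrimination score to the threshold.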

B.2 Upper Bounds on Divergence Score

Here we present two upper bounds on the divergence score for pruning the search tree. The first bound uses the observation that the hypothetical distribution $Q$ with $\Delta_{Q,d}(\mathbf{x},\mathbf{y})=0$ is always a feasible hypothetical fair distribution.

Proposition 3.

Let $P$ be a Naive Bayes distribution over $D\cup\mathbf{Z}$ , and let $\mathbf{x}$ and $\mathbf{y}$ be joint assignments to $\mathbf{X}\subseteq\mathbf{S}$ and $\mathbf{Y}\subseteq\mathbf{Z}\setminus\mathbf{X}$ . For all possible valid extensions $\mathbf{x}^{\prime}$ and $\mathbf{y}^{\prime}$ , the following holds:

[TABLE]

Proof.

Consider the following point:

[TABLE]

First, we show that above $r_{0}$ is always a feasible point in Problem 3:

[TABLE]

Then the divergence score for any pattern must be smaller than $g_{P,d,\mathbf{x},\mathbf{y}}(r_{0})$ :

[TABLE]

Here, we use $\overline{\mathbf{x}}$ to mean that $\mathbf{x}$ does not hold. In other words,

[TABLE]

We can then use this to bound the divergence score of any pattern extended from $\mathbf{x}\mathbf{y}$ :

[TABLE]

∎

We can also bound the divergence score using the maximum and minimum possible discrimination scores shown in Proposition 1, in place of the current pattern’s discrimination. Let us denote the bounds for discrimination score as follows:

[TABLE]

Proposition 4.

Let $P$ be a Naive Bayes distribution over $D\cup\mathbf{Z}$ , and let $\mathbf{x}$ and $\mathbf{y}$ be joint assignments to $\mathbf{X}\subseteq\mathbf{S}$ and $\mathbf{Y}\subseteq\mathbf{Z}\setminus\mathbf{X}$ . For all possible valid extensions $\mathbf{x}^{\prime}$ and $\mathbf{y}^{\prime}$ , $\operatorname{Div}_{P,d,\delta}(\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime})\leq\max\left(g_{P,d,\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime}}(r_{u}),g_{P,d,\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime}}(r_{l})\right)$ where

[TABLE]

Proof.

The proof proceeds by case analysis on the discrimination score of extended patterns $\mathbf{x}\mathbf{x}^{\prime}\mathbf{y}\mathbf{y}^{\prime}$ .

First, if $\left\lvert\Delta(\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime})\right\rvert\leq\delta$ , $\operatorname{Div}_{P,d,\delta}(\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime})=0$ which is the global minimum, and thus is smaller than both $g(r_{u})$ and $g(r_{l})$ .

Next, suppose $\Delta(\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime})>\delta$ . Then from Proposition 1,

[TABLE]

As $g$ is convex with its minimum at 0, we can conclude $\operatorname{Div}_{P,d,\delta}(\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime})=g(r^{\star})\leq g(r_{u})$ .

Finally, if $\Delta(\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime})<-\delta$ , we have

[TABLE]

Similarly, this implies $\operatorname{Div}_{P,d,\delta}(\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime})=g(r^{\star})\leq g(r_{l})$ . Because the divergence score is always smaller than either $g(r_{u})$ or $g(r_{l})$ , it must be smaller than $\max(g(r_{u}),g(r_{l}))$ . ∎

Lastly, we show how to efficiently compute an upper bound on $g_{P,d,\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime}}(r_{u})$ and $g_{P,d,\mathbf{x}\mathbf{x}^{\prime},\mathbf{y}\mathbf{y}^{\prime}}(r_{l})$ from Proposition 4 for all patterns extended from $\mathbf{x}\mathbf{y}$ . This is necessary for pruning during the search for discrimination patterns with high divergence scores. First, note that $r_{u}$ and $r_{l}$ can be expressed as

[TABLE]

where $c=\delta-\overline{\Delta}(\mathbf{x},\mathbf{y})$ for $r_{u}$ and $c=-\delta-\underline{\Delta}(\mathbf{x},\mathbf{y})$ for $r_{l}$ . Hence, it suffices to derive the following bound.

[TABLE]

Appendix C Proof of Proposition 2

The probability values of positive decision in terms of naive Bayes parameters $\theta$ are as follows:

[TABLE]

For simplicity of notation, let us write:

[TABLE]

Then the degree of discrimination is $\Delta_{P_{\theta},d}(\mathbf{x},\mathbf{y})=P_{\theta}(d\,|\,\mathbf{x}\mathbf{y})-P_{\theta}(d\,|\,\mathbf{y})=\frac{1}{1+r_{\mathbf{x}}r_{\mathbf{y}}}-\frac{1}{1+r_{\mathbf{y}}}$ . Now we express the fairness constraint $\left\lvert\Delta_{P_{\theta},d}(\mathbf{x},\mathbf{y})\right\rvert\leq\delta$ as the following two inequalities:

[TABLE]

After simplifying,

[TABLE]

We further express this as the following two signomial inequality constraints:

[TABLE]

Note that $r_{\mathbf{x}}$ and $r_{\mathbf{y}}$ according to Equation 5 are monomials of $\theta$ , and thus the above constraints are also signomial with respect to the optimization variables $\theta$ . ∎
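The expression $\Delta=\frac{1}{1+r_{\mathbf{x}}r_{\mathbf{y}}}-\frac{1}{1+r_{\mathbf{y}}}$ is straightforward to evaluate once the two ratios are known. The sketch below assumes $r_{\mathbf{x}}=\prod_{x}\theta_{x\,|\,\bar{d}}/\theta_{x\,|\,d}$ and $r_{\mathbf{y}}=\frac{\theta_{\bar{d}}}{\theta_{d}}\prod_{y}\theta_{y\,|\,\bar{d}}/\theta_{y\,|\,d}$ , which is the reading consistent with $P_{\theta}(d\,|\,\mathbf{x}\mathbf{y})=\frac{1}{1+r_{\mathbf{x}}r_{\mathbf{y}}}$ ; the data layout `cond[feature][value] = (theta_{value|d}, theta_{value|not d})` and the function names are illustrative.

```python
from math import prod

def odds_ratios(theta_d, cond, x, y):
    """Return (r_x, r_y) for assignments x and y (dicts feature -> value)."""
    r_x = prod(cond[f][v][1] / cond[f][v][0] for f, v in x.items())
    r_y = (1.0 - theta_d) / theta_d * prod(
        cond[f][v][1] / cond[f][v][0] for f, v in y.items())
    return r_x, r_y

def discrimination(theta_d, cond, x, y):
    """Degree of discrimination P(d | x, y) - P(d | y) via the two ratios."""
    r_x, r_y = odds_ratios(theta_d, cond, x, y)
    return 1.0 / (1.0 + r_x * r_y) - 1.0 / (1.0 + r_y)

def is_delta_fair_pattern(theta_d, cond, x, y, delta):
    """Check the fairness constraint |Delta| <= delta for one pattern."""
    return abs(discrimination(theta_d, cond, x, y)) <= delta
```

Since both ratios are products of parameters and their reciprocals, clearing denominators in $\lvert\Delta\rvert\leq\delta$ yields polynomial-like inequalities in $\theta$ , which is the structure a signomial program accepts.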

Appendix D Additional Experiments

Here we present the full set of experiments referred to in Q1 of Section 4.3.

Bibliography

[Calders and Verwer 2010] Calders, T., and Verwer, S. 2010. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery 21(2):277–292.
[Chouldechova 2017] Chouldechova, A. 2017. Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5(2):153–163.
[Darwiche 2009] Darwiche, A. 2009. Modeling and reasoning with Bayesian networks. Cambridge University Press.
[Datta, Tschantz, and Datta 2015] Datta, A.; Tschantz, M. C.; and Datta, A. 2015. Automated experiments on ad privacy settings. Proceedings on Privacy Enhancing Technologies 2015(1):92–112.
[Dechter 2013] Dechter, R. 2013. Reasoning with probabilistic and deterministic graphical models: Exact algorithms. Synthesis Lectures on Artificial Intelligence and Machine Learning 7(3):1–191.
[Dwork et al. 2012] Dwork, C.; Hardt, M.; Pitassi, T.; Reingold, O.; and Zemel, R. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, 214–226. ACM.
[Ecker 1980] Ecker, J. G. 1980. Geometric programming: methods, computations and applications. SIAM Review 22(3):338–362.
[Farnadi, Babaki, and Getoor 2018] Farnadi, G.; Babaki, B.; and Getoor, L. 2018. Fairness in relational domains. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 108–114. ACM.
