Variational Fair Clustering

Imtiaz Masud Ziko; Eric Granger; Jing Yuan; Ismail Ben Ayed

arXiv:1906.08207·cs.LG·December 7, 2020


TL;DR

This paper introduces a scalable variational framework for fair clustering that balances fairness and clustering objectives, outperforming existing spectral methods in efficiency and flexibility.

Contribution

It presents a novel variational approach with a tight upper bound for fair clustering, enabling scalable, distributed optimization without eigenvalue decomposition.

Findings

01. Achieves competitive fairness and clustering quality on benchmarks.

02. Offers scalable, distributed optimization suitable for large datasets.

03. Does not require spectral eigenvalue computations.


Tables

Table 2: Comparison of the proposed Fair K-median to (Backurs et al. 2019).

| Dataset | Objective: Backurs et al. 2019 | Objective: Ours | Fairness error / balance: Backurs et al. 2019 | Fairness error / balance: Ours |
|---|---|---|---|---|
| Synthetic ($N = 400$, $J = 2$, $\lambda = 10$) | 299.859 | 292.4 | 0.00 / 1.00 | 0.00 / 1.00 |
| Synthetic-unequal ($N = 400$, $J = 2$, $\lambda = 10$) | 185.47 | 174.81 | 0.77 / 0.21 | 0.00 / 0.33 |
| Adult ($N = 32{,}561$, $J = 2$, $\lambda = 9000$) | 19330.93 | 18467.75 | 0.27 / 0.31 | 0.01 / 0.43 |
| Bank ($N = 41{,}108$, $J = 3$, $\lambda = 9000$) | N/A | 19527.08 | N/A | 0.02 / 0.18 |
| Census II ($N = 2{,}458{,}285$, $J = 2$, $\lambda = 500000$) | 2385997.92 | 1754109.46 | 0.41 / 0.38 | 0.02 / 0.78 |

Table 3: Comparison of the proposed Fair K-means to (Bera et al. 2019).

| Dataset | Objective: Bera et al. 2019 | Objective: Ours | Fairness error / balance: Bera et al. 2019 | Fairness error / balance: Ours |
|---|---|---|---|---|
| Synthetic ($N = 400$, $J = 2$, $\lambda = 10$) | 758.43 | 207.80 | 0.00 / 1.00 | 0.00 / 1.00 |
| Synthetic-unequal ($N = 400$, $J = 2$, $\lambda = 10$) | 180.00 | 159.75 | 0.00 / 0.33 | 0.00 / 0.33 |
| Adult ($N = 32{,}561$, $J = 2$, $\lambda = 9000$) | 10913.84 | 9984.01 | 0.018 / 0.41 | 0.018 / 0.41 |
| Bank ($N = 41{,}108$, $J = 3$, $\lambda = 6000$) | 11331.51 | 9392.20 | 0.03 / 0.16 | 0.05 / 0.17 |
| Census II ($N = 2{,}458{,}285$, $J = 2$, $\lambda = 500000$) | 1355457.02 | 1018996.53 | 0.07 / 0.77 | 0.02 / 0.78 |

Table 4: Comparison of the proposed Fair Ncut to (Kleindessner et al. 2019).

| Dataset | Objective: Kleindessner et al. 2019 | Objective: Ours | Fairness error / balance: Kleindessner et al. 2019 | Fairness error / balance: Ours |
|---|---|---|---|---|
| Synthetic ($N = 400$, $J = 2$, $\lambda = 10$) | 0.0 | 0.0 | 0.00 / 1.00 | 0.0 / 1.00 |
| Synthetic-unequal ($N = 400$, $J = 2$, $\lambda = 10$) | 0.03 | 0.06 | 0.00 / 0.33 | 0.00 / 0.33 |
| Adult ($N = 32{,}561$, $J = 2$, $\lambda = 10$) | 0.47 | 0.74 | 0.06 / 0.32 | 0.08 / 0.30 |
| Bank ($N = 41{,}108$, $J = 3$, $\lambda = 40$) | N/A | 0.58 | N/A | 0.39 / 0.14 |
| Census II ($N = 2{,}458{,}285$, $J = 2$, $\lambda = 100$) | N/A | 0.52 | N/A | 0.41 / 0.43 |

Equations

$$\mbox{balance}(S_{k})=\min_{j\neq j^{\prime}}\frac{V_{j}^{t}S_{k}}{V_{j^{\prime}}^{t}S_{k}}\in[0,1]$$

$$\min_{\mathbf{S}}\;\mathcal{F}(\mathbf{S})+\lambda\sum_{k}\mathcal{D}_{\mbox{KL}}(U||P_{k})\quad\mbox{s.t.}\quad\mathbf{s}_{p}\in\nabla_{K}\;\forall p$$

$$P_{k}=[P(j|k)];\quad P(j|k)=\frac{V_{j}^{t}S_{k}}{\mathbf{1}^{t}S_{k}}\;\forall j$$

$$\mathcal{E}(\mathbf{S})=\underbrace{\mathcal{F}(\mathbf{S})}_{\mbox{clustering}}+\lambda\underbrace{\sum_{k}\sum_{j}-\mu_{j}\log P(j|k)}_{\mbox{fairness}}$$

$$-\mu_{j}\log P(j|k)=\underbrace{\mu_{j}\log\mathbf{1}^{t}S_{k}}_{\mbox{concave}}\;\underbrace{-\,\mu_{j}\log V_{j}^{t}S_{k}}_{\mbox{convex}}$$

$$\mathcal{E}(\mathbf{S})$$

$$\mathcal{E}(\mathbf{S}^{i})$$

$$\mathbf{S}^{i+1}=\arg\min_{\mathbf{S}}\mathcal{A}_{i}(\mathbf{S})$$

$$\mathcal{E}(\mathbf{S}^{i+1})\leq\mathcal{A}_{i}(\mathbf{S}^{i+1})\leq\mathcal{A}_{i}(\mathbf{S}^{i})=\mathcal{E}(\mathbf{S}^{i})$$

$$\mathcal{G}_{i}(\mathbf{S})$$

$$\mathbf{b}_{p}^{i}=[b_{p,1}^{i},\dots,b_{p,K}^{i}]$$

$$\mathcal{H}_{i}(\mathbf{S})=\sum_{p=1}^{N}\mathbf{s}_{p}^{t}\mathbf{a}_{p}^{i}$$

$$\mathcal{A}_{i}(\mathbf{S})=\sum_{p=1}^{N}\mathbf{s}_{p}^{t}(\mathbf{a}_{p}^{i}+\lambda\mathbf{b}_{p}^{i}+\log\mathbf{s}_{p}-\log\mathbf{s}_{p}^{i})$$

$$\min_{\mathbf{s}_{p}\in\nabla_{K}}\;\mathbf{s}_{p}^{t}(\mathbf{a}_{p}^{i}+\lambda\mathbf{b}_{p}^{i}+\log\mathbf{s}_{p}-\log\mathbf{s}_{p}^{i}),\;\forall p$$

$$\mathbf{s}_{p}^{i+1}=\frac{\mathbf{s}_{p}^{i}\exp(-(\mathbf{a}_{p}^{i}+\lambda\mathbf{b}_{p}^{i}))}{\mathbf{1}^{t}[\mathbf{s}_{p}^{i}\exp(-(\mathbf{a}_{p}^{i}+\lambda\mathbf{b}_{p}^{i}))]}\;\forall p$$

$$-\mu_{j}\log P(j|k)$$

$$g_{1}(S_{k})$$

$$\tilde{g}_{1}(\mathbf{S})$$

$$g_{2}(S_{k})$$

$$\tilde{g}_{2}(\mathbf{S})$$

$$\nabla^{2}(g_{2}(S_{k}^{i}))=\frac{\mu_{j}}{(V_{j}^{t}S_{k}^{i})^{2}}V_{j}V_{j}^{t}.$$

$$f(x)\leq f(y)+[\nabla f(y)]^{t}(x-y)+L\cdot\|x-y\|^{2}$$

$$f(x)$$

$$\mathcal{D}_{\mbox{KL}}(x||y)\geq\frac{1}{2}\|x-y\|^{2}$$

$$\mathcal{D}_{\mbox{KL}}(x||y)=\sum_{k}x_{k}\log\frac{x_{k}}{y_{k}}$$

$$q_{o}(x)\geq q_{o}(y)+[\nabla q_{o}(y)]^{t}(x-y)+\frac{1}{2}\|x-y\|^{2}$$


Full text


Abstract

We propose a general variational framework of fair clustering, which integrates an original Kullback-Leibler (KL) fairness term with a large class of clustering objectives, including prototype-based and graph-based objectives. Fundamentally different from the existing combinatorial and spectral solutions, our variational multi-term approach makes it possible to control the trade-off levels between the fairness and clustering objectives. We derive a general tight upper bound based on a concave-convex decomposition of our fairness term, its Lipschitz-gradient property and Pinsker's inequality. Our tight upper bound can be jointly optimized with various clustering objectives, while yielding a scalable solution with a convergence guarantee. Interestingly, at each iteration, it performs an independent update for each assignment variable. Therefore, it can be easily distributed for large-scale datasets. This scalability is important as it makes it possible to explore different trade-off levels between the fairness and clustering objectives. Unlike spectral relaxation, our formulation does not require computing an eigenvalue decomposition. We report comprehensive evaluations and comparisons with state-of-the-art methods over various fair-clustering benchmarks, which show that our variational formulation can yield highly competitive solutions in terms of fairness and clustering objectives. Code is available at: https://github.com/imtiazziko/Variational-Fair-Clustering.

Introduction

Machine learning models are impacting our daily life, for instance, in marketing, finance, education, and even in sentencing recommendations (Kleinberg et al. 2017). However, these models may exhibit biases towards specific demographic groups due to, for instance, the biases that exist within the data. For example, a higher level of face recognition accuracy may be found with white males (Buolamwini and Gebru 2018), and a high probability of recidivism tends to be incorrectly predicted for low-risk African-Americans (Julia et al. 2016). These biases have recently triggered substantial interest in designing fair algorithms for the supervised learning setting (Hardt, Price, and Srebro 2016; Zafar et al. 2017; Donini et al. 2018). Also, very recently, the community started to investigate fairness constraints in unsupervised learning (Chierichetti et al. 2017; Kleindessner et al. 2019; Backurs et al. 2019; Samadi et al. 2018; Celis et al. 2018). Specifically, Chierichetti et al. (Chierichetti et al. 2017) pioneered the concept of fair clustering. The problem consists of embedding fairness constraints that encourage clusters to have balanced demographic groups pertaining to some sensitive attributes (e.g., sex, gender, race, etc.), so as to counteract any form of data-inherent bias.

Assume that we are given $N$ data points to be assigned to a set of $K$ clusters, and let $S_{k}\in\{0,1\}^{N}$ denote a binary indicator vector whose components take value $1$ when the point is within cluster $k$, and $0$ otherwise. Also suppose that the data contains $J$ different demographic groups, with $V_{j}\in\{0,1\}^{N}$ denoting a binary indicator vector of demographic group $j$. The authors of (Chierichetti et al. 2017; Kleindessner et al. 2019) suggested evaluating fairness in terms of cluster-balance measures, which take the following form:

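$$\mbox{balance}(S_{k})=\min_{j\neq j^{\prime}}\frac{V_{j}^{t}S_{k}}{V_{j^{\prime}}^{t}S_{k}}\in[0,1] \tag{1}$$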

The higher this measure, the fairer the cluster. The overall clustering balance is defined by the minimum of Eq. (1) over $k$ . This notion of fairness in clusters has recently given rise to a new line of research that was introduced, mostly, for prototype-based clustering, e.g., K-center, K-median and K-means (Chierichetti et al. 2017; Backurs et al. 2019; Schmidt, Schwiegelshohn, and Sohler 2018; Bera et al. 2019). Also, very recently, fairness has been investigated in the context of spectral graph clustering (Kleindessner et al. 2019). The general problem raises several interesting questions. How to embed fairness in popular clustering objectives? Can we control the trade-off between some “acceptable” fairness level (or tolerance) and the quality of the clustering objective? What is the cost of fairness with respect to the clustering objective and computational complexity?
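To make the balance measure concrete, here is a minimal sketch that evaluates Eq. (1) for hard assignments given as integer labels. The function names are illustrative and do not come from the authors' code, and returning 0 for a cluster that misses a demographic group entirely is a convention assumed here.

```python
def cluster_balance(labels, groups, k, num_groups):
    """Balance of cluster k: min over j != j' of count_j / count_j' (Eq. 1)."""
    counts = [0] * num_groups
    for label, group in zip(labels, groups):
        if label == k:
            counts[group] += 1
    if min(counts) == 0:
        # Assumed convention: a cluster missing a group has balance 0.
        return 0.0
    # The smallest ratio over ordered pairs is (smallest count) / (largest count).
    return min(counts) / max(counts)


def clustering_balance(labels, groups, num_clusters, num_groups):
    """Overall balance: the minimum cluster balance over all clusters."""
    return min(
        cluster_balance(labels, groups, k, num_groups)
        for k in range(num_clusters)
    )


# Two clusters of four points each, with group counts (3, 1) and (1, 3).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
groups = [0, 0, 0, 1, 0, 1, 1, 1]
overall = clustering_balance(labels, groups, num_clusters=2, num_groups=2)
```

In this toy example both clusters have a 3-to-1 group ratio, so the overall balance is 1/3; a clustering in which every cluster holds equal counts of both groups would reach the maximum value of 1.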

Chierichetti et al. (Chierichetti et al. 2017) investigated combinatorial approximation algorithms for maximizing the fairness measures in Eq. (1), for K-center and K-median clustering, and for binary demographic groups ($J=2$). They compute fairlets, which are groups of points that are fair and cannot be split further into subsets that are also fair. Then, they consider each fairlet as a data point, and cluster them with approximate K-center or K-median algorithms. Unfortunately, as reported in the experiments in (Chierichetti et al. 2017), obtaining fair solutions with these fairlet-based algorithms comes at the price of a substantial increase in the clustering objectives. Also, the cost of finding fairlets with perfect matching is quadratic w.r.t. the number of data points, a complexity that increases for more than two demographic groups. Several combinatorial solutions followed up on the work in (Chierichetti et al. 2017) to reduce this complexity. For instance, Backurs et al. (Backurs et al. 2019) proposed a solution to make the fairlet decomposition in (Chierichetti et al. 2017) scalable for $J=2$, by embedding the input points in a tree metric. Rösner and Schmidt (Rösner and Schmidt 2018) designed a 14-approximate algorithm for fair K-center. (Schmidt, Schwiegelshohn, and Sohler 2018; Huang, Jiang, and Vishnoi 2019) proposed fair K-means/K-median based on coresets, i.e., reduced proxy sets for the full dataset. Bera et al. (Bera et al. 2019) provided a bi-criteria approximation algorithm for fair prototype-based clustering, enabling multiple groups ($J>2$). It is worth noting that, for large-scale data sets, (Chierichetti et al. 2017; Rösner and Schmidt 2018; Bera et al. 2019) sub-sample the inputs to mitigate the quadratic complexity w.r.t. $N$. More importantly, the combinatorial algorithms discussed above are tailored for specific prototype-based objectives. For instance, they are not applicable to the very popular graph-clustering objectives, e.g., Ratio Cut or Normalized Cut (Von Luxburg 2007), which limits applicability in a breadth of graph problems, in which data takes the form of pairwise affinities.

Kleindessner et al. (Kleindessner et al. 2019) integrated fairness into graph-clustering objectives. They embedded linear constraints on the assignment matrix in spectral relaxation. Then, they solved a constrained trace optimization via finding the $K$ smallest eigenvalues of some transformed Laplacian matrix. However, it is well-known that spectral relaxation has heavy time and memory loads since it requires storing an $N\times N$ affinity matrix and computing its eigenvalue decomposition: the complexity is cubic w.r.t. $N$ for a straightforward implementation, and super-quadratic for fast implementations (Tian et al. 2014). In the general context of spectral relaxation and graph partitioning, issues related to computational scalability for large-scale problems are driving an active line of recent work (Shaham et al. 2018; Ziko, Granger, and Ayed 2018; Vladymyrov and Carreira-Perpiñán 2016).

The existing fair clustering algorithms, such as the combinatorial or spectral solutions discussed above, do not have mechanisms that control the trade-off levels between the fairness and clustering objectives. Also, they are tailored either to prototype-based (Backurs et al. 2019; Bera et al. 2019; Chierichetti et al. 2017; Schmidt, Schwiegelshohn, and Sohler 2018) or graph-based objectives (Kleindessner et al. 2019). Finally, for a breadth of problems of wide interest, such as pairwise graph data, the computation and memory loads may become an issue for large-scale data sets.

Contributions: We propose a general, variational and bound-optimization framework of fair clustering, which integrates an original Kullback-Leibler (KL) fairness term with a large class of clustering objectives, including both prototype-based (e.g., K-means/K-median) and graph-based (e.g., Normalized Cut or Ratio Cut). Fundamentally different from the existing combinatorial and spectral solutions, our variational multi-term approach makes it possible to control the trade-off levels between the fairness and clustering objectives. We derive a general tight upper bound based on a concave-convex decomposition of our fairness term, its Lipschitz-gradient property and Pinsker's inequality. Our tight upper bound can be jointly optimized with various clustering objectives, while yielding a scalable solution with a convergence guarantee. Interestingly, at each iteration, our general variational fair-clustering algorithm performs an independent update for each assignment variable. Therefore, it can easily be distributed for large-scale datasets. This scalability is important as it makes it possible to explore different trade-off levels between fairness and the clustering objective. Unlike the constrained spectral relaxation in (Kleindessner et al. 2019), our formulation does not require computing an eigenvalue decomposition. We report comprehensive evaluations and comparisons with state-of-the-art methods over various fair-clustering benchmarks, which show that our variational method can yield highly competitive solutions in terms of fairness and clustering objectives, while being scalable and flexible.

Proposed formulation

Let ${\mathbf{X}}=\{\mathbf{x}_{p}\in\mathbb{R}^{M},p=1,\dots,N\}$ denote a set of $N$ data points to be assigned to $K$ clusters, and let ${\mathbf{S}}$ be a soft cluster-assignment vector: ${\mathbf{S}}=[{\mathbf{s}}_{1},\dots,{\mathbf{s}}_{N}]\in[0,1]^{NK}$. For each point $p$, ${\mathbf{s}}_{p}=[s_{p,k}]\in[0,1]^{K}$ is a probability simplex vector verifying $\sum_{k}s_{p,k}=1$. Suppose that the data set contains $J$ different demographic groups, with vector $V_{j}=[v_{j,p}]\in\{0,1\}^{N}$ indicating point assignment to group $j$: $v_{j,p}=1$ if data point $p$ is in group $j$ and $0$ otherwise. We propose the following general variational formulation for optimizing any clustering objective ${\cal F}({\mathbf{S}})$ with a fairness penalty, while constraining each ${\mathbf{s}}_{p}$ within the $K$-dimensional probability simplex $\nabla_{K}=\{{\mathbf{y}}\in[0,1]^{K}\;|\;{\mathbf{1}}^{t}{\mathbf{y}}=1\}$:

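$$\min_{\mathbf{S}}\;\mathcal{F}(\mathbf{S})+\lambda\sum_{k}\mathcal{D}_{\mbox{KL}}(U||P_{k})\quad\mbox{s.t.}\quad\mathbf{s}_{p}\in\nabla_{K}\;\forall p \tag{2}$$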

$\mathcal{D}_{\mbox{KL}}(U||P_{k})$ denotes the Kullback-Leibler (KL) divergence between the given (required) demographic proportions $U=[\mu_{j}]$ and the marginal probabilities of the demographics within cluster $k$ :

$$P_{k}=[P(j|k)];\quad P(j|k)=\frac{V_{j}^{t}S_{k}}{\mathbf{1}^{t}S_{k}}\;\forall j, \tag{3}$$

where $S_{k}=[s_{p,k}]\in[0,1]^{N}$ is the $N$-dimensional vector containing point assignments to cluster $k$, and $t$ denotes the transpose operator. (The set of $N$-dimensional vectors $S_{k}$ and the set of simplex vectors ${\mathbf{s}}_{p}$ are two equivalent ways of representing the assignment variables. We use $S_{k}$ here for a clearer presentation of the problem, whereas, as will become clearer later, the simplex vectors ${\mathbf{s}}_{p}$ are more convenient in the subsequent optimization part.) Notice that, at the vertices of the simplex (i.e., for hard binary assignments), $V_{j}^{t}S_{k}$ counts the number of points within the intersection of demographic $j$ and cluster $k$, whereas ${\mathbf{1}}^{t}S_{k}$ is the total number of points within cluster $k$.
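As an illustration of these quantities, the sketch below computes the within-cluster proportions $P(j|k)$ and the KL fairness term for one cluster, using plain Python lists. The names `group_proportions` and `kl_fairness` are hypothetical, `S[p][k]` holds the (soft or hard) assignment of point $p$ to cluster $k$, and `V[j][p]` is the group indicator; returning infinity when a required group is absent from the cluster is a convention assumed here.

```python
import math


def group_proportions(S, V, k):
    """P(j|k) = V_j^t S_k / 1^t S_k for every group j."""
    mass = sum(row[k] for row in S)  # 1^t S_k
    return [
        sum(v_p * row[k] for v_p, row in zip(V_j, S)) / mass  # V_j^t S_k / mass
        for V_j in V
    ]


def kl_fairness(U, P_k):
    """KL divergence D_KL(U || P_k) between target and cluster proportions."""
    total = 0.0
    for mu, prob in zip(U, P_k):
        if mu == 0:
            continue  # a term with mu = 0 contributes nothing
        if prob == 0:
            return math.inf  # assumed convention for an absent group
        total += mu * math.log(mu / prob)
    return total


# Four points, two clusters (hard assignments), two groups.
S = [[1, 0], [1, 0], [1, 0], [0, 1]]
V = [[1, 1, 0, 0], [0, 0, 1, 1]]
P_0 = group_proportions(S, V, k=0)    # proportions inside cluster 0
error_0 = kl_fairness([0.5, 0.5], P_0)
```

Cluster 0 contains two points of the first group and one of the second, so its proportions are (2/3, 1/3) and the fairness term is strictly positive; it vanishes only when the proportions match the target $U$.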

Parameter $\lambda$ controls the trade-off between the clustering objective and the fairness penalty. The problem in (2) is challenging due to the ratios of summations in the fairness penalty and the simplex constraints. Expanding the KL term $\mathcal{D}_{\mbox{KL}}(U||P_{k})$ and discarding the constant $\mu_{j}\log\mu_{j}$, our objective in (2) becomes equivalent to minimizing the following functional with respect to the relaxed assignment variables, and subject to the simplex constraints:

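$$\mathcal{E}(\mathbf{S})=\underbrace{\mathcal{F}(\mathbf{S})}_{\mbox{clustering}}+\lambda\underbrace{\sum_{k}\sum_{j}-\mu_{j}\log P(j|k)}_{\mbox{fairness}} \tag{4}$$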

Observe that, in Eq. (4), the fairness penalty becomes a cross-entropy between the given (target) proportion $U$ and the marginal probabilities $P_{k}$ of the demographics within cluster $k$ . Notice that our fairness penalty decomposes into convex and concave parts:

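$$-\mu_{j}\log P(j|k)=\underbrace{\mu_{j}\log\mathbf{1}^{t}S_{k}}_{\mbox{concave}}\;\underbrace{-\,\mu_{j}\log V_{j}^{t}S_{k}}_{\mbox{convex}} \tag{5}$$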

This enables us to derive the following tight bounds (auxiliary functions) for minimizing our general fair-clustering model in (4) using a quadratic bound and Lipschitz-gradient property of the convex part, along with Pinsker’s inequality, and a first-order bound on the concave part.
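The two algebraic steps behind the cross-entropy form and the concave-convex split can be written out explicitly. The following is a short worked expansion using only the definitions of $\mathcal{D}_{\mbox{KL}}$ and $P(j|k)$ given above:

```latex
\begin{aligned}
\mathcal{D}_{\mathrm{KL}}(U\,\|\,P_k)
  &= \sum_j \mu_j \log\frac{\mu_j}{P(j|k)}
   = \underbrace{\sum_j \mu_j \log \mu_j}_{\text{constant in } \mathbf{S}}
     \;-\; \sum_j \mu_j \log P(j|k), \\
-\mu_j \log P(j|k)
  &= -\mu_j \log \frac{V_j^{t} S_k}{\mathbf{1}^{t} S_k}
   = \mu_j \log \mathbf{1}^{t} S_k \;-\; \mu_j \log V_j^{t} S_k .
\end{aligned}
```

Since the logarithm of a linear function of $S_{k}$ is concave and $\mu_{j}\geq 0$, the first term of the second line is concave in $S_{k}$, while the second term, its negative counterpart, is convex.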

Definition 1

${\cal A}_{i}({\mathbf{S}})$ is an auxiliary function of objective ${\cal E}({\mathbf{S}})$ if it is a tight upper bound at current solution ${\mathbf{S}}^{i}$, i.e., it satisfies the following conditions:

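$$\mathcal{E}(\mathbf{S})\leq\mathcal{A}_{i}(\mathbf{S}),\qquad\mathcal{E}(\mathbf{S}^{i})=\mathcal{A}_{i}(\mathbf{S}^{i}) \tag{6}$$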

where $i$ is the iteration index.

Bound optimizers, also commonly referred to as Majorize-Minimize (MM) algorithms (Zhang, Kwok, and Yeung 2007), update the current solution ${\mathbf{S}}^{i}$ to the next by optimizing the auxiliary function:

$$\mathbf{S}^{i+1}=\arg\min_{\mathbf{S}}\mathcal{A}_{i}(\mathbf{S}) \tag{7}$$

When these updates correspond to the global optima of the auxiliary functions, MM procedures enjoy a strong guarantee: The original objective function ${\cal E}({\mathbf{S}})$ does not increase at each iteration:

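$$\mathcal{E}(\mathbf{S}^{i+1})\leq\mathcal{A}_{i}(\mathbf{S}^{i+1})\leq\mathcal{A}_{i}(\mathbf{S}^{i})=\mathcal{E}(\mathbf{S}^{i}) \tag{8}$$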

This general principle is widely used in machine learning as it transforms a difficult problem into a sequence of easier sub-problems (Zhang, Kwok, and Yeung 2007). Examples of well-known bound optimizers include concave-convex procedures (CCCP) (Yuille and Rangarajan 2001), expectation maximization (EM) algorithms and submodular-supermodular procedures (SSP) (Narasimhan and Bilmes 2005), among others. The main technical difficulty in bound optimization is how to derive an auxiliary function. In the following, we derive auxiliary functions for our general fair-clustering objectives in (4).

Proposition 1 (Bound on the fairness penalty)

Given current clustering solution ${\mathbf{S}}^{i}$ at iteration $i$ , we have the following auxiliary function on the fairness term in (4), up to additive and multiplicative constants, and for current solutions in which each demographic is represented by at least one point in each cluster (i.e., $V_{j}^{t}S_{k}^{i}\geq 1\,\forall\,j,k$ ):

[TABLE]

where $L$ is some positive Lipschitz-gradient constant verifying $L\leq N$ .

Proof: We provide a detailed proof in the supplemental material. Here, we give the main technical ingredients for obtaining our bound. We use a quadratic bound and a Lipschitz-gradient property for the convex part, and a first-order bound on the concave part. We further bound the quadratic distances between simplex variables with Pinsker's inequality (Csiszar and Körner 2011). This step completely avoids point-wise Lagrangian-dual projections and inner iterations for handling the simplex constraints, yielding scalable (parallel) updates with a convergence guarantee.

Proposition 2 (Bound on the clustering objective)

Given current clustering solution ${\mathbf{S}}^{i}$ at iteration $i$ , the auxiliary functions for several popular clustering objectives ${\cal F}({\mathbf{S}})$ take the following general form:

$$\mathcal{H}_{i}(\mathbf{S})=\sum_{p=1}^{N}\mathbf{s}_{p}^{t}\mathbf{a}_{p}^{i} \tag{10}$$

where point-wise (unary) potentials ${\mathbf{a}}_{p}^{i}$ are given in Table 1.

Proofs: We give detailed proofs in the supplemental material. Here, we provide the main technical aspects: For the Ncut objective, the derivation of the auxiliary function is based on the fact that, for positive semi-definite affinity matrix $\mathbf{W}$ , the Ncut objective is concave (Tang et al. 2019). Therefore, the first-order approximation at the current solution is an auxiliary function. For the prototype-based objectives, deriving an auxiliary function follows from the observation that the optimal parameters $\mathbf{c}_{k}$ , i.e., those that minimize the objective in closed-form, correspond to the sample means/medians within the clusters. These auxiliary functions correspond to the standard K-means and K-median procedures, which can be viewed as bound optimizers (Tang et al. 2019).
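For the K-means case, the unary potentials can be illustrated as squared distances to the current cluster means. The sketch below is an assumption based on the standard K-means bound described in the proof outline, not a transcription of the paper's table of potentials; the name `kmeans_potentials` is hypothetical, and each cluster is assumed to carry non-zero assignment mass.

```python
def kmeans_potentials(X, S):
    """Assumed K-means potentials: a[p][k] = ||x_p - c_k||^2, where c_k is the
    assignment-weighted mean of cluster k under the current solution S."""
    n, m, K = len(X), len(X[0]), len(S[0])
    a = [[0.0] * K for _ in range(n)]
    for k in range(K):
        mass = sum(S[p][k] for p in range(n))  # assumed > 0
        center = [
            sum(S[p][k] * X[p][d] for p in range(n)) / mass for d in range(m)
        ]
        for p in range(n):
            a[p][k] = sum((X[p][d] - center[d]) ** 2 for d in range(m))
    return a


# Three 1-D points; the first two are in cluster 0, the last in cluster 1.
X = [[0.0], [2.0], [10.0]]
S = [[1, 0], [1, 0], [0, 1]]
a = kmeans_potentials(X, S)
```

Here the cluster means are 1 and 10, so the first point has potentials 1 and 100 for the two clusters. Because the means are held fixed while the assignments are updated, the resulting bound is linear in ${\mathbf{S}}$, matching the general form above.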

Proposition 3 (Bound on the fair-clustering functional)

Given current clustering solution ${\mathbf{S}}^{i}$ , at iteration $i$ , and bringing back the trade-off parameter $\lambda$ , we have the following auxiliary function for the general fair-clustering objective ${\cal E}({\mathbf{S}})$ in Eq. (4):

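$$\mathcal{A}_{i}(\mathbf{S})=\sum_{p=1}^{N}\mathbf{s}_{p}^{t}(\mathbf{a}_{p}^{i}+\lambda\mathbf{b}_{p}^{i}+\log\mathbf{s}_{p}-\log\mathbf{s}_{p}^{i}) \tag{11}$$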

Proof: It is straightforward to check that the sum of auxiliary functions, each corresponding to a term in the objective, is also an auxiliary function of the overall objective.

Notice that, at each iteration, our auxiliary function in (11) is the sum of independent functions, each corresponding to a single data point $p$ . Therefore, our minimization problem in (4) can be tackled by optimizing each term over ${\mathbf{s}}_{p}$ , subject to the simplex constraint, and independently of the other terms, while guaranteeing convergence:

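$$\min_{\mathbf{s}_{p}\in\nabla_{K}}\;\mathbf{s}_{p}^{t}(\mathbf{a}_{p}^{i}+\lambda\mathbf{b}_{p}^{i}+\log\mathbf{s}_{p}-\log\mathbf{s}_{p}^{i}),\;\forall p \tag{12}$$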

Also, notice that, in our derived auxiliary function, we obtained a convex negative entropy barrier function ${\mathbf{s}}_{p}\log{\mathbf{s}}_{p}$, which comes from the convex part in our fairness penalty. This entropy term is very interesting as it completely avoids expensive projection steps and Lagrangian-dual inner iterations for the simplex constraint of each point. It yields closed-form updates for the dual variables of constraints $\mathbf{1}^{t}{\mathbf{s}}_{p}=1$ and restricts the domain of each ${\mathbf{s}}_{p}$ to non-negative values, avoiding extra dual variables for constraints ${\mathbf{s}}_{p}\geq 0$. Interestingly, entropy-based barriers are commonly used in Bregman-proximal optimization (Yuan et al. 2017), and have well-known computational benefits when handling difficult simplex constraints (Yuan et al. 2017). However, they are not very common in the general context of clustering.

The objective in (12) is the sum of convex functions with affine simplex constraints ${\mathbf{1}}^{t}{\mathbf{s}}_{p}=1$ . As strong duality holds for the convex objective and the affine simplex constraints, the solutions of the Karush-Kuhn-Tucker (KKT) conditions minimize globally the auxiliary function. The KKT conditions yield a closed-form solution for both primal variables ${\mathbf{s}}_{p}$ and the dual variables (Lagrange multipliers) corresponding to simplex constraints ${\mathbf{1}}^{t}{\mathbf{s}}_{p}=1$ .

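$$\mathbf{s}_{p}^{i+1}=\frac{\mathbf{s}_{p}^{i}\exp(-(\mathbf{a}_{p}^{i}+\lambda\mathbf{b}_{p}^{i}))}{\mathbf{1}^{t}[\mathbf{s}_{p}^{i}\exp(-(\mathbf{a}_{p}^{i}+\lambda\mathbf{b}_{p}^{i}))]}\;\forall p \tag{13}$$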
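The KKT step that leads to this closed-form solution can be sketched as follows. The shorthand $\mathbf{c}_{p}=\mathbf{a}_{p}^{i}+\lambda\mathbf{b}_{p}^{i}$ and the multiplier symbol $\gamma_{p}$ are notation introduced here for the derivation only:

```latex
\begin{aligned}
\mathcal{L}(\mathbf{s}_p,\gamma_p)
  &= \mathbf{s}_p^{t}\big(\mathbf{c}_p + \log\mathbf{s}_p - \log\mathbf{s}_p^{i}\big)
     + \gamma_p\,(\mathbf{1}^{t}\mathbf{s}_p - 1), \\
\frac{\partial\mathcal{L}}{\partial s_{p,k}}
  &= c_{p,k} + \log s_{p,k} + 1 - \log s_{p,k}^{i} + \gamma_p = 0
  \;\Longrightarrow\;
  s_{p,k} = s_{p,k}^{i}\,e^{-c_{p,k}}\,e^{-(1+\gamma_p)}, \\
\mathbf{1}^{t}\mathbf{s}_p = 1
  &\;\Longrightarrow\;
  e^{-(1+\gamma_p)} = \frac{1}{\sum_{k'} s_{p,k'}^{i}\,e^{-c_{p,k'}}}
  \;\Longrightarrow\;
  s_{p,k}^{i+1} = \frac{s_{p,k}^{i}\,e^{-c_{p,k}}}{\sum_{k'} s_{p,k'}^{i}\,e^{-c_{p,k'}}} .
\end{aligned}
```

The exponential form keeps every component positive, which is why no separate multiplier is needed for the non-negativity constraints.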

Notice that each closed-form update in (13) is within the simplex. We give the pseudo-code of the proposed fair-clustering in Algorithm 1. The algorithm can be used for any specific clustering objective, e.g., K-means or Ncut, among others, by providing the corresponding ${\mathbf{a}}^{i}_{p}$. The algorithm consists of an inner and an outer loop. The inner iterations update ${\mathbf{s}}_{p}^{i+1}$ using (13) until ${\cal A}_{i}({\mathbf{S}})$ does not change, with the clustering term ${\mathbf{a}}^{i}_{p}$ fixed from the outer loop. The outer iteration re-computes ${\mathbf{a}}^{i}_{p}$ from the updated ${\mathbf{s}}_{p}^{i+1}$. The time complexity of each inner iteration is $\mathcal{O}(NKJ)$. Also, the updates are independent for each data point $p$ and, thus, can be efficiently computed in parallel. In the outer iteration, the time complexity of updating ${\mathbf{a}}_{p}^{i}$ depends on the chosen clustering objective. For instance, for K-means, it is $\mathcal{O}(NKM)$, and, for Ncut, it is $\mathcal{O}(N^{2}K)$ for a full affinity matrix $\mathbf{W}$, or much less for a sparse affinity matrix. Note that ${\mathbf{a}}_{p}^{i}$ can be computed efficiently in parallel for all the clusters.
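The inner loop described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the fairness potentials `b` are taken to be the gradient of the fairness penalty with respect to $s_{p,k}$ scaled by $1/L$ (an assumed form, since the exact expression of Proposition 1 is not reproduced here), a fixed number of inner iterations replaces the convergence test on ${\cal A}_{i}$, and all function names are hypothetical.

```python
import math


def fairness_potentials(S, V, U, L=2.0):
    """Assumed form of b[p][k]: gradient of sum_j -mu_j log P(j|k) with respect
    to s_{p,k}, divided by L. Requires V_j^t S_k > 0 for all j, k."""
    N, K = len(S), len(S[0])
    cluster_mass = [sum(S[p][k] for p in range(N)) for k in range(K)]
    group_mass = [
        [sum(V_j[p] * S[p][k] for p in range(N)) for k in range(K)] for V_j in V
    ]
    return [
        [
            sum(
                mu * (1.0 / cluster_mass[k] - V[j][p] / group_mass[j][k])
                for j, mu in enumerate(U)
            ) / L
            for k in range(K)
        ]
        for p in range(N)
    ]


def update_assignments(S, a, b, lam):
    """Closed-form update of Eq. (13), applied independently to each point."""
    new_S = []
    for s_p, a_p, b_p in zip(S, a, b):
        logits = [
            (math.log(s) if s > 0 else -math.inf) - (a_k + lam * b_k)
            for s, a_k, b_k in zip(s_p, a_p, b_p)
        ]
        shift = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(z - shift) for z in logits]
        total = sum(weights)
        new_S.append([w / total for w in weights])
    return new_S


def inner_loop(S, a, V, U, lam, L=2.0, num_iters=10):
    """Repeat the update with the clustering potentials a held fixed."""
    for _ in range(num_iters):
        b = fairness_potentials(S, V, U, L)
        S = update_assignments(S, a, b, lam)
    return S


# Toy problem: 4 points, 2 clusters, 2 groups; a favours separating the groups.
S0 = [[0.5, 0.5] for _ in range(4)]
V = [[1, 1, 0, 0], [0, 0, 1, 1]]
U = [0.5, 0.5]
a = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
S_unfair = inner_loop(S0, a, V, U, lam=0.0)
S_fair = inner_loop(S0, a, V, U, lam=1.0)
```

Each row of the result stays on the simplex by construction, and the per-point loop in `update_assignments` has no cross-point dependencies, which is what makes the update easy to parallelize.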

Convergence and monotonicity guarantees: Our variational model belongs to the family of MM procedures, whose theoretical guarantees are well-studied in the literature, e.g., (Vaida 2005). In fact, the MM principle can be viewed as a generalization of well-known expectation-maximization (EM). Therefore, in general, MM algorithms inherit the monotonicity and convergence guarantees of EM algorithms, as detailed in the theoretical discussion in (Vaida 2005). Theorem 3 in (Vaida 2005) states a condition for convergence of the general MM procedure to a local minimum: The auxiliary function has a unique global minimum, which should be obtained at each iteration when solving (7). This condition is important to guarantee, for instance, the monotonicity in (8). Our formulation satisfies this condition. In our case, the auxiliary function in (11) is strictly convex, as it is the sum of linear terms and a strictly convex term (the negative entropy), and is optimized under affine simplex constraints. Therefore, at each iteration, the closed-form solutions we obtained in (13) correspond to the unique global minimum of auxiliary function ${\cal A}_{i}({\mathbf{S}})$ in (11). Our plots in Fig. 1 confirm the convergence and monotonicity of our general MM procedure for several fair-clustering objectives.

Exploring different trade-off levels via multiplier $\lambda$: Our variational multi-term formulation makes it possible to explore several levels of trade-off between the clustering and fairness objectives via multiplier parameter $\lambda$, unlike the existing fair-clustering methods. In practice, we run our algorithm in parallel for several values of $\lambda$ and choose the smallest value of $\lambda$ that satisfies a pre-defined level of fairness error, i.e., $\mathcal{D}_{\mbox{KL}}(U||P_{k})\leq\epsilon$. This is conceptually similar to standard penalty and augmented-Lagrangian approaches in constrained optimization, where the weights of the penalties are gradually increased, until reaching a certain pre-defined precision (or duality gap) for the constraints; see Chapter 17.1 in (Nocedal and Wright 2006). (In standard constrained optimization, penalties typically take a quadratic form, unlike our method, which is based on a KL divergence penalty.) The difference here is that we run independently for each $\lambda$, which can be implemented in parallel. As illustrated by the plots in Fig. 2, when $\lambda$ increases, the fairness error decreases and the clustering objective increases, which is intuitive. As discussed in more detail below (Tables 2, 3 and 4), our variational formulation can achieve small fairness errors (competitive with the existing state-of-the-art fair-clustering methods), but with much better clustering objectives, consistently across all the data sets.
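The selection rule for $\lambda$ can be illustrated with a small helper. In this sketch, `fairness_error` is a hypothetical callable standing in for a full run of the algorithm at a given $\lambda$ followed by evaluation of the fairness error; the toy error curve below is made up purely to exercise the function.

```python
def smallest_feasible_lambda(lambdas, fairness_error, epsilon):
    """Return the smallest lambda whose fairness error is <= epsilon,
    or None if no candidate reaches the tolerance."""
    feasible = [lam for lam in lambdas if fairness_error(lam) <= epsilon]
    return min(feasible) if feasible else None


def toy_error(lam):
    """Made-up, monotonically decreasing error curve for illustration only."""
    return 1.0 / lam


chosen = smallest_feasible_lambda([1, 10, 100, 1000], toy_error, epsilon=0.05)
none_found = smallest_feasible_lambda([1, 10], toy_error, epsilon=0.05)
```

Because each candidate is evaluated independently, the calls to `fairness_error` could be dispatched to separate workers, in line with the parallel runs described above.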

Experiments

In this section, we present comprehensive empirical evaluations of the proposed fair-clustering algorithm, along with comparisons with state-of-the-art fair-clustering techniques. We choose three well-known clustering objectives: K-means, K-median and Normalized cut (Ncut), and integrate our fairness-penalty bound with the corresponding clustering bounds ${\mathbf{a}}_{p}$ (see Table 1). We refer to our bound-optimization versions as: Fair K-means, Fair K-median and Fair Ncut. Note that our formulation can be used for other clustering objectives (if a bound could be derived for the objective).

We investigate the effect of fairness on the original discrete (i.e., w.r.t. binary assignment variables) clustering objectives, and compare with the existing methods. We evaluate the results in terms of the balance of each cluster $S_{k}$ in (1), and define the overall balance of the clustering as $\mbox{balance}=\min_{S_{k}}\mbox{balance}(S_{k})$. We further propose to evaluate the fairness error, which is the KL divergence $\mathcal{D}_{\mbox{KL}}(U||P_{k})$ in (2). This KL measure becomes equal to zero when the proportions of the demographic groups within all the output clusters match the target distribution. For Ncut, we use a $20$-nearest-neighbor affinity matrix $\mathbf{W}$: $w({\mathbf{x}}_{p},{\mathbf{x}}_{q})=1$ if data point ${\mathbf{x}}_{q}$ is within the $20$ nearest neighbors of ${\mathbf{x}}_{p}$, and equal to $0$ otherwise. In all the experiments, we fixed $L=2$ and found that this does not increase the objective (see the detailed explanation in the supplemental material). We standardize each dataset so that each feature attribute has zero mean and unit variance. We then performed L2-normalization of the features, and used the standard K-means++ (Arthur and Vassilvitskii 2007) to generate initial partitions for all the models.
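The preprocessing and affinity construction described above can be sketched with the standard library alone. The function names are illustrative; using the population standard deviation, leaving constant features at zero, and breaking distance ties by index are assumptions of this sketch rather than details given in the text.

```python
import math


def standardize(X):
    """Give each feature zero mean and unit variance (population std assumed)."""
    n, m = len(X), len(X[0])
    means = [sum(row[d] for row in X) / n for d in range(m)]
    stds = [
        math.sqrt(sum((row[d] - means[d]) ** 2 for row in X) / n)
        for d in range(m)
    ]
    return [
        [(row[d] - means[d]) / stds[d] if stds[d] > 0 else 0.0 for d in range(m)]
        for row in X
    ]


def l2_normalize(X):
    """Scale each sample to unit Euclidean norm (zero rows are left as-is)."""
    out = []
    for row in X:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row] if norm > 0 else list(row))
    return out


def knn_affinity(X, k):
    """W[p][q] = 1 if x_q is among the k nearest neighbours of x_p, else 0."""
    n = len(X)
    W = [[0] * n for _ in range(n)]
    for p in range(n):
        dists = sorted((math.dist(X[p], X[q]), q) for q in range(n) if q != p)
        for _, q in dists[:k]:
            W[p][q] = 1
    return W


features = l2_normalize(standardize([[1.0, 10.0], [3.0, 30.0]]))
W = knn_affinity([[0.0], [1.0], [10.0]], k=1)
```

Note that the affinity defined this way is not necessarily symmetric: in the toy example, the third point lists the second as its nearest neighbour but not the other way round. The brute-force neighbour search is quadratic in $N$ and is only meant for small illustrations.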

Datasets

Synthetic datasets. We created two types of synthetic datasets according to the proportions of the demographics, each having two clusters and a total of $400$ data points in 2D features (figures in the supplemental material). The Synthetic dataset contains two perfectly balanced demographic groups, each having an equal number of $200$ points. For this data set, we imposed target proportions $U=[0.5,0.5]$. To experiment with our fairness penalty with unequal proportions, we also used the Synthetic-unequal dataset, with 300 and 100 points within the two demographic groups, respectively. In this case, we imposed target proportions $U=[0.75,0.25]$.

Real datasets. We use three datasets from the UCI machine learning repository (Dua and Graff 2017), one large-scale data set whose demographics are balanced (Census), along with two other data sets with various demographic proportions:

Bank (https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) contains $41188$ records of direct marketing campaigns of a Portuguese banking institution, one for each client contacted (Moro, Cortez, and Rita 2014). Note that the previous fair clustering methods (Bera et al. 2019; Backurs et al. 2019) used a much smaller version of the Bank dataset, with only $4520$ records, $J=2$ and $3$ attributes. Instead, we utilize the marital status as the sensitive attribute, which contains three groups ($J=3$): single, married and divorced. We removed the records with "Unknown" marital status. Thus, we have $41,108$ records in total. We chose $6$ numeric attributes (age, duration, 3-month Euribor rate, number of employees, consumer price index and number of contacts performed during the campaign) as features. We set the number of clusters to $K=10$, and impose the target proportions of the three groups $U=[0.28,0.61,0.11]$ within each cluster.

Adult (https://archive.is.uci/ml/datasets/adult) is a US census record data set from 1994. The dataset contains $32,561$ records. We used the gender status as the sensitive attribute, which contains $10771$ females and $21790$ males. We chose the $5$ numeric attributes as features, set the number of clusters to $K=10$, and impose proportions $U=[0.33,0.67]$ within each cluster.

Census (https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)) is a large-scale data set of US census records from 1990. The dataset contains $2,458,285$ records. We used the gender status as the sensitive attribute, which contains $1,191,601$ females and $1,266,684$ males. We chose the $25$ numeric attributes as features, similarly to (Backurs et al. 2019). We set the number of clusters to $K=20$, and imposed proportions $U=[0.48,0.52]$ within each cluster.

Results

In this section, we discuss the results of the different experiments to evaluate the proposed general variational framework for Fair K-means, Fair K-median and Fair Ncut. We further report comparisons with (Bera et al. 2019), (Backurs et al. 2019) and (Kleindessner et al. 2019) in terms of discrete fairness measures and clustering objectives.

Trade-off between clustering and fairness objectives: We assess the effect of imposing fairness constraints on the original clustering objectives. In each plot in Fig. 2, the blue curve depicts the discrete-valued clustering objective ${\cal F}({\mathbf{S}})$ (K-means or Ncut) obtained at convergence as a function of $\lambda$. The fairness error is depicted in red. Observe that, when multiplier $\lambda$ increases (starting from a certain value), the discrete clustering objective increases while the fairness error decreases, which is intuitive. Also, the fairness error approaches $0$ when $\lambda\rightarrow+\infty$, and both the clustering and fairness objectives tend to reach a plateau starting from a certain value of $\lambda$. The scalability of our model is highly relevant because it enables us to explore several solutions, each corresponding to a different value of multiplier $\lambda$, and to choose the smallest $\lambda$ (i.e., the best clustering objective) that satisfies a pre-defined fairness level $\mathcal{D}_{\mbox{KL}}(U||P_{k})\leq\epsilon$. As detailed below, this flexibility enabled us to obtain better solutions, in terms of fairness and clustering objectives, than several recent fair-clustering methods. Low fairness errors are typically achieved with large values of $\lambda$. This is due to the fact that the scale of the fairness penalty could be much smaller than that of the clustering objectives. Notice that, for relatively small values of $\lambda$, the K-means objective (blue curve) for the Adult dataset has an oscillating behaviour. This might be due to the fact that, for small $\lambda$, the K-means objective dominates the KL fairness term. However, after a certain value of $\lambda$ ($\lambda\geq 4000$), the curves become smooth, with a predictable behaviour (i.e., the fairness term decreases and the clustering term increases).
When the clustering objective dominates, the oscillating behaviour might be due to the local minima of bound optimization for the K-means term. We hypothesize that, with higher values of $\lambda$ , the KL fairness term “convexifies” the function, and facilitates optimization. With smaller values of $\lambda$ , the K-means term dominates, with possibilities of being stuck in local minima (K-means is well-known to be sensitive to the initial conditions).
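The selection rule described above (try several values of $\lambda$ in ascending order and keep the smallest one whose fairness error is at most $\epsilon$) can be sketched as follows. The callable `run_fair_clustering` is a hypothetical stand-in for one full run of the algorithm at a given $\lambda$, returning the discrete clustering objective and the fairness error; `toy_run` only mimics the qualitative trend reported in Fig. 2 and is not real data.

```python
def select_lambda(run_fair_clustering, lambdas, eps):
    """Return (lambda, objective, fairness_error) for the smallest lambda
    whose fairness error is at most eps, or None if no value qualifies."""
    for lam in sorted(lambdas):
        objective, fairness_error = run_fair_clustering(lam)
        if fairness_error <= eps:
            return lam, objective, fairness_error
    return None


def toy_run(lam):
    # Hypothetical stand-in: the objective grows and the fairness error
    # shrinks as lambda increases.
    return 100.0 + 0.01 * lam, 1.0 / (1.0 + lam)


choice = select_lambda(toy_run, [1000, 0, 100, 10], eps=0.05)
no_choice = select_lambda(toy_run, [0, 1], eps=0.05)
```

Because the runs for different $\lambda$ are independent, they could also be launched in parallel and filtered afterwards; the sequential loop is kept here only for brevity.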

Clustering cost with respect to $K$: Fig. 3 depicts how the clustering objectives behave w.r.t. the number of clusters $K$, with and without the fairness constraints. We plot the discrete clustering objective vs. $K$ for K-means, K-median and Ncut, using the Bank dataset, with each plot corresponding to a fixed multiplier $\lambda$. In both cases (i.e., with and without the fairness constraints), the obtained clustering objectives decrease with $K$, with the gap between the clustering objective obtained under fairness constraints and the vanilla clustering increasing with $K$. These experimental observations are consistent with the observations in (Bera et al. 2019).

Comparisons to state-of-the-art methods: Our algorithm is flexible as it can be used in conjunction with different well-known clustering objectives. This enabled us to compare our Fair K-median, Fair K-means and Fair Ncut versions to (Backurs et al. 2019), (Bera et al. 2019) and (Kleindessner et al. 2019), respectively. Tables 2, 3 and 4 report comparisons in terms of the original clustering objectives, achieved minimum balances and fairness errors, for Fair K-median, Fair K-means and Fair Ncut. For our model, we run the algorithm for several values of $\lambda$ in ascending order, and choose the smallest $\lambda$ that satisfies a pre-defined level of fairness error. This flexibility and scalability enabled us to obtain significantly better clustering objectives and fairness/minimum-balance measures in comparison to (Backurs et al. 2019); see Table 2. It is worth noting that, for the Bank dataset, we were unable to run (Backurs et al. 2019) as the number of demographic groups is $3$ (i.e., $J>2$). In comparison to (Bera et al. 2019), our variational method achieves significantly better K-means clustering objectives, with approximately the same fairness levels. Note that we can obtain better fairness with larger $\lambda$ values. These results highlight the benefits of our proposed variational formulation, which provides control over the trade-off between the fairness level and clustering objective. In the case of Fair Ncut, (Kleindessner et al. 2019) achieved slightly better Ncut objectives than our model, while achieving similar fairness levels. However, we were unable to run the spectral solution of (Kleindessner et al. 2019) for the large-scale Census II data set, and for Bank, due to its computational and memory load (as it requires computing the eigenvalues of the square affinity matrix).

Our algorithm handles more than two demographic groups, i.e., $J>2$ (e.g., Bank), unlike many of the existing approaches. Furthermore, for Ncut graph clustering, our bound optimizer can deal with large-scale data sets, unlike (Kleindessner et al. 2019), which requires the eigendecomposition of large affinity matrices. Finally, the parallel structure of our algorithm within each iteration (i.e., independent updates for each assignment variable) enables us to explore different values of $\lambda$, and thereby to choose the best trade-off between the clustering objective and fairness error.

Broader Impact: This paper deals with ensuring fairness criteria in clustering decisions, so as to avoid unfair treatment of minority groups pertaining to a sensitive attribute such as race, gender, etc. The paper is an endeavor to present a flexible mechanism, so as to relatively control the required fairness, while ensuring clustering quality at the same time.

Appendix A Proof of Proposition 1

We present a detailed proof of Proposition 1 (Bound on fairness) in the paper. Recall that, in the paper, we wrote the fairness clustering problem in the following form:

[TABLE]

The proposition for the bound on the fairness penalty states the following: Given current clustering solution ${\mathbf{S}}^{i}$ at iteration $i$ , we have the following tight upper bound (auxiliary function) on the fairness term in (A), up to additive and multiplicative constants, and for current solutions in which each demographic is represented by at least one point in each cluster (i.e., $V_{j}^{t}S_{k}^{i}\geq 1\,\forall\,j,k$ ):

[TABLE]

where $L$ is some positive Lipschitz-gradient constant verifying $L\leq N$.

Proof: We can expand each term in the fairness penalty in (A), and write it as the sum of two functions, one is convex and the other is concave:

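$$-\mu_{j}\log\frac{V_{j}^{t}S_{k}}{{\mathbf{1}}^{t}S_{k}}=\underbrace{\mu_{j}\log({\mathbf{1}}^{t}S_{k})}_{g_{1}(S_{k})}+\underbrace{-\mu_{j}\log(V_{j}^{t}S_{k})}_{g_{2}(S_{k})}$$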

Let us represent the $N\times K$ matrix ${\mathbf{S}}=\{S_{1},\dots,S_{K}\}$ in its equivalent vector form ${\mathbf{S}}=[{\mathbf{s}}_{1},\dots,{\mathbf{s}}_{N}]\in[0,1]^{NK}$ , where ${\mathbf{s}}_{p}=[s_{p,1},\dots,s_{p,K}]\in[0,1]^{K}$ is the probability simplex assignment vector for point $p$ . As we shall see later, this equivalent simplex-variable representation will be convenient for deriving our bound.

Bound on $\tilde{g}_{1}({\mathbf{S}})=\sum_{k}g_{1}(S_{k})$ :

For concave part $g_{1}$ , we can get a tight upper bound (auxiliary function) by its first-order approximation at current solution $S_{k}^{i}$ :

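$$g_{1}(S_{k})\leq g_{1}(S_{k}^{i})+[\nabla g_{1}(S_{k}^{i})]^{t}(S_{k}-S_{k}^{i})=[\nabla g_{1}(S_{k}^{i})]^{t}S_{k}+const\tag{15}$$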

where gradient vector $\nabla g_{1}(S_{k}^{i})=\frac{\mu_{j}}{{\mathbf{1}}^{t}S_{k}^{i}}{\mathbf{1}}$ and $const$ is the sum of all the constant terms. Now consider $N\times K$ matrix ${\mathbf{T}}_{1}=\{\nabla g_{1}(S_{1}^{i}),\dots,\nabla g_{1}(S_{K}^{i})\}$ and its equivalent vector representation ${\mathbf{T}}_{1}=[{\mathbf{t}}^{1}_{1},\dots,{\mathbf{t}}^{N}_{1}]\in{\mathbb{R}}^{NK}$, which concatenates rows ${\mathbf{t}}^{p}_{1}\in{\mathbb{R}}^{K}$, $p\in\{1,\dots,N\}$, of the $N\times K$ matrix into a single $NK$-dimensional vector. Summing the bounds in (15) over $k\in\{1,\dots,K\}$ and using the $NK$-dimensional vector representation of both ${\mathbf{S}}$ and ${\mathbf{T}}_{1}$, we get:

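$$\tilde{g}_{1}({\mathbf{S}})\leq{\mathbf{S}}^{t}{\mathbf{T}}_{1}+const=\sum_{p=1}^{N}{\mathbf{s}}_{p}^{t}{\mathbf{t}}_{1}^{p}+const$$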
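The first-order bound on the concave part can be checked numerically. The sketch below assumes $g_{1}(S_{k})=\mu_{j}\log({\mathbf{1}}^{t}S_{k})$, which is the form implied by the gradient $\frac{\mu_{j}}{{\mathbf{1}}^{t}S_{k}^{i}}{\mathbf{1}}$ stated above, and uses random soft assignments for a single cluster. It illustrates the tangent-line argument only and is not the authors' code.

```python
import math
import random


def g1(s_k, mu_j):
    """Concave part mu_j * log(1^t S_k) for one cluster and one group."""
    return mu_j * math.log(sum(s_k))


def g1_tangent_bound(s_k, s_cur, mu_j):
    """First-order approximation of g1 at the current solution s_cur."""
    slope = mu_j / sum(s_cur)  # every component of the gradient has this value
    return g1(s_cur, mu_j) + slope * (sum(s_k) - sum(s_cur))


rng = random.Random(0)
s_cur = [rng.uniform(0.05, 1.0) for _ in range(20)]
gaps = []
for _ in range(200):
    s_new = [rng.uniform(0.05, 1.0) for _ in range(20)]
    gaps.append(g1_tangent_bound(s_new, s_cur, 0.5) - g1(s_new, 0.5))
tight_gap = g1_tangent_bound(s_cur, s_cur, 0.5) - g1(s_cur, 0.5)
```

Every gap is non-negative (the tangent lies above the concave function), and the gap vanishes at the current solution, which is what makes the bound tight.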

Bound on $\tilde{g}_{2}({\mathbf{S}})=\sum_{k}g_{2}(S_{k})$ :

For convex part $g_{2}$ , a quadratic upper bound can be found by using Lemma 1 and Definition 1 (both detailed at the end of the document):

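$$g_{2}(S_{k})\leq g_{2}(S_{k}^{i})+[\nabla g_{2}(S_{k}^{i})]^{t}(S_{k}-S_{k}^{i})+L\|S_{k}-S_{k}^{i}\|^{2}\tag{17}$$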

where gradient vector $\nabla g_{2}(S_{k}^{i})=-\frac{\mu_{j}V_{j}}{V_{j}^{t}S_{k}^{i}}\in\mathbb{R}^{N}$ and $L$ is a valid Lipschitz constant for the gradient of $g_{2}$. Similarly to earlier, consider $N\times K$ matrix ${\mathbf{T}}_{2}=\{\nabla g_{2}(S_{1}^{i}),\dots,\nabla g_{2}(S_{K}^{i})\}$ and its equivalent vector representation ${\mathbf{T}}_{2}=[{\mathbf{t}}^{1}_{2},\dots,{\mathbf{t}}^{N}_{2}]\in{\mathbb{R}}^{NK}$. Using these equivalent vector representations for matrices ${\mathbf{T}}_{2}$, ${\mathbf{S}}$ and ${\mathbf{S}}^{i}$, and summing the bounds in (17) over $k$, we get:

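$$\tilde{g}_{2}({\mathbf{S}})\leq{\mathbf{S}}^{t}{\mathbf{T}}_{2}+L\|{\mathbf{S}}-{\mathbf{S}}^{i}\|^{2}+const$$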

In our case, the Lipschitz constant is $L=\sigma_{max}$, where $\sigma_{max}$ is the maximum eigenvalue of the Hessian matrix:

$$\nabla^{2}g_{2}(S_{k}^{i})=\frac{\mu_{j}}{(V_{j}^{t}S_{k}^{i})^{2}}V_{j}V_{j}^{t}$$

Note that, $\|{\mathbf{S}}-{\mathbf{S}}^{i}\|^{2}$ is defined over the simplex variable of each data point ${\mathbf{s}}_{p}$ . Thus, we can utilize Lemma 2 (Pinsker inequality), which yields the following bound on $\tilde{g}_{2}({\mathbf{S}})$ (Lemma 2 and its proof are detailed below):

[TABLE]

**Total bound on the Fairness term:**

By taking into account the sum over all the demographics $j$ and combining the bounds for $\tilde{g}_{1}({\mathbf{S}})$ and $\tilde{g}_{2}({\mathbf{S}})$, we get the following bound for the fairness term:

[TABLE]

Note that for current solutions in which each demographic is represented by at least one point in each cluster (i.e., $V_{j}^{t}S_{k}^{i}\geq 1\,\forall\,j,k$), the maximum eigenvalue of the Hessian $\nabla^{2}(g_{2}(S_{k}^{i}))$ is bounded by $N$, which means $L\leq N$. Note that, in our case, the term $\frac{\mu_{j}}{(V_{j}^{t}S_{k}^{i})^{2}}$ in the Hessian is typically much smaller than $1$. Therefore, in practice, setting a suitable positive $L\ll N$ does not increase the objective.
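The bound $L\leq N$ can be illustrated on a small example. Differentiating the gradient $-\frac{\mu_{j}V_{j}}{V_{j}^{t}S_{k}^{i}}$ once more gives a rank-one Hessian proportional to $V_{j}V_{j}^{t}$, whose largest eigenvalue is $\mu_{j}\|V_{j}\|^{2}/(V_{j}^{t}S_{k}^{i})^{2}$. The sketch below compares this closed form with a plain power iteration; the toy indicator vector and soft assignments are assumptions chosen so that $V_{j}^{t}S_{k}^{i}\geq 1$.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hessian_g2(v_j, s_k, mu_j):
    """Hessian of g2(S_k) = -mu_j * log(V_j^t S_k), a rank-one matrix."""
    scale = mu_j / dot(v_j, s_k) ** 2
    return [[scale * a * b for b in v_j] for a in v_j]


def max_eigenvalue(matrix, iters=100):
    """Power iteration; adequate for this small positive semi-definite matrix."""
    x = [1.0] * len(matrix)
    for _ in range(iters):
        y = [dot(row, x) for row in matrix]
        norm = math.sqrt(dot(y, y))
        x = [t / norm for t in y]
    return dot(x, [dot(row, x) for row in matrix])


def closed_form_sigma_max(v_j, s_k, mu_j):
    """Largest eigenvalue of the rank-one Hessian."""
    return mu_j * dot(v_j, v_j) / dot(v_j, s_k) ** 2


v_j = [1, 1, 1, 1, 0, 0, 0, 0]   # indicator of group j among N = 8 points
s_k = [0.5] * 8                  # soft assignments to cluster k (V_j^t S_k = 2)
sigma_power = max_eigenvalue(hessian_g2(v_j, s_k, 0.5))
sigma_closed = closed_form_sigma_max(v_j, s_k, 0.5)
```

Here both computations give $0.5$, well below $N=8$, in line with the remark that a constant much smaller than $N$ is usually sufficient.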

Definition 1

A convex function $f$ defined over a convex set $\Omega\subseteq{\mathbb{R}}^{l}$ is L-smooth if the gradient of $f$ is Lipschitz (with a Lipschitz constant $L>0$): $\|\nabla f({\mathbf{x}})-\nabla f({\mathbf{y}})\|\leq L\|{\mathbf{x}}-{\mathbf{y}}\|$ for all ${\mathbf{x}},{\mathbf{y}}\in\Omega$. Equivalently, there exists a strictly positive $L$ such that the Hessian of $f$ verifies: $\nabla^{2}f({\mathbf{x}})\preceq L{\mathbf{I}}$ where ${\mathbf{I}}$ is the identity matrix.

Remark 1

Let $\sigma_{max}(f)$ denote the maximum eigenvalue of $\nabla^{2}f({\mathbf{x}})$. Then $\sigma_{max}(f)$ is a valid Lipschitz constant for the gradient of $f$ because $\nabla^{2}f({\mathbf{x}})\preceq\sigma_{max}(f){\mathbf{I}}$.

Lipschitz gradient implies the following bound on $f({\mathbf{x}})$. (Footnote: This implies that the distance between $f({\mathbf{x}})$ and its first-order Taylor approximation at ${\mathbf{y}}$ is between $0$ and $L\|{\mathbf{x}}-{\mathbf{y}}\|^{2}$. Such a distance is the Bregman divergence with respect to the $l_{2}$ norm.)

Lemma 1 (Quadratic upper bound)

If $f$ is L-smooth, then we have the following quadratic upper bound:

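$$f({\mathbf{x}})\leq f({\mathbf{y}})+[\nabla f({\mathbf{y}})]^{t}({\mathbf{x}}-{\mathbf{y}})+L\|{\mathbf{x}}-{\mathbf{y}}\|^{2}$$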

Proof: The proof of this lemma is straightforward. It suffices to start from convexity condition $f({\mathbf{y}})\geq f({\mathbf{x}})+[\nabla f({\mathbf{x}})]^{t}({\mathbf{y}}-{\mathbf{x}})$ and use Cauchy-Schwarz inequality and the Lipschitz gradient condition:

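$$f({\mathbf{x}})\leq f({\mathbf{y}})+[\nabla f({\mathbf{x}})]^{t}({\mathbf{x}}-{\mathbf{y}})=f({\mathbf{y}})+[\nabla f({\mathbf{y}})]^{t}({\mathbf{x}}-{\mathbf{y}})+[\nabla f({\mathbf{x}})-\nabla f({\mathbf{y}})]^{t}({\mathbf{x}}-{\mathbf{y}})\leq f({\mathbf{y}})+[\nabla f({\mathbf{y}})]^{t}({\mathbf{x}}-{\mathbf{y}})+L\|{\mathbf{x}}-{\mathbf{y}}\|^{2}$$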
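As a sanity check of the quadratic upper bound, the sketch below takes the scalar function $f(x)=-\log x$ on $[1,+\infty)$, which mirrors the convex part $g_{2}$ under the condition $V_{j}^{t}S_{k}^{i}\geq 1$. Its second derivative $1/x^{2}$ is at most $1$ on that domain, so $L=1$ is a valid constant. The check verifies that $f(x)$ minus its first-order Taylor approximation at $y$ stays between $0$ and $L(x-y)^{2}$; the choice of function and sampling range are assumptions made for illustration.

```python
import math
import random


def f(x):
    return -math.log(x)


def grad_f(x):
    return -1.0 / x


def quadratic_upper_bound(x, y, lipschitz):
    """f(y) + f'(y) (x - y) + L (x - y)^2."""
    return f(y) + grad_f(y) * (x - y) + lipschitz * (x - y) ** 2


rng = random.Random(0)
L = 1.0  # f''(x) = 1 / x**2 <= 1 on [1, +inf)
gaps = []
for _ in range(1000):
    x, y = rng.uniform(1.0, 50.0), rng.uniform(1.0, 50.0)
    taylor = f(y) + grad_f(y) * (x - y)
    # (convexity gap, slack of the quadratic upper bound)
    gaps.append((f(x) - taylor, quadratic_upper_bound(x, y, L) - f(x)))
```

Both quantities are non-negative for every sampled pair: the first by convexity, the second by the Lipschitz-gradient condition.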

Lemma 2 (Pinsker inequality)

For any ${\mathbf{x}}$ and ${\mathbf{y}}$ belonging to the $K$ -dimensional probability simplex ${\cal S}=\{{\mathbf{x}}\in[0,1]^{K}\;|\;{\mathbf{1}}^{t}{\mathbf{x}}=1\}$ , we have the following inequality:

$$\mathcal{D}_{\mbox{KL}}({\mathbf{x}}||{\mathbf{y}})\geq\frac{1}{2}\|{\mathbf{x}}-{\mathbf{y}}\|^{2}$$

where $\mathcal{D}_{\mbox{KL}}$ is the Kullback-Leibler divergence:

$$\mathcal{D}_{\mbox{KL}}({\mathbf{x}}||{\mathbf{y}})=\sum_{k}x_{k}\log\frac{x_{k}}{y_{k}}$$

Proof: Let $q_{{\mathbf{o}}}({\mathbf{x}})=\mathcal{D}_{\mbox{KL}}({\mathbf{x}}||{\mathbf{o}})$. The Hessian of $q_{{\mathbf{o}}}$ is a diagonal matrix whose diagonal elements are given by $\frac{1}{x_{k}},k=1,\dots,K$. Now because ${\mathbf{x}}\in{\cal S}$, we have $\frac{1}{x_{k}}\geq 1\quad\forall k$. Therefore, $q_{{\mathbf{o}}}$ is $1$-strongly convex: $\nabla^{2}q_{{\mathbf{o}}}({\mathbf{x}})\succeq{\mathbf{I}}$. This is equivalent to:

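$$q_{{\mathbf{o}}}({\mathbf{x}})\geq q_{{\mathbf{o}}}({\mathbf{y}})+[\nabla q_{{\mathbf{o}}}({\mathbf{y}})]^{t}({\mathbf{x}}-{\mathbf{y}})+\frac{1}{2}\|{\mathbf{x}}-{\mathbf{y}}\|^{2}\tag{24}$$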

The gradient of $q_{{\mathbf{o}}}$ is given by:

$$\nabla q_{{\mathbf{o}}}({\mathbf{x}})={\mathbf{1}}+\log{\mathbf{x}}-\log{\mathbf{o}}$$

Applying this expression to ${\mathbf{o}}={\mathbf{y}}$ , notice that $\nabla q_{{\mathbf{o}}}({\mathbf{y}})={\mathbf{1}}$ . Using these in expression (24) for ${\mathbf{o}}={\mathbf{y}}$ , we get:

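$$\mathcal{D}_{\mbox{KL}}({\mathbf{x}}||{\mathbf{y}})\geq{\mathbf{1}}^{t}({\mathbf{x}}-{\mathbf{y}})+\frac{1}{2}\|{\mathbf{x}}-{\mathbf{y}}\|^{2}$$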

Now, because ${\mathbf{x}}$ and ${\mathbf{y}}$ are in ${\cal S}$ , we have ${\mathbf{1}}^{t}({\mathbf{x}}-{\mathbf{y}})=\sum_{k}{\mathbf{x}}_{k}-\sum_{k}{\mathbf{y}}_{k}=1-1=0$ . This yields the result in Lemma 2.
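The inequality that follows from the $1$-strong convexity argument, $\mathcal{D}_{\mbox{KL}}({\mathbf{x}}||{\mathbf{y}})\geq\frac{1}{2}\|{\mathbf{x}}-{\mathbf{y}}\|^{2}$ for points of the probability simplex, can be spot-checked on random samples. The sampling scheme below (normalized positive weights) is an assumption made for illustration and is not part of the proof.

```python
import math
import random


def kl_divergence(x, y):
    """Kullback-Leibler divergence between two points of the simplex."""
    return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0.0)


def squared_l2(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def random_simplex_point(dim, rng):
    """Strictly positive weights normalized to sum to one."""
    weights = [rng.random() + 1e-3 for _ in range(dim)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
slacks = []
for _ in range(1000):
    x = random_simplex_point(5, rng)
    y = random_simplex_point(5, rng)
    slacks.append(kl_divergence(x, y) - 0.5 * squared_l2(x, y))
```

All slacks are non-negative, so the KL term can replace the squared distance in the quadratic bound, which is what yields an entropy-like term per data point.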

Appendix B Proof of Proposition 2

Here we present the proof of Proposition 2 [Bound on the clustering objective]: Given current clustering solution ${\mathbf{S}}^{i}$ at iteration $i$ , the auxiliary functions for several popular clustering objectives ${\cal F}({\mathbf{S}})$ take the following general form:

[TABLE]

where point-wise (unary) potentials ${\mathbf{a}}_{p}^{i}$ are given in Table 1 in the paper.

For K-means, and for each cluster $S_{k}$ , the sample mean of the cluster is the closed-form global optimum of summation $\sum_{p}s_{p,k}({\mathbf{x}}_{p}-\mathbf{c}_{k})^{2}$ , i.e.,

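$$\mathbf{c}_{k}=\frac{\sum_{p}s_{p,k}{\mathbf{x}}_{p}}{\sum_{p}s_{p,k}}=\arg\min_{\mathbf{y}}\sum_{p}s_{p,k}({\mathbf{x}}_{p}-\mathbf{y})^{2}$$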

Therefore, for any $\mathbf{y}$ , we have the following inequality:

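$$\sum_{p}s_{p,k}({\mathbf{x}}_{p}-\mathbf{c}_{k})^{2}\leq\sum_{p}s_{p,k}({\mathbf{x}}_{p}-\mathbf{y})^{2}$$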

Thus, by setting $\mathbf{y}=\mathbf{c}_{k}^{i}$ at current iteration $i$ , we get the following auxiliary function for K-means:

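$$\sum_{p}s_{p,k}({\mathbf{x}}_{p}-\mathbf{c}_{k})^{2}\leq\sum_{p}s_{p,k}({\mathbf{x}}_{p}-\mathbf{c}_{k}^{i})^{2}$$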

Similarly, we can show the same for K-median:

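$$\sum_{p}s_{p,k}\,\mathtt{d}({\mathbf{x}}_{p}-\mathbf{c}_{k})\leq\sum_{p}s_{p,k}\,\mathtt{d}({\mathbf{x}}_{p}-\mathbf{c}_{k}^{i})$$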

Thus, considering all the clusters, we write the bound on the clustering objectives in the following simplified form:

$${\cal F}({\mathbf{S}})\leq\sum_{p=1}^{N}{\mathbf{s}}_{p}^{t}{\mathbf{a}}_{p}^{i}$$

where ${\mathbf{s}}_{p}=[s_{p,1},\ldots,s_{p,K}]$ and ${\mathbf{a}}_{p}^{i}=[a_{p,1}^{i},\ldots,a_{p,K}^{i}]$. In the case of K-means, we have: $a_{p,k}^{i}=({\mathbf{x}}_{p}-\mathbf{c}_{k}^{i})^{2}$, and for K-median, we have: $a_{p,k}^{i}=\mathtt{d}({\mathbf{x}}_{p}-\mathbf{c}_{k}^{i})$.
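For K-means, the unary potentials and the auxiliary-function property can be sketched on a toy example. The code assumes soft assignments stored row-wise (one simplex vector per point) and squared Euclidean distances; the data points and helper names are hypothetical. It checks that fixing the centers at their current values $\mathbf{c}_{k}^{i}$ can only increase the weighted cost relative to using the optimal weighted means.

```python
def soft_means(X, S):
    """Weighted mean of each cluster under soft assignments S."""
    num_clusters, dim = len(S[0]), len(X[0])
    means = []
    for k in range(num_clusters):
        weight = sum(s_p[k] for s_p in S)
        means.append([sum(s_p[k] * x_p[d] for x_p, s_p in zip(X, S)) / weight
                      for d in range(dim)])
    return means


def kmeans_potentials(X, centers):
    """a_{p,k}: squared Euclidean distance from point p to center k."""
    return [[sum((xd - cd) ** 2 for xd, cd in zip(x_p, c_k)) for c_k in centers]
            for x_p in X]


def weighted_cost(S, potentials):
    """sum_p s_p^t a_p."""
    return sum(s * a for s_p, a_p in zip(S, potentials) for s, a in zip(s_p, a_p))


X = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
S_cur = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # current solution
S_new = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]   # candidate solution

centers_cur = soft_means(X, S_cur)
A_cur = kmeans_potentials(X, centers_cur)                   # potentials a_p^i
bound_value = weighted_cost(S_new, A_cur)                   # auxiliary function
true_value = weighted_cost(S_new, kmeans_potentials(X, soft_means(X, S_new)))
```

The bound is linear in each ${\mathbf{s}}_{p}$ once the centers are frozen, which is what allows the assignment variables to be updated independently.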

For Ncut, the objective is:

$${\cal F}({\mathbf{S}})=K-\sum_{k}\frac{S_{k}^{t}\mathbf{W}S_{k}}{\mathbf{d}^{t}S_{k}}$$

Note that, discarding the constant number of clusters $K$ , and assuming $\mathbf{W}$ is a positive semi-definite (p.s.d) affinity matrix, one can show that the objective is concave. Therefore, the first-order approximation gives the following linear upper bound for concave ${\cal F}({\mathbf{S}})$ at current iteration $i$ :

$${\cal F}({\mathbf{S}})\leq\sum_{p=1}^{N}{\mathbf{s}}_{p}^{t}{\mathbf{a}}_{p}^{i}+const$$

with $a_{p,k}^{i}=d_{p}\frac{(S_{k}^{i})^{t}\mathbf{W}S_{k}^{i}}{(\mathbf{d}^{t}S_{k}^{i})^{2}}-\frac{2\sum_{q}w({\mathbf{x}}_{p},{\mathbf{x}}_{q})s_{q,k}^{i}}{\mathbf{d}^{t}S_{k}^{i}}$, with degree vector $\mathbf{d}=[d_{p}]$, $d_{p}=\sum_{q}w({\mathbf{x}}_{p},{\mathbf{x}}_{q})~{}\forall p$, and for all p.s.d. affinity matrices $\mathbf{W}=[w({\mathbf{x}}_{p},{\mathbf{x}}_{q})]$.
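The Ncut potentials can be computed directly from the affinity matrix, and the linear bound can be spot-checked for one cluster. The sketch below assumes the per-cluster term $-\frac{S_{k}^{t}\mathbf{W}S_{k}}{\mathbf{d}^{t}S_{k}}$, whose gradient with respect to $S_{k}$ has $p$-th component $d_{p}\frac{S_{k}^{t}\mathbf{W}S_{k}}{(\mathbf{d}^{t}S_{k})^{2}}-\frac{2\sum_{q}w({\mathbf{x}}_{p},{\mathbf{x}}_{q})s_{q,k}}{\mathbf{d}^{t}S_{k}}$. The small block-diagonal $\mathbf{W}$ is a hypothetical p.s.d. affinity chosen for illustration.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, s):
    return [dot(row, s) for row in W]


def ncut_cluster_term(W, d, s_k):
    """-(S_k^t W S_k) / (d^t S_k) for one cluster."""
    return -dot(s_k, matvec(W, s_k)) / dot(d, s_k)


def ncut_potentials(W, d, s_cur):
    """a_{p,k} for one cluster k, evaluated at the current column s_cur."""
    Ws = matvec(W, s_cur)
    assoc, vol = dot(s_cur, Ws), dot(d, s_cur)
    return [d_p * assoc / vol ** 2 - 2.0 * Ws_p / vol for d_p, Ws_p in zip(d, Ws)]


# Hypothetical p.s.d. affinity: two blocks with eigenvalues 1 and 3 each.
W = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 2.0]]
d = [sum(row) for row in W]          # degree vector
s_cur = [0.9, 0.8, 0.2, 0.1]         # current soft indicator of cluster k
a_k = ncut_potentials(W, d, s_cur)

rng = random.Random(0)
gaps = []
for _ in range(500):
    s_new = [rng.uniform(0.01, 1.0) for _ in range(4)]
    step = [n - c for n, c in zip(s_new, s_cur)]
    linear = ncut_cluster_term(W, d, s_cur) + dot(a_k, step)
    gaps.append(linear - ncut_cluster_term(W, d, s_new))
```

With a p.s.d. $\mathbf{W}$ the cluster term is concave, so the linearization stays above it for every sampled `s_new`. No eigendecomposition is involved, only matrix-vector products with $\mathbf{W}$.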

Appendix C Output clusters with respect to $\lambda$

In Fig. 4, we plot the output clusters of Fair K-means for increasing values of $\lambda$, for the synthetic data sets. When $\lambda=0$, we get the traditional clustering results of K-means without fairness. The result clearly has biased clusters, each corresponding fully to one of the demographic groups, with a balance measure equal to $0$. In the Synthetic dataset, the balance increases with $\lambda$ and eventually reaches the desired equal balance from a certain value of $\lambda$. We also observe the same trend in the Synthetic-unequal dataset, where the output clusters are found according to the prior demographic distribution $U=[0.75,0.25]$, with an almost null fairness error starting from a certain value of $\lambda$.

Bibliography

Arthur, D.; and Vassilvitskii, S. 2007. k-means++: The advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, 1027–1035. Society for Industrial and Applied Mathematics.
Backurs, A.; Indyk, P.; Onak, K.; Schieber, B.; Vakilian, A.; and Wagner, T. 2019. Scalable fair clustering. In International Conference on Machine Learning (ICML), 405–413.
Bera, S.; Chakrabarty, D.; Flores, N.; and Negahbani, M. 2019. Fair algorithms for clustering. In Advances in Neural Information Processing Systems, 4955–4966.
Buolamwini, J.; and Gebru, T. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, 77–91.
Celis, L. E.; Keswani, V.; Straszak, D.; Deshpande, A.; Kathuria, T.; and Vishnoi, N. K. 2018. Fair and Diverse DPP-Based Data Summarization. In International Conference on Machine Learning (ICML), 715–724.
Chierichetti, F.; Kumar, R.; Lattanzi, S.; and Vassilvitskii, S. 2017. Fair Clustering Through Fairlets. In Neural Information Processing Systems (NeurIPS), 5036–5044.
Csiszar, I.; and Körner, J. 2011. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press.
Donini, M.; Oneto, L.; Ben-David, S.; Shawe-Taylor, J.; and Pontil, M. 2018. Empirical Risk Minimization Under Fairness Constraints. In Neural Information Processing Systems (NeurIPS), 2796–2806.
