Metric Learning for Individual Fairness

Christina Ilvento

arXiv:1906.00250·cs.LG·April 3, 2020

TL;DR

This paper introduces a method to approximate similarity metrics for individual fairness in classification by leveraging human judgments, enabling practical application of fairness guarantees without predefined metrics.

Contribution

It proposes a framework for learning task-specific similarity metrics from limited human queries, including definitions, constructions, and learning procedures for generalization.

Findings

1. Effective metric approximation from limited human queries
2. Theoretical guarantees for generalization of learned metrics
3. Practical approach to implement individual fairness in classification

Equations

$f_{r}(u)$

$f_{r}^{\mathcal{T}}(x):=\arg\max_{t_{i}\in\mathcal{T}}\{t_{i}\leq\mathcal{D}(r,x)\}$

$\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[h_{r}(x,y)-\mathcal{D}(x,y)\geq\alpha]\leq\varepsilon$

$T_{t}^{r}(u):=\{1\text{ if }\mathcal{D}(r,u)\leq t\text{, }0\text{ otherwise}\}$

$\Pr_{x\sim\mathcal{U}}[h_{t}^{r}(x)\neq T_{t}^{r}(x)]\leq\varepsilon_{t}$

$\mathsf{LinearVote}(\mathcal{T},H_{\mathcal{T}}^{r},x):=\arg\max_{t_{i}}\sum_{t_{j}<t_{i}}(1-h_{t_{j}}^{r}(x))+\sum_{t_{j}\geq t_{i}}h_{t_{j}}^{r}(x)$

$\Pr_{v\sim\mathcal{U}}[\mathcal{D}(u,v)\leq\gamma]\geq b$

$\Pr_{u,v\sim\mathcal{U}\times\mathcal{U}}[\mathcal{D}(u,v)\geq\zeta]\geq p$

$\frac{1}{b}(1-b)^{m}\leq\delta$

$e^{-mb}\leq b\delta$

$m\geq\frac{1}{b}\ln(\frac{1}{b\delta})$

$\varepsilon_{t}=\frac{\varepsilon_{r}}{2|\mathcal{T}|}=\frac{\varepsilon_{R}}{2|R||\mathcal{T}|}=\frac{\varepsilon\alpha}{\frac{2}{b}\ln(\frac{2}{b\delta})}$

$\delta_{t}=\frac{\delta_{r}}{|\mathcal{T}|}=\frac{\delta_{R}}{|R||\mathcal{T}|}=\frac{\delta\alpha}{\frac{2}{b}\ln(\frac{2}{b\delta})}$

$\mathsf{O}_{\mathsf{TRIPLET}}^{\mathsf{TCTC}}(a,b,c):=\begin{cases}-1&\text{if diff}\leq\alpha_{L}\\-1\text{ or }[1\text{ if }\mathcal{D}(a,b)<\mathcal{D}(a,c)\text{, }0\text{ if }\mathcal{D}(a,c)\leq\mathcal{D}(a,b)]&\text{if diff}\in(\alpha_{L},\alpha_{H})\\1\text{ if }\mathcal{D}(a,b)<\mathcal{D}(a,c)\text{, }0\text{ if }\mathcal{D}(a,c)\leq\mathcal{D}(a,b)&\text{if diff}\geq\alpha_{H}\end{cases}$

$\mathsf{O}_{\mathsf{QUAD}}^{\mathsf{TCTC}}(a,b,c):=\begin{cases}-1&\text{if diff}\leq\alpha_{L}\\-1\text{ or }[1\text{ if }\mathcal{D}(a,b)<\mathcal{D}(x,y)\text{, }0\text{ if }\mathcal{D}(x,y)\leq\mathcal{D}(a,b)]&\text{if diff}\in(\alpha_{L},\alpha_{H})\\1\text{ if }\mathcal{D}(a,b)<\mathcal{D}(x,y)\text{, }0\text{ if }\mathcal{D}(x,y)\leq\mathcal{D}(a,b)&\text{if diff}\geq\alpha_{H}\end{cases}$

$\Pr_{x\sim M_{t}^{r}}[h_{t}^{r}(x)\neq T_{t}^{r}(x)]\leq\varepsilon_{t}$

$\Pr[|h_{r}(x,y)-\mathcal{D}(x,y)|>4\alpha_{T}]\leq\varepsilon_{r}$

$\varepsilon_{t}=\frac{\varepsilon_{r}}{2|\mathcal{T}|}=\frac{\varepsilon_{R}}{2|R||\mathcal{T}|}=\frac{2\varepsilon\alpha_{H}}{\frac{1}{b}\ln(\frac{2}{b\delta})}$

$\delta_{t}=\frac{\delta_{r}}{|\mathcal{T}|}=\frac{\delta_{R}}{|R||\mathcal{T}|}=\frac{4\delta\alpha_{H}}{\frac{1}{b}\ln(\frac{2}{b\delta})}$

Metric Learning for Individual Fairness

Christina Ilvento

John A Paulson School of Engineering and Applied Science

Harvard University

Cambridge, MA 02138

This work was supported in part by Microsoft Research and the Smith Family Fellowship. The author is grateful for the comments of Cynthia Dwork in the preparation of this manuscript.

Abstract

There has been much discussion concerning how “fairness” should be measured or enforced in classification. Individual Fairness [Dwork, Hardt, Pitassi, Reingold, Zemel, 2012], which requires that similar individuals be treated similarly, is a highly appealing definition as it gives strong treatment guarantees for individuals. Unfortunately, the need for a task-specific similarity metric has prevented its use in practice. In this work, we propose a solution to the problem of approximating a metric for Individual Fairness based on human judgments. Our model assumes access to a human fairness arbiter who is free of explicit biases and possesses sufficient domain knowledge to evaluate similarity. Our contributions include definitions for metric approximation relevant for Individual Fairness, constructions for approximations from a limited number of realistic queries to the arbiter on a sample of individuals, and learning procedures to construct hypotheses for metric approximations which generalize to unseen samples under certain assumptions of learnability of distance threshold functions.

1 Introduction
1.1 Model
1.2 Contributions
1.3 Preliminary terminology and definitions
1.4 Constructing submetrics from arbiter judgments
1.5 Choosing good representative elements
1.6 Generalizing arbiter judgments
1.7 Relaxing the query model
2 Related Work
3 Additional definitions and terminology
4 From human judgments to submetrics
4.1 Constructing metric consistent orderings
4.2 Constructing $\alpha-$ submetrics from orderings
5 Generalization
5.1 Learnability of threshold functions
5.2 Constructing submetric learners from threshold learners
6 Choosing Representatives
6.1 Metric structure dependent strategies.
6.2 Random representatives
6.3 Distance preservation via $\gamma-$ nets
6.4 Density and diffusion
6.5 Generalization with random representative sets
7 Relaxed query model
7.1 Submetrics from human judgements in the too close to call model
7.2 Generalization
8 Discussion
8.1 Summary of main results
8.2 Metric structure
8.3 Resolving disagreements between human fairness arbiters
8.4 Selection of human fairness arbiters
8.5 Query process and interface design
8.6 Arbiter agreement with submetrics
8.7 When arbiters agree but learning is hard
8.8 Comparison of submetrics

1 Introduction

Determining what it means for an algorithm or classifier to be “fair” and how to enforce any such determination has become a subject of considerable interest as automated decision-making increasingly takes the place of direct human judgment. One attractive definition proposed is Individual Fairness [4], which states that similar individuals should be treated similarly, where similarity is encoded in a task-specific metric.

Definition 1 (Individual Fairness [4]).

Given a universe $U$ , a metric $\mathcal{D}:U\times U\rightarrow[0,1]$ for a classification task with outcome set $O$ , and a distance measure $d:\Delta(O)\times\Delta(O)\rightarrow[0,1]$ , a randomized classifier $C:U\rightarrow\Delta(O)$ is Individually Fair if and only if for all $u,v\in U$ , $\mathcal{D}(u,v)\geq d(C(u),C(v))$ .

Individual Fairness is appealing because each person is assured that her treatment is similar to that of any person similar to her. (Footnote 1: By way of contrast, notions of fairness based on group-level statistics can only provide individuals with the guarantee that if they are treated poorly, either someone in a different group is also treated poorly or someone in their group is treated well. Furthermore, many popular notions of statistical group fairness conflict with each other and cannot be satisfied simultaneously [2, 13].) However, the value of this assurance critically depends on the extent to which the similarity metric $(\mathcal{D})$ faithfully represents society’s best understanding of what constitutes similarity for a given task. Thus, the most significant barrier to implementing Individual Fairness in practice is the need to construct a similarity metric for each classification setting.

In this work we set out a path for constructing metrics for Individual Fairness based on judgments made by a qualified, fair-minded “human fairness arbiter.” Our contributions include: (1) a framework for useful approximations to a metric for Individual Fairness; (2) a limited, realistic query model for determining the arbiter’s judgments of who is similar to whom; (3) a method for constructing approximations to the true metric with limited queries to the arbiter by using distances from a (set of) representative individual(s); and (4) a procedure for generalizing these approximations to unseen samples based on limited learnability assumptions. Throughout this work we make no assumption on the form of the metric or the features included in the learning procedure, with the clearly stated exception of Assumption 1 concerning learnability of threshold functions. As our results are built upon a series of sequential steps including new terminology and machinery, we first present an extended introduction to highlight the key concepts, logic and results. In Sections 3-7 these results are discussed in greater detail and formal theorem statements and proofs are presented. Related work is discussed in Section 2. Extended discussion of human fairness arbiters and the model is included in Section 8.

1.1 Model

In this work, we take the viewpoint that fairness is not well described by either accuracy or group statistics alone. Instead, we view fairness as a highly contextual property one can identify but not necessarily describe. (Footnote 2: [9] takes a similar approach in which a judge “knows it when she sees it,” but is not required to articulate why a decision is unfair.) Our goal is to produce a metric which results in similarity judgments with which fair-minded people would agree, rather than satisfying any particular statistical properties. (Footnote 3: We discuss different types of agreement, and the extent to which we fully achieve this goal, in Section 8.6.) The core of our model is the human fairness arbiter, a fair-minded individual who is free from explicit biases or arbitrary preferences, is motivated to engage ethically and honestly in the query protocol, and has sufficient knowledge and contextual understanding of who is similar to whom for a particular task. The arbiter is not expected to provide us a description or specification of the distance metric.

A critical part of learning metrics based on human judgments is determining the type of queries to ask in order to solicit consistent, fast responses. To that end, we assume that we cannot ask the arbiter to consider more than a few individuals at a time, e.g., it is not realistic to ask the arbiter to find the closest pair of elements in the universe.

We ask the arbiter to answer two types of queries in this work: relative distance queries (e.g., is $a$ closer to $b$ or to $c$ ?) and real-valued distance queries.

Definition 2 (Real-valued distance query).

$\mathsf{O}_{\mathsf{REAL}}(u,v):=\mathcal{D}(u,v)$ .

Definition 3 (Triplet query).

$\mathsf{O}_{\mathsf{TRIPLET}}(a,b,c):=\{1\text{ if }\mathcal{D}(a,b)<\mathcal{D}(a,c)\text{, }0\text{ if }\mathcal{D}(a,c)\leq\mathcal{D}(a,b)\}$ .
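To make the two query types concrete, the following is a minimal sketch of an arbiter simulated by a hidden metric. The class name `Arbiter`, the query counters, and the toy metric on real numbers are assumptions for illustration only; in the paper the arbiter is a human, not a function.

```python
from typing import Callable, Hashable

Metric = Callable[[Hashable, Hashable], float]


class Arbiter:
    """Hypothetical stand-in for the human fairness arbiter, backed by a hidden metric."""

    def __init__(self, metric: Metric):
        self._metric = metric
        self.real_queries = 0      # count of "expensive" real-valued queries
        self.triplet_queries = 0   # count of "easy" relative distance queries

    def o_real(self, u, v) -> float:
        """Real-valued distance query (Definition 2): returns D(u, v)."""
        self.real_queries += 1
        return self._metric(u, v)

    def o_triplet(self, a, b, c) -> int:
        """Triplet query (Definition 3): 1 if D(a, b) < D(a, c), else 0."""
        self.triplet_queries += 1
        return 1 if self._metric(a, b) < self._metric(a, c) else 0


def toy_metric(u: float, v: float) -> float:
    # Assumption: individuals are points on a line, distances capped at 1.
    return min(1.0, abs(u - v))


arbiter = Arbiter(toy_metric)
```

Note that a tie, where $a$ is equally far from $b$ and $c$, returns 0 under Definition 3, so the triplet answer alone does not distinguish "farther" from "equally far".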

Producing a consistent set of real-valued distances is not a natural judgment most people are accustomed to making, so we assume that real-valued queries are very “expensive” for the arbiter to answer. Furthermore, maintaining internal consistency may increase the query cost as the number of queries increases. Relative distance queries have been used successfully for human evaluation in image processing and computer vision, e.g. [19, 21], and we anticipate they will be significantly easier for the arbiter to evaluate. Demonstrating how to replace difficult queries with easy queries is a significant part of our contribution.

We make several simplifying assumptions about the nature of the human fairness arbiter in the main results of this work. (1) There is either one arbiter or all arbiters agree on all decisions. (2) The arbiter does not change her opinion over the query period. (3) The arbiter’s responses are consistent, i.e., if she answers that $a$ is closer to $b$ than it is to $c$ , her responses to real-valued queries will also reflect this relative judgment. (Footnote 4: Please see Section 8 for additional details.) For the majority of this work, we focus on the query model specified above, which requires the arbiter to answer with arbitrary precision. We also present a relaxed model which allows the arbiter to answer real-valued queries with bounded noise and does not require arbitrarily small distinctions in relative distance queries. The main results presented are replicated in the relaxed model. As the results are similar, we focus on the simpler exact model in the main presentation of our results. (Footnote 5: Extended discussion of the exact query model and a more general definition of relative queries is included in Section 3. The relaxed query model is discussed in detail in Section 7.)

1.2 Contributions

Approximating the metric by contracting. Our first key observation is that Individual Fairness only requires that we do not overestimate distances. This motivates our definition of a submetric, which is a contraction of the original metric and can be substituted for the original metric and still maintain Individual Fairness.

Constructing submetrics based on distances from representative elements. Taking the difference in distance to a single reference or “representative” point is one of the simplest ways to produce an underestimate of the distance between two elements. Submetrics based on distances from representative elements form the basis of all of our constructions, and although this may seem simplistic, it has a significant advantage when it comes to deciding which queries to ask the arbiter: ordering. An ordering of elements by increasing distance from the representative can be constructed with relative distance (easy) queries used as a comparator. Once this ordering is established, real-valued distances at a given granularity can be layered on top in a sublinear number of real-valued (hard) queries.

Choosing representatives. A single representative may not be sufficient to capture all relevant distance information, but combining the information from multiple representative elements can produce a more complete picture of the distances between all pairs of individuals. But which representatives should we choose to maximize distance preservation? We discuss a general, randomized approach and show that given certain properties of the metric, i.e. how tightly packed individuals are, a random set of representatives of reasonable size will have good distance preservation properties.

Generalizing submetrics to unseen samples. Once we have established how to construct a submetric for a fixed sample of elements, our next step is to generalize to unseen samples. Our results are based on an assumption that threshold functions, i.e. binary indicators of whether an element is closer to a representative than a given threshold, are efficiently learnable. We show how to combine threshold functions to simulate rounding distances to a representative and then exhibit appropriate parameters to construct an efficient combined learning procedure.

Relaxing arbiter requirements. Finally, we present a relaxation of the arbiter query model in which the arbiter (1) may respond to real-valued queries with arbitrary bounded noise and (2) is not required to make arbitrarily precise distinctions between distances and may instead declare relative comparisons to be “too close to call.” This model more closely matches the reality of human arbiters, and our results extend with improvements in query complexity at the cost of increased error magnitude.

1.3 Preliminary terminology and definitions

We refer to the universe of individuals as $U$ , a distribution over the universe of individuals as $\mathcal{U}$ , and the size of the universe as $|U|=N$ . We write $\mathcal{U}^{*}$ for the uniform distribution over $U$ . We assume $\mathcal{D}:U\times U\rightarrow[0,1]$ for simplicity. Individual Fairness does not require that distances between individuals be maintained exactly, only that they not be exceeded. This observation motivates our definition of a submetric which is a contraction of the true metric, i.e., it does not overestimate any distance beyond a small additive error term. (Footnote 6: This relaxation is very similar to the notion of $(d,\tau)$ metric fairness of [12] and approximate metric fairness of [15].)

Definition 4 ( $\alpha-$ submetric).

Given a metric $\mathcal{D}$ , $\mathcal{D}^{\prime}:U\times U\rightarrow[0,1]$ is an $\alpha$ -submetric of $\mathcal{D}$ if for all $u,v\in U$ , $\mathcal{D}^{\prime}(u,v)\leq\mathcal{D}(u,v)+\alpha$ .

Any classifier which satisfies the distance constraints of the submetric $\mathcal{D}^{\prime}$ will also satisfy those of $\mathcal{D}$ , modulo small additive error. (Footnote 7: As originally noted in [4], the distance measure need not be a true metric, i.e. it does not strictly need to obey triangle inequality or distinguish unequal elements.) Given an $\alpha$ -submetric it is possible to eliminate the additive error by taking $\max\{0,\mathcal{D}^{\prime}(x,y)-\alpha\}$ . On the other hand, we want to avoid contracting distances to the point of triviality. We say that a submetric is $(\beta,c)-$ nontrivial if a $\beta$ fraction of distances between pairs preserve at least a $c-$ fraction of their original distance. (Footnote 8: Nontriviality is defined over a product of identical distributions of elements in the universe. There is no general obstacle to extending our results to more complicated scenarios, but definitions of density (in Section 6) would need to be adjusted.)

Definition 5 ( $(\beta,c)-$ nontrivial).

Given a metric $\mathcal{D}$ , a submetric $\mathcal{D}^{\prime}$ of $\mathcal{D}$ is $(\beta,c)$ -nontrivial for the distribution $\mathcal{U}$ if $\Pr_{u,v\sim\mathcal{U}\times\mathcal{U}}\Big{[}\frac{\mathcal{D}^{\prime}(u,v)}{\mathcal{D}(u,v)}\geq c\Big{]}\geq\beta$ .
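Definitions 4 and 5 can be checked empirically on a small finite universe. The sketch below assumes the uniform distribution over the listed elements and, as a further assumption, treats pairs with $\mathcal{D}(u,v)=0$ (where the ratio is undefined) as not preserved. The function names are hypothetical.

```python
from itertools import product


def is_alpha_submetric(d_prime, d, universe, alpha, tol=1e-12):
    """Definition 4: D'(u, v) <= D(u, v) + alpha for every pair in the universe."""
    return all(d_prime(u, v) <= d(u, v) + alpha + tol
               for u, v in product(universe, repeat=2))


def nontrivial_fraction(d_prime, d, universe, c):
    """Fraction of ordered pairs, drawn uniformly, with D'(u, v) / D(u, v) >= c.

    Assumption: pairs at distance zero are counted as not preserved, which
    can only make the reported fraction more conservative.
    """
    pairs = list(product(universe, repeat=2))
    kept = sum(1 for u, v in pairs
               if d(u, v) > 0 and d_prime(u, v) >= c * d(u, v))
    return kept / len(pairs)


def d(u, v):
    return abs(u - v)


def d_half(u, v):
    # A deliberately contracted metric: every distance is halved.
    return 0.5 * d(u, v)


universe = [0.0, 0.2, 0.5, 0.9]
```

Here `d_half` is a 0-submetric of `d`, and it is $(\beta,c)$-nontrivial for $c=0.5$ with $\beta$ equal to the fraction of distinct pairs, but not for any larger $c$.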

1.4 Constructing submetrics from arbiter judgments

A core component of this work is constructing submetrics based on distance information (either exact or underestimated) from a single representative element. We define the representative submetric $\mathcal{D}_{r}$ in the following Lemma. (The proof follows from the triangle inequality.)

Lemma 1.

Given a representative $r$ , $\mathcal{D}_{r}(x,y):=|\mathcal{D}(r,x)-\mathcal{D}(r,y)|$ is a 0-submetric of $\mathcal{D}$ .
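The triangle-inequality argument behind Lemma 1 can be spelled out in a few lines. The derivation below is a sketch that uses only the symmetry of $\mathcal{D}$ and the triangle inequality, and introduces no notation beyond the lemma's own.

```latex
\begin{gather*}
\mathcal{D}(r,x) \leq \mathcal{D}(r,y) + \mathcal{D}(y,x)
  \;\Longrightarrow\; \mathcal{D}(r,x) - \mathcal{D}(r,y) \leq \mathcal{D}(x,y), \\
\mathcal{D}(r,y) \leq \mathcal{D}(r,x) + \mathcal{D}(x,y)
  \;\Longrightarrow\; \mathcal{D}(r,y) - \mathcal{D}(r,x) \leq \mathcal{D}(x,y), \\
\text{hence}\quad
\mathcal{D}_{r}(x,y) = |\mathcal{D}(r,x) - \mathcal{D}(r,y)| \leq \mathcal{D}(x,y) + 0 .
\end{gather*}
```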

Given a sample of $N$ individuals, $\mathcal{D}_{r}$ can be constructed from $O(N)$ queries to $\mathsf{O}_{\mathsf{REAL}}$ . Although $O(N)$ may seem good compared with the $O(N^{2})$ queries required to reconstruct the whole metric, it can be improved to $O(\log(N))$ by supplementing with relative distance queries. Our general strategy will be to show that (1) an ordering of elements by distance from a representative can be constructed using $\mathsf{O}_{\mathsf{TRIPLET}}$ as a comparator, and (2) given this ordering, the real-valued distances between each element and the representative can be closely approximated by labeling the ordering with distances at granularity $\alpha$ , which requires a sublinear number of real-valued queries. Algorithm 1 outlines this process. (Footnote 9: See Section 4, Algorithms 3 and 4, for the detailed specifications and analysis.)

Theorem 2 states that Algorithm 1 produces an $\alpha-$ submetric, which follows from observing that rounding $\mathcal{D}(r,x)$ and $\mathcal{D}(r,y)$ down by at most $\alpha$ results in an increase (or decrease) of at most $\alpha$ in $|\mathcal{D}(r,x)-\mathcal{D}(r,y)|$ . The bound of $O(N\log(N))$ relative distance queries follows from a straightforward analysis of sorting. The bound of $O(\max\{\frac{1}{\alpha},\log(N)\})$ real-valued queries is included in Section 4. Briefly, the analysis considers the maximum number of continuous ranges that, when split, result in one range with difference greater than $\alpha$ and one with less. In the worst case, this results in logarithmic dependency on $N$ or $\frac{1}{\alpha}$ .

Theorem 2.

Algorithm 1 produces an $\alpha-$ submetric of $\mathcal{D}$ which preserves $\mathcal{D}(r,u)$ for each $u\in U$ (with additive error $\leq\alpha$ ) from $O(\max\{\frac{1}{\alpha},\log(N)\})$ queries to $\mathsf{O}_{\mathsf{REAL}}$ and $O(N\log(N))$ queries to $\mathsf{O}_{\mathsf{TRIPLET}}$ .
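The two-phase strategy described above (sort with triplet queries, then label the ordering at granularity $\alpha$ with few real-valued queries) can be sketched as follows. This is an illustrative reconstruction, not the paper's Algorithm 1: the recursive range-splitting rule, the function names, and the caching of real-valued answers are assumptions, and the sketch makes no claim to match the stated query bounds exactly.

```python
from functools import cmp_to_key


def order_by_distance(r, elements, o_triplet):
    """Sort elements by increasing distance from r using only triplet queries."""
    def compare(x, y):
        if o_triplet(r, x, y):   # x is strictly closer to r than y
            return -1
        if o_triplet(r, y, x):   # y is strictly closer to r than x
            return 1
        return 0                 # x and y are equidistant from r
    return sorted(elements, key=cmp_to_key(compare))


def label_ordering(r, ordered, o_real, alpha):
    """Give each element an underestimate of D(r, element), off by at most alpha."""
    labels, cache = {}, {}

    def dist(i):
        # Each position is sent to the expensive real-valued oracle at most once.
        if i not in cache:
            cache[i] = o_real(r, ordered[i])
        return cache[i]

    def split(lo, hi):
        d_lo = dist(lo)
        if lo == hi or dist(hi) - d_lo <= alpha:
            # The whole range lies within alpha of its closest element,
            # so every element in it is rounded down to d_lo.
            for i in range(lo, hi + 1):
                labels[ordered[i]] = d_lo
            return
        mid = (lo + hi) // 2
        split(lo, mid)
        split(mid + 1, hi)

    if ordered:
        split(0, len(ordered) - 1)
    return labels


def representative_submetric(r, elements, o_real, o_triplet, alpha):
    """Return D'_r(x, y) = |label(x) - label(y)| built from the labeled ordering."""
    ordered = order_by_distance(r, elements, o_triplet)
    labels = label_ordering(r, ordered, o_real, alpha)
    return lambda x, y: abs(labels[x] - labels[y])
```

Because every label lies between $\mathcal{D}(r,x)-\alpha$ and $\mathcal{D}(r,x)$, the difference of two labels exceeds $|\mathcal{D}(r,x)-\mathcal{D}(r,y)|$ by at most $\alpha$, which is the property Theorem 2 relies on. Elements are assumed to be hashable so they can serve as dictionary keys.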

The submetric produced by Algorithm 1 preserves distances between $r$ and other elements well, as $\mathcal{D}^{\prime}_{r}(r,x)$ is rounded down by at most $\alpha$ , but we cannot make guarantees on distance preservation between arbitrary pairs without further information. For example, with only the information that $u$ and $v$ are equally distant from $r$ , it is impossible to distinguish whether the distance between $u$ and $v$ is zero or equal to twice their distance from $r$ . (See Figure 1.) Submetrics constructed based on different representatives preserve different information about the underlying metric, so we can construct more expressive submetrics by aggregating information from multiple representatives. Taking $\mathsf{maxmerge}(\{\mathcal{D}_{i}\},x,y):=\max_{i}\mathcal{D}_{i}(x,y)$ , it’s straightforward to show that if all $\mathcal{D}_{i}$ are submetrics of $\mathcal{D}$ , then the $\mathsf{maxmerge}$ of the set is also a submetric of $\mathcal{D}$ , and that the merge preserves the “best” distance known for each pair. (Footnote 10: Formal analysis of $\mathsf{maxmerge}$ and the proof of Lemma 1 appear in Section 3. The proof of Theorem 2 as well as a precise description of Algorithm 1 appear in Section 4.)
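The equidistant-pair example and the effect of $\mathsf{maxmerge}$ can be illustrated with a small sketch. It assumes individuals are points in the plane under Euclidean distance (all distances here are below 1) and uses exact representative submetrics rather than rounded ones; the helper names are hypothetical.

```python
import math


def representative_submetric(d, r):
    """Exact representative submetric D_r(x, y) = |D(r, x) - D(r, y)| from Lemma 1."""
    return lambda x, y: abs(d(r, x) - d(r, y))


def maxmerge(submetrics, x, y):
    """Pointwise maximum over a collection of submetrics."""
    return max(dm(x, y) for dm in submetrics)


# Assumption: a toy universe of 2-D points with Euclidean distance.
r = (0.0, 0.0)
u = (0.3, 0.0)
v = (-0.3, 0.0)
w = (0.0, 0.4)
points = [r, u, v, w]

d_r = representative_submetric(math.dist, r)
d_u = representative_submetric(math.dist, u)

# u and v are equidistant from r, so the representative r cannot tell them apart.
collapsed = d_r(u, v)
# Adding u as a second representative recovers the full distance between u and v.
recovered = maxmerge([d_r, d_u], u, v)
```

In this example `collapsed` is 0 while `recovered` equals the true distance 0.6, and the merged values never exceed the true distances for any pair in `points`.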

1.5 Choosing good representative elements

Although the $\mathsf{maxmerge}$ of submetrics based on multiple representatives is an improvement over a single representative, we still cannot make any guarantees about distances between pairs which do not include a representative. There are two approaches one might take to give non-triviality guarantees for arbitrary pairs: (1) develop specialized strategies for combining representative submetrics which depend on the structure of the metric, e.g., Euclidean distance, or (2) characterize generic randomized representative selection strategies. In this extended introduction, we focus on the randomized strategies for full generality.

Distance preservation via $\gamma-$ nets. The crux of the argument for nontriviality with random representatives is (1) a random set of representatives is likely to be “close to” a significant portion of the distribution $\mathcal{U}$ , and (2) we can bound the magnitude of underestimates based on the distance from a representative. Below, we formally define a $\gamma-$ net to capture the notion of being “close to” or “covering” a set of elements.

Definition 6.

A set $R\subseteq U$ is said to form a $\gamma-$ net for a subset $V\subseteq U$ under $\mathcal{D}$ if for all balls of radius $\gamma$ (determined by $\mathcal{D}$ ) containing at least one element $v\in V$ , the ball also contains $r\in R$ .

Intuitively, the distance between $r$ and $x$ will be nearly identical to the distance between a close neighbor of $r$ and $x$ , so we can conclude that if a set of representatives forms a $\gamma-$ net for a subset of $U$ , then pairs with at least one element in the net will have their original distance preserved up to a $2\gamma$ contraction. (Proofs of Lemmas 3 and 4 follow from triangle inequality.)

Lemma 3.

For all $u,v\in U\backslash\{r\}$ , $\mathcal{D}(u,v)-\mathcal{D}_{r}(u,v)\leq\min\{2\mathcal{D}(r,u),2\mathcal{D}(r,v)\}$ , where $\mathcal{D}_{r}(u,v):=|\mathcal{D}(r,u)-\mathcal{D}(r,v)|$ .
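A short derivation shows how Lemma 3 follows from the triangle inequality. It is a sketch that assumes, without loss of generality, that $\mathcal{D}(r,u)\leq\mathcal{D}(r,v)$, so that $\mathcal{D}_{r}(u,v)=\mathcal{D}(r,v)-\mathcal{D}(r,u)$.

```latex
\begin{align*}
\mathcal{D}(u,v) - \mathcal{D}_{r}(u,v)
  &\leq \big(\mathcal{D}(r,u) + \mathcal{D}(r,v)\big) - \big(\mathcal{D}(r,v) - \mathcal{D}(r,u)\big) \\
  &= 2\,\mathcal{D}(r,u) \\
  &= \min\{2\,\mathcal{D}(r,u),\, 2\,\mathcal{D}(r,v)\}.
\end{align*}
```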

Lemma 4.

If a set of representatives $R\subseteq U$ forms a $\gamma-$ net for $V\subseteq U$ , then for every pair $x,y\in V\times U$ there exists $r\in R$ such that $\mathcal{D}(x,y)-\mathcal{D}_{r}(x,y)\leq 2\gamma$ , where $\mathcal{D}_{r}(x,y):=|\mathcal{D}(r,x)-\mathcal{D}(r,y)|$ .

Of course, forming a $\gamma-$ net for an arbitrary $\gamma$ isn’t enough to give a good nontriviality guarantee. To understand how representatives which form a $\gamma-$ net will preserve distances, we define density and diffusion below to characterize the relevant properties of the metric and distribution. The notion of $(\gamma,a,b)-$ dense is intended to capture the weight ( $a$ ) of elements that have a significant weight ( $b$ ) on their close neighbors (distance $\gamma$ ) under $\mathcal{U}$ as a way to characterize how likely it is that a randomly chosen representative will be $\gamma$ -close to a significant fraction of elements.

Definition 7 ( $(\gamma,a,b)-$ dense).

Given a distribution $\mathcal{U}$ over $U$ , a metric $\mathcal{D}$ is $(\gamma,a,b)-$ dense for $\mathcal{U}$ if there exists a subset $A\subseteq U$ with weight $a$ under $\mathcal{U}$ such that for all $u\in A$ , $\Pr_{v\sim\mathcal{U}}[\mathcal{D}(u,v)\leq\gamma]\geq b$ .

$(p,\zeta)-$ diffuse, defined below, captures what fraction of distances can tolerate a contraction proportional to $\zeta$ without becoming trivial.

Definition 8 ( $(p,\zeta)-$ diffuse).

Given a distribution $\mathcal{U}$ , a metric $\mathcal{D}$ is $(p,\zeta)-$ diffuse if the fraction of distances between pairs of elements drawn from $\mathcal{U}\times\mathcal{U}$ that are at least $\zeta$ is at least $p$ , i.e. $\Pr_{u,v\sim\mathcal{U}\times\mathcal{U}}[\mathcal{D}(u,v)\geq\zeta]\geq p$ .
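For intuition, the density and diffusion parameters of Definitions 7 and 8 can be measured directly on a finite sample. The sketch below assumes the uniform distribution over the listed elements; the function names and the two-cluster toy universe are illustrative only.

```python
def dense_weight(d, universe, gamma, b):
    """Largest a for which the metric is (gamma, a, b)-dense under the uniform distribution.

    An element u qualifies if at least a b fraction of the universe
    (including u itself) lies within distance gamma of u.
    """
    n = len(universe)
    qualifying = sum(
        1 for u in universe
        if sum(1 for v in universe if d(u, v) <= gamma) / n >= b
    )
    return qualifying / n


def diffuse_fraction(d, universe, zeta):
    """Largest p for which the metric is (p, zeta)-diffuse: Pr[D(u, v) >= zeta]."""
    n = len(universe)
    far = sum(1 for u in universe for v in universe if d(u, v) >= zeta)
    return far / (n * n)


def d(u, v):
    return abs(u - v)


# Assumption: two tight clusters near 0 and 1 plus one isolated point at 0.5.
universe = [0.0, 0.01, 0.02, 0.5, 0.98, 0.99, 1.0]
a = dense_weight(d, universe, gamma=0.05, b=0.4)
p = diffuse_fraction(d, universe, zeta=0.4)
```

In this toy universe the six clustered points each have at least a 0.4 fraction of the universe within distance 0.05, while the isolated point does not, so `a` is 6/7. Most pairs span the clusters, so a large fraction of distances is at least 0.4.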

A metric can be described by many combinations of density and diffusion parameters, as illustrated in Figure 2. These parameters are highly related, and we generally consider the combination of $(\gamma,a,b)-$ dense and $(p,\frac{2\gamma}{1-c})-$ diffuse. Although $\frac{2\gamma}{1-c}$ initially seems an arbitrary quantity, it indicates that a $p-$ fraction of pairs will have distances preserved by a factor of $c$ if the maximum contraction for those pairs is no more than $2\gamma$ . Thus the values of $\gamma$ and $c$ , which in turn dictate $p$ , $a$ , and $b$ , (assuming $\zeta=\frac{2\gamma}{1-c}$ ) can loosely be seen as a tradeoff between how many pairs will have distance preservation guarantees and how significant the guarantees will be.

Nontriviality properties of $\gamma-$ nets. Next, we relate the magnitude of $\gamma$ to the non-triviality properties of the $\mathsf{maxmerge}$ of a set of representative submetrics. Lemma 5 states that a submetric based on a set of representatives which form a $\gamma-$ net for a subset of $U$ will have nontriviality properties related to the diffusion properties of $\mathcal{D}$ and the weight of the subset in $\mathcal{U}$ .

Lemma 5.

If a set of representatives $R\subseteq U$ forms a $\gamma-$ net for weight $w$ of $\mathcal{U}$ and $\mathcal{D}$ is $(p,\frac{2\gamma}{1-c})-$ diffuse on $\mathcal{U}$ , then the submetric $\mathcal{D}_{R}(x,y):=\mathsf{maxmerge}(\{\mathcal{D}_{r}|r\in R\},x,y)$ is $(p^{\prime},c)-$ nontrivial for $\mathcal{U}$ , where $p^{\prime}\geq p-(1-w)^{2}$ .

The proof follows from a worst-case analysis of the fraction of pairs with at least one element in the net with distance large enough that a $2\gamma$ contraction leaves at least a $c$ -fraction of the original distance. The nontriviality guarantees of Lemma 5 are conservative, and we stress that our goal is to show the possibility of positive results, rather than achieving optimal performance or guarantees.

Representative set size. We now consider how likely it is that a set of random representatives drawn from $\mathcal{U}$ will form a $\gamma-$ net for a significant fraction of $\mathcal{U}$ . Lemma 6 characterizes the necessary representative set size based on the density and diffusion properties of the metric. The proof follows from characterizing the probability of “hitting” a sufficient weight of the distribution with a sample of a given size, and arguing that no element in our subset of interest can be more than $3\gamma$ far from any of the “hitting” elements.

Lemma 6.

Given access to unlimited queries to the arbiter, if a metric $\mathcal{D}$ is $(\gamma,a,b)-$ dense and $(p,\frac{6\gamma}{1-c})-$ diffuse on $\mathcal{U}$ , then a random set of representatives $R$ of size at least $\frac{1}{b}\ln(\frac{1}{b\delta})$ will produce a $(p-(1-a)^{2},c)$ -nontrivial submetric for $\mathcal{U}$ with probability at least $1-\delta$ .
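The size bound in Lemma 6 is easy to evaluate numerically. The helper below is a hypothetical convenience function; rounding up to an integer and returning at least one representative are assumptions of this sketch. The final check uses the standard inequality $(1-b)^{m}\leq e^{-mb}$, under which $m\geq\frac{1}{b}\ln(\frac{1}{b\delta})$ is sufficient for $\frac{1}{b}(1-b)^{m}\leq\delta$.

```python
import math


def representative_set_size(b: float, delta: float) -> int:
    """Smallest integer m with m >= (1/b) * ln(1 / (b * delta)), and at least 1."""
    if not (0 < b <= 1 and 0 < delta <= 1):
        raise ValueError("b and delta must lie in (0, 1]")
    return max(1, math.ceil(math.log(1.0 / (b * delta)) / b))


b, delta = 0.1, 0.05
m = representative_set_size(b, delta)
# Failure-probability expression implied by the bound, evaluated at the chosen m.
failure_bound = (1.0 / b) * (1.0 - b) ** m
```

For $b=0.1$ and $\delta=0.05$ this gives $m=53$, and `failure_bound` is below $\delta$. The required size grows roughly like $\frac{1}{b}$, so sparser neighborhoods call for more representatives.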

Random sampling is not the only method to construct a $\gamma-$ net, and our strategy is motivated by simplicity as much as generality. In practice it may be more efficient to use the distance information from previously selected representatives to inform the selection of the next representative. For example, one could omit or down-weight any candidates that are already very close to existing representatives, or use a greedy strategy. (Footnote 11: Section 6 contains proofs for Lemmas 3-6 and extended discussion of specialized strategies for representative selection, in particular strategies taking advantage of known metric structure.)

1.6 Generalizing arbiter judgments

Now that we have shown how to construct a nontrivial submetric with ongoing access to the arbiter, we consider the problem of generalizing the arbiter’s responses to unseen samples. Our goal is to construct efficient learners for submetrics as in Valiant’s Probably Approximately Correct (PAC) model of learning [18]. However, we do not want to be too prescriptive about the submetric concept class, particularly about the representation of elements. Instead, we will make an assumption about the learnability of threshold functions and construct learning procedures for submetrics using threshold functions as building blocks without any additional direct access to labeled or unlabeled samples. More formally, our goal is to produce an efficient submetric learner, defined below.

Definition 9 (Efficient submetric learner).

A learning procedure is an efficient $\alpha-$ submetric learner if for all $\varepsilon,\delta\in(0,1]$ , given access to labeled examples, with probability at least $1-\delta$ over the randomness of the sampling and the learning procedure, it produces a hypothesis $h_{r}$ such that $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[h_{r}(x,y)-\mathcal{D}(x,y)\geq\alpha]\leq\varepsilon$ in time $O(poly(\frac{1}{\varepsilon},\frac{1}{\delta}))$ .

To show how to construct an efficient submetric learner, we first formalize our assumption of learnability of threshold functions. Next, we show how to combine the threshold function hypotheses for each representative to simulate rounding the distance between the representative and each element down to the nearest threshold. Finally, we specify the appropriate parameters for each component to achieve the desired bounds.

Learnability of threshold functions. Assumption 1 (below) states that for every representative, there exists a set of thresholds and a learner for each threshold which, with high probability, produces an accurate hypothesis for the threshold function which generalizes to unseen samples. (The formal statement of Assumption 1 is included in Section 5.1; “with high probability” always refers to the probability over the randomness of sampling and the learner.) We first formally define a threshold function, which is a binary indicator of whether a particular element $u\in U$ is within distance $t\in[0,1]$ of a representative $r$ : $T_{t}^{r}(u):=\{1\text{ if }\mathcal{D}(r,u)\leq t\text{, }0\text{ otherwise}\}$ .

Assumption 1.

(Informal) Given a metric $\mathcal{D}$ and a representative $r$ , there exists a set of thresholds $\mathcal{T}$ such that $t\in[0,1]$ for all $t\in\mathcal{T}$ , $0\in\mathcal{T}$ , and $|\mathcal{T}|=O(1)$ , and for every $t\in\mathcal{T}$ there exists an efficient learner $L_{t}^{r}(\varepsilon_{t},\delta_{t})$ which for all $\varepsilon_{t},\delta_{t}\in(0,1]$ , with probability at least $1-\delta_{t}$ , produces a hypothesis $h_{t}^{r}$ such that $\Pr_{x\sim\mathcal{U}}[h_{t}^{r}(x)\neq T_{t}^{r}(x)]\leq\varepsilon_{t}$ in time $O(poly(\frac{1}{\varepsilon_{t}},\frac{1}{\delta_{t}}))$ with access to labeled samples of $T_{t}^{r}(u\sim\mathcal{U})$ for any distribution $\mathcal{U}$ .

Constructing submetric learners from threshold learners. Given Assumption 1, our next step is to determine how to combine the threshold learners into a learner for the representative submetric. (Notice that training data for the threshold function learners can be produced by post-processing the outputs of Algorithm 1.) Our strategy is similar to the rounding strategy used in Algorithm 1, using the threshold functions to identify the largest threshold which underestimates the distance between the representative and the element under consideration. The $\mathsf{LinearVote}$ mechanism takes in a set of hypotheses for the thresholds and outputs the threshold with which the most hypotheses agree. When all hypotheses output the correct value of their corresponding threshold function, $\mathsf{LinearVote}$ is equivalent to rounding $\mathcal{D}(r,x)$ down to the nearest threshold.

Definition 10 ( $\mathsf{LinearVote}$ ).

Given an ordered set of thresholds, $\mathcal{T}=\{t_{1},t_{2},\ldots,t_{|T|}\}$ , and a set of hypotheses $H_{\mathcal{T}}^{r}=\{h_{t_{1}}^{r},h_{t_{2}}^{r},\ldots,h_{t_{|T|}}^{r}\}$ , one corresponding to each threshold function, $\mathsf{LinearVote}(\mathcal{T},H_{\mathcal{T}}^{r},x):=\operatorname*{\arg\max}_{t_{i}}\sum_{t_{j}<t_{i}}(1-h_{t_{j}}^{r}(x))+\sum_{t_{j}\geq t_{i}}h_{t_{j}}^{r}(x)$ .

Algorithm 2 combines all of our constructions thus far to create an efficient submetric learner: it chooses a set of representatives, learns threshold functions for each threshold for each representative, and combines the resulting hypotheses using $\mathsf{LinearVote}$ and $\mathsf{maxmerge}$ to produce a single submetric hypothesis. (Algorithm 2 summarizes Algorithms 6-8; see Sections 5 and 6.) Theorem 7 builds on the result of Lemma 6 and concludes that the parametrization of Algorithm 2 results in an efficient submetric learner.

Theorem 7.

[Informal] Given a distance metric $\mathcal{D}$ and a distribution $\mathcal{U}$ over the universe, if there exists a set of thresholds $\mathcal{T}$ with maximum gap $\alpha_{\mathcal{T}}$ and efficient learners $\{L_{t_{i}\in\mathcal{T}}^{r}\}$ as in Assumption 1, and $\mathcal{D}$ is $(\gamma,a,b)-$ dense and $(p,\frac{6\gamma+\alpha_{\mathcal{T}}}{1-c})-$ diffuse on $\mathcal{U}$ , then there exists an efficient $\alpha_{\mathcal{T}}$ -submetric learner which produces a hypothesis $h_{R}$ such that $h_{R}$ is $(p-(1-a)^{2}-\varepsilon,c)-$ nontrivial for $\mathcal{U}$ .

The proof of Theorem 7 follows from an analysis of the error parameter propagation. (See the proofs of Theorems 19 and 20 for detailed analysis.) We briefly give some intuition for the analysis and implications of the theorem. First, the error magnitude $\alpha_{\mathcal{T}}$ follows from the same single direction rounding argument as for Algorithm 1. The error probability follows from noticing that at least one threshold function must be in error for one of the elements to result in an error in $\mathsf{LinearVote}$ . The failure probability “budget” is split evenly between failure to choose a good set of representatives (Line 1) as specified in Lemma 6, and failure of the underlying learning procedures (Line 3) derived by union bound. Compared with Lemma 6, the diffusion and nontriviality parameters are adjusted to take into account the additional rounding error magnitude of $\alpha_{\mathcal{T}}$ introduced by $\mathsf{LinearVote}$ and the combined hypothesis error probability $\varepsilon$ . In practice, we expect that the set of thresholds which are learnable are unlikely to occur at regular intervals. Post-processing is a valuable tool to reduce the magnitude of $\alpha_{\mathcal{T}}$ (by re-mapping the threshold values in step 4 to reduce the maximum gap), but comes at the cost of reduced nontriviality guarantees.

The desired query complexity to the arbiter follows from basic analysis of the parameters. However, the query complexity bound can be improved significantly by observing that no independence of errors between threshold functions is assumed, allowing a single call to Algorithm 1 for each representative (rather than $|\mathcal{T}|$ calls). The dependence on $|R|$ can also be improved to logarithmic by sorting a single merged list of (representative, element) pairs, but we defer detailed discussion to Sections 5 and 6.

1.7 Relaxing the query model

Our results extend to a relaxed model in which arbiters are not expected to make arbitrarily small distinctions between distances or individuals and may answer real-valued queries with bounded noise. The relaxed model assumes that there are two fixed constants, $\alpha_{L}$ , the minimum precision with which the arbiter can distinguish elements or distances, and $\alpha_{H}$ , a bound on the magnitude of the (potentially biased) noise in the arbiter’s real-valued responses. For any comparisons with difference smaller than $\alpha_{L}$ , the arbiter declares the elements indistinguishable or the difference “too close to call.” The model allows for a “gray area” between $\alpha_{L}$ and $\alpha_{H}$ in which the arbiter may either respond with the true answer or “too close to call.” For any differences larger than $\alpha_{H}$ , the arbiter responds with the true answer.
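To make the three response regimes concrete, the following Python sketch simulates a relaxed triplet query. It is an illustration under stated assumptions rather than the formal model of Section 7: the function name `relaxed_triplet`, the use of `None` for “too close to call,” and the coin flip inside the gray area are all hypothetical choices, since the model only says that either response is permitted there.

```python
import random

TOO_CLOSE = None  # assumed encoding of the "too close to call" response

def relaxed_triplet(metric, a, b, c, alpha_l, alpha_h, rng=random):
    """Relaxed comparison of D(a, b) against D(a, c); assumes alpha_l <= alpha_h."""
    gap = metric(a, c) - metric(a, b)
    if abs(gap) < alpha_l:
        return TOO_CLOSE              # below the arbiter's precision
    if abs(gap) <= alpha_h and rng.random() < 0.5:
        return TOO_CLOSE              # gray area: either response is allowed
    return 1 if gap > 0 else 0        # true answer, as in the exact triplet query

metric = lambda u, v: abs(u - v)      # toy metric on points in [0, 1]
print(relaxed_triplet(metric, 0.0, 0.1, 0.9, alpha_l=0.05, alpha_h=0.2))
print(relaxed_triplet(metric, 0.0, 0.5, 0.51, alpha_l=0.05, alpha_h=0.2))
```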

For the most part, our results translate to the relaxed model with minimal modification to the logic of the proofs to handle two-sided error in real-valued queries. Interestingly, the real-valued query complexity improves to constant: the worst-case behavior in Algorithm 1 is avoided because the arbiter “knows” not to worry about inconsequentially small distances. However, this does result in additional error magnitude, so the improved query complexity does not come for free. Furthermore, unlike in the exact model, we will not necessarily be able to label a sample with perfect accuracy for every threshold function learner due to the bi-directional error. To handle this labeling problem, we modify the distribution of samples presented to each learner, eliminating samples whose labels are ambiguous, again resulting in increased error. Formal results in the relaxed model are discussed in Section 7.

2 Related Work

Metric learning is a richly studied area. Two surveys [1, 14] provide an overview of the literature unrelated to Individual Fairness. There is a significant body of literature concerned with learning distance metrics from human feedback in practice, using heuristic optimization, for applications such as image similarity and feature identification, including [8], [16], [22], [10], [21], [19].

With respect to constructing metrics for Individual Fairness or generalizing individually fair classifiers to unseen samples, we highlight four recent works. [9] considers an online linear contextual bandits setting, and imposes a fairness constraint that similar contexts should be treated similarly, where similarity is assumed to be a Mahalanobis distance. [9] takes a similar view of human feedback for learning fairness to ours, but their online setting, fairness model, metric assumptions and goals are different. The work most similar to ours, that of Jung et al. [11], has very similar motivation, but their model is restricted to equivalence (or near equivalence) queries. They consider the problem of arbiter consistency and explicitly consider the multiple arbiter model in their empirical work. The equivalence model considered in [11] can be expressed in the relaxed arbiter model of this work (i.e., allowing a large too-close-to-call region and allowing for an appropriately sized noise parameter to place no requirement on arbiters reporting values other than “not equal”). Applying the results of this work to the multi-arbiter empirical model proposed by Jung et al., either by attempting to elicit more nuanced judgments beyond equivalence or to better understand the properties of the equivalence-only model versus the more general relaxed model, is an exciting direction for future work. [15] and [12] consider the problem of generalizing Individual Fairness with differing levels of oracle access to the metric, and one could view our results as providing a path for efficiently generating metric samples for these settings. Our notion of a submetric is similar to $(d,\tau)$ metric fairness of [12], and our definition of efficient submetric learner is very close to the definition of “approximately metric-fair” of [15]. We view the present work as a complement to these directions.

With respect to query types and human fairness judges, as previously noted, Gillen et al. [9] consider a similar model in which a human judge ‘knows unfairness when she sees it.’ Dasgupta and Luby [3] also consider the benefits of “partial feedback” from a human expert in clustering applications, with very similar motivation to our query type choices.

The problems of ranking and scoring are closely related to the problem of combining arbiter judgments to construct orderings based on relative queries. In this work, we did not address how to handle differences in orderings between arbiters. However, there is a significant body of work concerned with aggregating or combining orderings or rankings from multiple sources. For example, Dwork, Kumar, Naor and Sivakumar consider the problem of combining rankings from multiple sources in [6]. Volkovs, Larochelle and Zemel consider rank aggregation as a supervised learning problem, and consider questions of crowd-sourcing in [20]. Dwork, Kim, Reingold, Rothblum and Yona consider the fairness and accuracy properties of rankings in [5].

3 Additional definitions and terminology

In addition to the preliminary terminology and definitions introduced in Section 1, there are several additional definitions and lemmas which will prove useful in the more technical discussion of later sections. In particular, we introduce one additional relative query type, the concept of a consistent underestimator, and explicit characterizations of representative submetrics and representative set submetrics.

Expanded Query Model

As mentioned in Section 1, we restrict ourselves to queries involving a limited number of elements. In addition to the triplet query, we consider a second type of relative query, the quad query, which asks the arbiter to compare distances between two distinct pairs.

Definition 2 (Real query).

$\mathsf{O}_{\mathsf{REAL}}(u,v):=\mathcal{D}(u,v)$

Definition 3 (Triplet query).

$\mathsf{O}_{\mathsf{TRIPLET}}(a,b,c):=\{1\text{ if }\mathcal{D}(a,b)<\mathcal{D}(a,c)\text{, }0\text{ if }\mathcal{D}(a,c)\leq\mathcal{D}(a,b)\}$ .

Definition 11 (Quad query).

$\mathsf{O}_{\mathsf{QUAD}}(a,b,x,y):=\{1\text{ if }\mathcal{D}(a,b)>\mathcal{D}(x,y)\text{, }0\text{ otherwise}\}$ .

Relative distance queries have been used successfully for human evaluation in image processing and computer vision, e.g. [19, 21]. A quad query does require the human fairness arbiter to consider an additional element compared with a triplet query. This may result in additional overhead for the human fairness arbiter, particularly in cases where examining each individual requires significant time. As such, we consider quad queries slightly more costly than triplet queries, but still significantly less costly than real-valued distance queries. Furthermore, the binary nature of the response makes parallelizing relative distance queries between several human fairness arbiters (who are in agreement) straightforward.

For brevity in algorithms and theorem statements we will refer to $\mathsf{O}_{\mathsf{REAL}}$ , $\mathsf{O}_{\mathsf{TRIPLET}}$ and $\mathsf{O}_{\mathsf{QUAD}}$ as the interfaces to the arbiter.
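For experimentation, the three interfaces can be simulated by a stand-in arbiter that wraps a known metric and counts queries. The sketch below is a testing aid under that assumption, not a model of a human arbiter; the class name `Arbiter`, its method names, and the `counts` dictionary are hypothetical.

```python
class Arbiter:
    """Toy stand-in for the fairness arbiter, backed by a known metric D."""

    def __init__(self, metric):
        self.metric = metric
        self.counts = {"real": 0, "triplet": 0, "quad": 0}

    def o_real(self, u, v):
        """Real query: the distance D(u, v)."""
        self.counts["real"] += 1
        return self.metric(u, v)

    def o_triplet(self, a, b, c):
        """Triplet query: 1 if D(a, b) < D(a, c), otherwise 0."""
        self.counts["triplet"] += 1
        return 1 if self.metric(a, b) < self.metric(a, c) else 0

    def o_quad(self, a, b, x, y):
        """Quad query: 1 if D(a, b) > D(x, y), otherwise 0."""
        self.counts["quad"] += 1
        return 1 if self.metric(a, b) > self.metric(x, y) else 0

arbiter = Arbiter(lambda u, v: abs(u - v))  # toy metric on points in [0, 1]
print(arbiter.o_triplet(0.0, 0.2, 0.5), arbiter.o_quad(0.0, 0.5, 0.1, 0.2))
print(arbiter.counts)
```

Tracking counts per query type makes it easy to compare the cost profiles discussed above, where real-valued queries are treated as the most expensive.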

Additional definitions and lemmas.

We now introduce several additional definitions and lemmas which simplify discussion in later technical sections.

A core component of this work is that submetrics can be constructed based on distance information from a single representative element. We refer to the submetric constructed from differences in distance to a particular representative $r$ as $\mathcal{D}_{r}$ .

Definition 12 (Representative Submetric).

Given a representative $r\in U$ , we define the submetric $\mathcal{D}_{r}(u,v):=|\mathcal{D}(r,u)-\mathcal{D}(r,v)|$ for all $u,v\in U$ .

The following straightforward lemma and proof, restated from the introduction, show explicitly that $\mathcal{D}_{r}$ as defined is a $0-$ submetric of $\mathcal{D}$ .

Lemma 1 (Restatement).

$\mathcal{D}_{r}(u,v):=|\mathcal{D}(r,u)-\mathcal{D}(r,v)|$ for $r,u,v\in U$ is a $0-$ submetric of $\mathcal{D}$ .

Proof.

The proof follows from triangle inequality. $\mathcal{D}(r,u)\leq\mathcal{D}(r,v)+\mathcal{D}(u,v)$ , thus $\mathcal{D}(r,u)-\mathcal{D}(r,v)\leq\mathcal{D}(u,v)$ . Likewise, $\mathcal{D}(r,v)-\mathcal{D}(r,u)\leq\mathcal{D}(u,v)$ . Thus $|\mathcal{D}(r,u)-\mathcal{D}(r,v)|\leq\mathcal{D}(u,v)$ . ∎
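A quick numerical sanity check of Lemma 1 is sketched below in Python. The metric (Euclidean distance on the unit square, scaled into $[0,1]$ ), the random sample, and the helper names are assumptions chosen for illustration; any metric would do.

```python
import itertools
import math
import random

def metric(u, v):
    """Toy metric (assumed): Euclidean distance on the unit square, scaled into [0, 1]."""
    return math.dist(u, v) / math.sqrt(2)

def representative_submetric(r):
    """D_r(u, v) = |D(r, u) - D(r, v)| for a fixed representative r."""
    return lambda u, v: abs(metric(r, u) - metric(r, v))

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(30)]
r = points[0]
d_r = representative_submetric(r)

# Pairs on which D_r would overestimate D; the triangle inequality rules these out.
violations = [(u, v) for u, v in itertools.combinations(points, 2)
              if d_r(u, v) > metric(u, v) + 1e-12]
print(len(violations))
```

The small tolerance only guards against floating-point rounding; the inequality itself is exact.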

Lemma 1 shows that $\mathcal{D}_{r}(x,y)$ constructed from exact evaluations of $|\mathcal{D}(r,x)-\mathcal{D}(r,y)|$ is a submetric, but in practice we will want to construct submetrics from approximate evaluations of $\mathcal{D}(r,x)$ and $\mathcal{D}(r,y)$ . Just as a submetric is a contraction of the true metric, a consistent underestimator is a contraction of the distance between a representative and other elements of the universe. The key property of a consistent underestimator is that distances are contracted consistently, i.e. a weak ordering of distances from $r$ is preserved.

Definition 13 (Consistent Underestimator).

Given a universe $U$ and a metric $\mathcal{D}:U\times U\rightarrow[0,1]$ , a function $f_{r}:U\rightarrow[0,1]$ is said to be an $\alpha-$ consistent underestimator for $r$ with respect to $\mathcal{D}$ if for all $u,v\in U$

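1. $f_{r}(u)\leq\mathcal{D}(r,u)$ ,
2. $\mathcal{D}(r,u)\leq\mathcal{D}(r,v)$ implies $f_{r}(u)\leq f_{r}(v)$ , and
3. $|f_{r}(u)-f_{r}(v)|\leq|\mathcal{D}(r,u)-\mathcal{D}(r,v)|+\alpha$ .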

We define the maximum contraction of a consistent underestimator $c_{max}:=\max_{u,v\in U\times U}\mathcal{D}(u,v)-|f_{r}(u)-f_{r}(v)|$ .

Analogous to the construction of $\mathcal{D}_{r}$ , an $\alpha-$ consistent underestimator for a representative $r$ can also be used to construct an $\alpha-$ submetric, denoted $\mathcal{D}_{r}^{\prime}$ . Figure 3 illustrates the difference in the exact evaluation of $\mathcal{D}(r,x)$ versus the consistent underestimator.

Definition 14 (Representative Consistent Underestimator Submetric).

Given a representative $r$ and $f_{r}$ , an $\alpha-$ consistent underestimator for $\mathcal{D}_{r}$ , we define the $\alpha-$ submetric $\mathcal{D}_{r}^{\prime}(u,v):=|f_{r}(u)-f_{r}(v)|$ for all $u,v\in U$ .

In the following lemma and proof, we explicitly state and show that the construction of $\mathcal{D}_{r}^{\prime}$ , as specified in Definition 14, results in an $\alpha-$ submetric.

Lemma 8.

Given an $\alpha-$ consistent underestimator $f_{r}$ for $r$ with respect to $\mathcal{D}$ , $\mathcal{D}_{r}^{\prime}(u,v):=|f_{r}(u)-f_{r}(v)|$ is an $\alpha-$ submetric of $\mathcal{D}$ .

Proof.

Notice that $|f_{r}(u)-f_{r}(v)|\leq|\mathcal{D}(r,u)-\mathcal{D}(r,v)|+\alpha$ by Definition 13 (3), and that $|\mathcal{D}(r,u)-\mathcal{D}(r,v)|\leq\mathcal{D}(u,v)$ by triangle inequality. ∎

We use “submetric” to refer to submetrics in general with unspecified $\alpha$ , and $0-$ submetric to explicitly reference submetrics with no additive error. Of course, given an $\alpha$ -submetric it is possible to produce a $0-$ submetric via postprocessing.

Proposition 9.

Given an $\alpha$ -submetric $\mathcal{D}^{\prime}$ of $\mathcal{D}$ , the submetric $\mathcal{D}_{1}^{\prime}(x,y):=\max\{0,\mathcal{D}^{\prime}(x,y)-\alpha\}$ is a $0-$ submetric of $\mathcal{D}$ .
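As a short worked check of Proposition 9, assume (as in the proof of Lemma 8) that an $\alpha$ -submetric satisfies $\mathcal{D}^{\prime}(x,y)\leq\mathcal{D}(x,y)+\alpha$ for all $x,y\in U$ . The argument is then one line:

$$\mathcal{D}_{1}^{\prime}(x,y)=\max\{0,\mathcal{D}^{\prime}(x,y)-\alpha\}\leq\max\{0,\mathcal{D}(x,y)\}=\mathcal{D}(x,y).$$

Both terms inside the maximum are at most $\mathcal{D}(x,y)$ , since $\mathcal{D}$ is nonnegative. For instance, with $\alpha=0.1$ , $\mathcal{D}(x,y)=0.30$ and $\mathcal{D}^{\prime}(x,y)=0.38$ , the post-processed value is $0.28\leq 0.30$ . The price is that every pair with $\mathcal{D}^{\prime}(x,y)\leq\alpha$ is mapped to $0$ .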

Now that we have specified submetrics based on exact distances from a representative and a consistent underestimator for a representative, we consider the nontriviality properties of these submetrics. Notice that although the overestimate magnitude $\alpha$ is independent of $r$ , the distances preserved are highly dependent on the choice of $r$ . (See Figure 1.) $\mathcal{D}_{r}$ exactly preserves the distance between $r$ and every $u\in U$ , so we can conclude that $\mathcal{D}_{r}$ is $(\frac{1}{N},1)-$ nontrivial for $\mathcal{U}^{*}$ . (Notice that $r$ has a $\frac{1}{N}$ probability of selection in $\mathcal{U}^{*}$ ). Likewise, $\mathcal{D}_{r}^{\prime}$ with maximum contraction $c_{max}$ will preserve $\mathcal{D}(r,u)-c_{max}$ for all $u\in U$ . Thus we can relate nontriviality for a distribution $\mathcal{U}$ to $\Pr_{u\sim\mathcal{U}}[\mathcal{D}(r,u)>c_{max}]$ . However we cannot make guarantees on distance preservation for distances between arbitrary pairs in $\mathcal{U}\times\mathcal{U}$ under $\mathcal{D}_{r}$ or $\mathcal{D}_{r}^{\prime}$ without further information. For example, $u$ and $v$ may be equally distant from $r$ , so $\mathcal{D}_{r}(u,v)=0$ , but may also be equally distant from each other. Up until Section 6, we will conservatively consider nontriviality only as a function of distances preserved between pairs in $U\times\{r\}$ to focus our attention on learning approximations of $\mathcal{D}_{r}$ and $\mathcal{D}_{r}^{\prime}$ which generalize to unseen samples. In Section 6, we return to this question and formulate the necessary properties of $\mathcal{U}$ and $\mathcal{D}$ to reason more generally about distance preservation and nontriviality.

As previously illustrated in Figure 1, submetrics constructed based on different representatives will preserve different information about the true underlying metric. We therefore consider constructing submetrics by aggregating information from multiple representatives. The following lemma and corollaries state that arbitrary mixing of submetrics, for example taking the maximum distance between a pair of elements in a set of submetrics, will result in a valid submetric and furthermore, the resulting submetric will have at most the additive error of the input submetrics.

Lemma 10.

Given a set of submetrics $\{\mathcal{D}_{i}|i\in[k]\}$ for $\mathcal{D}$ , for any arbitrary mapping $F:U\times U\rightarrow[k]$ , $\mathcal{D}_{merge}(u,v):=\mathcal{D}_{F(u,v)}(u,v)$ is a submetric of $\mathcal{D}$ .

Proof.

Notice that for all $i\in[k]$ , $\mathcal{D}_{i}(u,v)\leq\mathcal{D}(u,v)$ , and thus $\mathcal{D}_{merge}$ is also a submetric. ∎

Corollary 10.1.

Given a set of submetrics $\{\mathcal{D}_{i}|i\in[k]\}$ for $\mathcal{D}$ , define $\mathsf{maxmerge}(\{\mathcal{D}_{i}|i\in[k]\},u,v):=\max_{i\in[k]}\mathcal{D}_{i}(u,v)$ . The $\mathsf{maxmerge}$ of a set of submetrics of $\mathcal{D}$ is a submetric of $\mathcal{D}$ .

Corollary 10.2.

Given a set of $\alpha-$ submetrics $\{\mathcal{D}_{i}^{\prime}|i\in[k]\}$ for $\mathcal{D}$ , the $\mathsf{maxmerge}$ of $\{\mathcal{D}_{i}^{\prime}\}$ is an $\alpha-$ submetric of $\mathcal{D}$ .

Throughout this work, we will use the $\mathsf{maxmerge}$ of a set of representative submetrics as a core of our constructions. Below, we define $\mathcal{D}_{R}$ and $\mathcal{D}_{R}^{\prime}$ based on a set of representatives $R$ .

Definition 15 (Representative Set Submetric).

Given a set of representatives $R\subseteq U$ , we define the representative set submetric $\mathcal{D}_{R}(u,v):=\mathsf{maxmerge}(\{\mathcal{D}_{r}|r\in R\},u,v)$ and representative set consistent underestimator submetric $\mathcal{D}_{R}^{\prime}(u,v):=\mathsf{maxmerge}(\{\mathcal{D}_{r}^{\prime}|r\in R\},u,v)$ for all $u,v\in U$ .
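The benefit of merging can be seen on a small example. The Python sketch below builds two representative submetrics over an assumed toy metric (scaled Euclidean distance on the unit square) and merges them; the points and helper names are illustrative only.

```python
import math

def metric(u, v):
    """Toy metric (assumed): Euclidean distance on the unit square, scaled into [0, 1]."""
    return math.dist(u, v) / math.sqrt(2)

def representative_submetric(r):
    """D_r(u, v) = |D(r, u) - D(r, v)| for a fixed representative r."""
    return lambda u, v: abs(metric(r, u) - metric(r, v))

def maxmerge(submetrics, u, v):
    """Largest distance reported by any of the submetrics for the pair (u, v)."""
    return max(d(u, v) for d in submetrics)

u, v = (0.0, 0.5), (1.0, 0.5)
r1, r2 = (0.5, 0.5), (0.0, 0.5)  # r1 is equidistant from u and v; r2 coincides with u
d_r1, d_r2 = representative_submetric(r1), representative_submetric(r2)

print(d_r1(u, v))                    # r1 alone cannot separate u and v
print(maxmerge([d_r1, d_r2], u, v))  # adding r2 recovers the distance
print(metric(u, v))
```

Here the first representative collapses the pair to distance zero, while the merged submetric matches $\mathcal{D}(u,v)$ and, by Corollary 10.1, can never exceed it.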

4 From human judgments to submetrics

In this section we consider the problem of determining which and how many queries to ask of our human fairness arbiter in order to construct a submetric for a pre-specified universe $U$ . This setting can be viewed as the problem of learning a metric over a fixed universe (e.g., determining a metric over the entire set of college applicants in a particular year), as a process for generating training data to learn a submetric which generalizes to unseen samples as in Section 5, or as input to any other method for learning fairly with access to a sample of distances, e.g. [12, 15].

Naively, we could construct $\mathcal{D}_{r}$ with $O(N)$ queries to $\mathsf{O}_{\mathsf{REAL}}$ by simply querying for every distance from the representative $r$ . Furthermore, $\mathcal{D}_{R}$ can be constructed in the same way by issuing $O(|R|N)$ queries to $\mathsf{O}_{\mathsf{REAL}}$ to construct each representative submetric and then merging. Although the linear dependence on $N$ may seem good compared with $O(N^{2})$ , we anticipate that the cost of real-valued queries is high and increases with the number of queries. Although the number of queries is linear in $N$ , the cost in terms of human effort may not be.

We now work towards constructing a submetric from a sublinear number of real-valued queries by supplementing with $O(N\log(N))$ triplet queries, at the cost of introducing bounded additive error. Our general strategy will be to show that given an ordering consistent with the metric, we can learn a submetric from a constant or sublinear number of queries to $\mathsf{O}_{\mathsf{REAL}}$ by rounding distances from each representative down to fixed thresholds. More concretely, a representative consistent ordering for $r$ is an ordered list of elements from smallest to largest distance from $r$ .

Definition 16 (Representative-consistent ordering).

An ordering $\mathcal{O}=\{r,x_{1},x_{2},\ldots\}$ of elements of $U$ is consistent with the representative element $r$ with respect to the metric $\mathcal{D}$ if for all $i<j$ , $\mathcal{D}(r,x_{i})\leq\mathcal{D}(r,x_{j})$ .

Given the notion of a representative consistent ordering, we now show that rounding down to “threshold” distances at granularity $\alpha$ is sufficient to produce an $\alpha-$ submetric. Threshold rounding is also very useful for preserving distances and can be helpful for generalization, as we will see in Sections 5 and 6. We now formally define a Threshold Consistent Underestimator and prove a bound on the maximum contraction and additive error.

A threshold consistent underestimator is the function which rounds down the distance between an element $x\in U$ and a fixed representative $r$ to the nearest threshold in a prespecified set.

Definition 17 ( $\alpha-$ consistent threshold underestimator).

Given a universe $U$ , a metric $\mathcal{D}:U\times U\rightarrow[0,1]$ , a representative $r\in U$ , and an ordered set of distinct thresholds $\mathcal{T}=\{0,t_{1},\ldots,t_{k}\}$ (for constant $k$ ) where $t_{i}\in[0,1]$ ,

$f_{r}^{\mathcal{T}}(x):=\operatorname*{\arg\max}_{t_{i}\in\mathcal{T}}\{t_{i}\leq\mathcal{D}(r,x)\}$

is the threshold consistent underestimator wrt $\mathcal{D},r,$ and $\mathcal{T}$ . We refer to the maximum distance between any adjacent thresholds in $\mathcal{T}$ as $\alpha_{\mathcal{T}}:=\max_{i\in[|\mathcal{T}|-1]}t_{i+1}-t_{i}$ .
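Rounding a distance down to the nearest threshold is a one-line lookup once the thresholds are sorted. The Python sketch below illustrates this with the standard `bisect` module; the function names and the evenly spaced threshold set are assumptions made for the example.

```python
import bisect

def threshold_underestimate(thresholds, distance):
    """Largest threshold t_i with t_i <= distance.

    Assumes thresholds are sorted ascending and contain 0, so a value always exists.
    """
    idx = bisect.bisect_right(thresholds, distance) - 1
    return thresholds[idx]

def max_gap(thresholds):
    """Maximum distance between adjacent thresholds (alpha_T in the text)."""
    return max(b - a for a, b in zip(thresholds, thresholds[1:]))

thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
for d in (0.0, 0.6, 0.75, 0.99):
    print(d, threshold_underestimate(thresholds, d))
print(max_gap(thresholds))
```

Each distance is reduced by strictly less than the gap to the next threshold, which is the source of the $\alpha_{\mathcal{T}}$ bound on both contraction and additive error.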

It is simplest to consider $\mathcal{T}$ to consist of a set of evenly spaced thresholds at granularity $\alpha_{\mathcal{T}}$ , although the analysis does not depend on this and certainly allows varied threshold spacing. Lemma 11 formally states that an $\alpha_{\mathcal{T}}$ -consistent threshold underestimator $f_{r}^{\mathcal{T}}$ is an $\alpha_{\mathcal{T}}$ -consistent underestimator that has contraction of distances between an element and the representative $r$ of at most $c_{max}=\alpha_{\mathcal{T}}$ .

Lemma 11.

Let $f_{r}^{\mathcal{T}}(u)$ for $u\in U$ be an $\alpha_{\mathcal{T}}-$ consistent underestimator, where $\alpha_{\mathcal{T}}=\max_{i\in[|\mathcal{T}|-1]}t_{i+1}-t_{i}$ . Then the submetric $\mathcal{D}_{r}^{\prime}(u,v):=|f_{r}^{\mathcal{T}}(u)-f_{r}^{\mathcal{T}}(v)|$ is an $\alpha_{\mathcal{T}}-$ submetric with maximum contraction with respect to $\mathcal{D}_{r}$ bounded by $\alpha_{\mathcal{T}}$ , i.e. $\mathcal{D}_{r}-\mathcal{D}_{r}^{\prime}\leq\alpha_{\mathcal{T}}$ .

Proof.

By definition, $f_{r}^{\mathcal{T}}(u)$ satisfies conditions 1 and 2 of a consistent underestimator. Notice that $u$ and $v$ have distances from $r$ reduced from $\mathcal{D}(r,u)$ and $\mathcal{D}(r,v)$ by rounding down by at most $\alpha_{\mathcal{T}}$ , and thus $|f_{r}^{\mathcal{T}}(u)-f_{r}^{\mathcal{T}}(v)|\in[\mathcal{D}_{r}(u,v)-\alpha_{\mathcal{T}},\mathcal{D}_{r}(u,v)+\alpha_{\mathcal{T}}]$ , satisfying the third condition of a consistent underestimator and the bound on the maximum contraction of $\mathcal{D}_{r}^{\prime}$ .

∎

This property of threshold consistent underestimators implies that if we can construct an ordering of the elements with respect to their distance from the representative and then label the elements at regular intervals, then we can produce a consistent underestimator.

4.1 Constructing metric consistent orderings

We can construct a metric consistent ordering by using $\mathsf{O}_{\mathsf{TRIPLET}}$ as a comparator, as $\mathsf{O}_{\mathsf{TRIPLET}}(r,x,y)$ indicates which of $x$ or $y$ has greater distance from $r$ . Using such a comparator, we can build an ordered list via binary search. This procedure is detailed in Algorithm 3. (In practice, there may be many other simple to evaluate query types which can also be used to produce an ordering. We focus on Triplet Queries as they have some existing usage in the literature, but these results can generalize to any query type which can be used to generate an ordering or as a comparator.)

Lemma 12.

Given a universe $U$ and a representative $r\in U$ , Algorithm 3 produces a representative consistent ordering $L$ for $r$ from $O(N\log(N))$ queries to $\mathsf{O}_{\mathsf{TRIPLET}}$ .

The proof follows from a straightforward analysis of the binary search procedure with $\mathsf{O}_{\mathsf{TRIPLET}}$ used for comparisons.

Proof.

We consider correctness and query complexity separately.

Query complexity.

Notice that $\mathsf{BinaryInsert}$ is called $N$ times, once for each element. Each recursive call to $\mathsf{BinaryInsert}$ eliminates at least half of the sets in $L$ under consideration, and so $\mathsf{BinaryInsert}$ has recursion depth of $O(\log(N))$ . Each recursive call to $\mathsf{BinaryInsert}$ makes a single call to $\mathsf{O}_{\mathsf{TRIPLET}}$ . Thus the total number of queries to $\mathsf{O}_{\mathsf{TRIPLET}}$ is $O(N\log(N))$ .

Correctness.

Each element is inserted into an ordered list via binary search, and as such every element earlier in the list is at least as close to $r$ as any element later in the list. ∎
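The binary-insertion idea can be sketched in a few lines of Python. This is a reconstruction from the prose description, not the paper's Algorithm 3: it keeps a flat list rather than grouping ties, and the names `make_triplet_oracle` and `consistent_ordering` as well as the toy metric are assumptions.

```python
def make_triplet_oracle(metric):
    """Toy O_TRIPLET backed by a known metric; also counts the queries issued."""
    calls = {"n": 0}

    def o_triplet(a, b, c):
        calls["n"] += 1
        return 1 if metric(a, b) < metric(a, c) else 0

    return o_triplet, calls

def consistent_ordering(universe, r, o_triplet):
    """Order elements by nondecreasing distance from r using only triplet queries."""
    ordering = []
    for x in universe:
        if x == r:
            continue
        lo, hi = 0, len(ordering)
        while lo < hi:  # binary search for the insertion point of x
            mid = (lo + hi) // 2
            if o_triplet(r, x, ordering[mid]) == 1:  # x strictly closer to r
                hi = mid
            else:
                lo = mid + 1
        ordering.insert(lo, x)
    return [r] + ordering

metric = lambda u, v: abs(u - v)  # toy metric on points in [0, 1]
universe = [0.9, 0.1, 0.5, 0.0, 0.3, 0.7]
o_triplet, calls = make_triplet_oracle(metric)
ordering = consistent_ordering(universe, 0.0, o_triplet)
print(ordering, calls["n"])
```

Each insertion costs at most a logarithmic number of triplet queries in the current list length, which matches the $O(N\log(N))$ bound of Lemma 12.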

4.2 Constructing $\alpha-$ submetrics from orderings

Algorithm 4, below, outlines the process of labeling and ordering by distance from the representative at a particular granularity, $\alpha$ . Algorithm 4 repeatedly splits the input ordering into contiguous ranges of elements until the difference between the distances of the first and last elements in the range to the representative is at most $\alpha$ . Once each range has reached the appropriate size, the distance between each element in the range and the representative is then set to the minimum distance in its range, which maintains a weak ordering of distances from the representative and corresponds to rounding $\mathcal{D}(r,x)$ down by no more than $\alpha$ . (All arrays are indexed from [math] in all algorithms.)

Lemma 13 states that given a representative consistent ordering, an $\alpha-$ submetric can be constructed via Algorithm 4 with $O(\max\{\frac{1}{\alpha},\log(N)\})$ queries to $\mathsf{O}_{\mathsf{REAL}}$ . Algorithm 4 utilizes the representative consistent ordering to make fewer queries to $\mathsf{O}_{\mathsf{REAL}}$ by labeling elements in the ordering with distances at granularity $\alpha$ from $r$ and rounding intermediate elements to produce an $\alpha-$ consistent threshold underestimator, which is then used to construct an $\alpha-$ submetric. (In Algorithm 4 we use “Set” and “Initialize” to mean setting or initializing a global copy of $f_{r}$ to avoid tedious bookkeeping. We also use $\mathsf{MidpointOf}$ to specify the midpoint function, which chooses the midpoint for odd length lists and rounds down for even length lists.)
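The recursive splitting described above can be sketched as follows. This Python code is reconstructed from the prose only and may differ from the paper's Algorithm 4 in its details: the function name `label_by_splitting`, the dictionary used for $f_{r}$ , and the choice to re-query range endpoints are all assumptions.

```python
def label_by_splitting(ordering, r, o_real, alpha):
    """Label each element with the minimum distance in its final range.

    ordering: elements sorted by nondecreasing distance from r.
    Returns (f_r, number of real-valued queries issued).
    """
    f_r = {}
    queries = 0

    def dist(i):
        nonlocal queries
        queries += 1
        return o_real(r, ordering[i])

    def split(lo, hi, d_lo, d_hi):
        if d_hi - d_lo <= alpha:  # range is narrow enough: round everything down
            for i in range(lo, hi + 1):
                f_r[ordering[i]] = d_lo
            return
        mid = (lo + hi) // 2      # at most two new real queries per split
        split(lo, mid, d_lo, dist(mid))
        split(mid + 1, hi, dist(mid + 1), d_hi)

    if ordering:
        last = len(ordering) - 1
        split(0, last, dist(0), dist(last))
    return f_r, queries

o_real = lambda u, v: abs(u - v)  # toy metric on points in [0, 1]
ordering = [i / 100 for i in range(101)]
f_r, queries = label_by_splitting(ordering, 0.0, o_real, alpha=0.1)
print(queries, len(ordering))
```

On this evenly spread toy input the number of real-valued queries is well below the number of elements, and every label underestimates the true distance by at most $\alpha$ .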

Lemma 13.

Given a universe $U$ , and an ordering $\mathcal{O}$ consistent with a representative $r\in U$ for a metric $\mathcal{D}$ , Algorithm 4 produces an $\alpha-$ submetric of $\mathcal{D}$ which preserves the distance between each element $u\in U$ and $r$ (with additive error $\alpha$ ) from $O(\max\{\frac{1}{\alpha},\log(N)\})$ queries to $\mathsf{O}_{\mathsf{REAL}}$ .

Proof.

We address query complexity and error magnitude separately for clarity.

Query complexity. At most 2 queries to $\mathsf{O}_{\mathsf{REAL}}$ are made per call of $\mathsf{SplitList}$ . Thus to analyze query complexity, it is sufficient to analyze the number of calls to $\mathsf{SplitList}$ . There are three conditions in which $\mathsf{SplitList}$ makes additional recursive calls:

1. $\mathsf{SplitList}$ makes two calls which immediately terminate, i.e. both sides of the split represented ranges with $d_{top}-d_{bottom}\leq\alpha$ .

2. $\mathsf{SplitList}$ makes two calls which do not immediately terminate, i.e. both sides of the split represented ranges with $d_{top}-d_{bottom}>\alpha$ .

3. $\mathsf{SplitList}$ makes one immediately terminating call and one not immediately terminating call, i.e. one side of the split represented a range of $>\alpha$ and the other $\leq\alpha$ .

Notice that at any point we have identified some number of ranges of size at least $\alpha$ , call this number $k$ , and some number of elements left to be labeled, call this number $m$ . There are at most $\frac{1}{\alpha}+1$ disjoint continuous ranges of at least size $\alpha$ in $[0,1]$ , so $k\leq\frac{1}{\alpha}+1$ . Likewise, for additional calls to be made $m$ must be greater than $0$ .

Consider how each call type changes $k$ and $m$ :

1. Type 1 calls decrease $m$ by $\frac{N}{2^{i}}$ where $i$ is the recursion depth of the call, as every element in the current range will be labeled in the next step, and may increase $k$ by at most 1.

2. Every type 2 call increases $k$ by 1, as an existing range of size $>\alpha$ is split into two disjoint ranges of size $>\alpha$ .

3. Type 3 calls decrease $m$ by $\frac{N}{2^{i+1}}$ , as $\frac{1}{2}$ of the current range will be labeled in the next step.

If all of the calls to $\mathsf{SplitList}$ are type 1 or 2, then at most $O(\frac{1}{\alpha})$ calls are made as there are at most $O(\frac{1}{\alpha})$ disjoint continuous ranges of length $\alpha$ in $[0,1]$ . If all calls to $\mathsf{SplitList}$ are type 1 or type 3, then at most $\log(N)$ calls are made as at least half of the elements in the range are labeled by the subsequent terminating call(s).

If there are a mix of all three types, notice that there can still be at most $O(\frac{1}{\alpha})$ calls in the entire recursive tree of type 2. Thus it remains to consider how mixing type 3 calls with type 2 calls impacts the total number of calls.

As a warm-up, suppose a type 3 call is issued at depth $i$ . If all of its children are type 1 or 3 calls, then there can be at most $O(\log(\frac{N}{2^{i+1}}))$ children, as the parent call and each child call labels at least $\frac{1}{2}$ of its range. Now suppose a type 2 call is issued at depth $i$ . If its children are all type 1 or 3 calls, then there can be at most $O(2\log(\frac{N}{2^{i+1}}))$ children, as the parent call spawns two sub-trees with initial size $\frac{N}{2^{i+1}}$ as opposed to just one. Therefore we know the worst case sequence of calls includes both type 2 calls and type 3 calls.

We now show that the worst case recursion tree has no type 2 calls as children of type 3 calls. Suppose for the sake of contradiction that a valid recursion tree $T$ has a node $A$ at depth $i$ of type 3 with a child $B$ of type 2. Recall that type 3 nodes have only one child which does not immediately terminate, but type 2 nodes have two. Call $B$’s children $B_{1}$ and $B_{2}$. At depth $i$, $m$ decreases by $\frac{N}{2^{i+1}}$, as half of its elements are set to be labeled by the type 3 node $A$. At depth $i+1$, no additional elements are set to be labeled, but $k$ increases to $k+1$.

Now consider an alternative tree $T^{\prime}$ which is identical to $T$ in every way except: (1) node $A$ is changed to type 2, (2) node $A$ has two new children $A^{\prime}_{1}$ and $A^{\prime}_{2}$ of type 3, and (3) $A^{\prime}_{1}$’s non-terminating child is $B_{1}$ and $A^{\prime}_{2}$’s non-terminating child is $B_{2}$. At depth $i$, $k$ increases to $k+1$. At depth $i+1$, $A^{\prime}_{1}$ and $A^{\prime}_{2}$ each set $\frac{N}{2^{i+2}}$ elements to be labeled, so $m$ decreases by $\frac{N}{2^{i+1}}$. Thus the changes to $k$ and $m$ in $T^{\prime}$ match those in $T$, so $T^{\prime}$ is a valid recursion tree, but it exceeds the number of calls in $T$ by one.

Thus the worst case recursion tree will have some constant number of type 2 nodes in the highest levels which transition to type 3 and 1 nodes in the deeper levels. Suppose the type 2 nodes reach depth $\rho$, where $2^{\rho}$ is bounded by $O(\frac{1}{\alpha})$ as the number of type 2 nodes is a constant bounded by $O(\frac{1}{\alpha})$. Then there will be $2^{\rho}$ nodes at depth $\rho$ with $\frac{N}{2^{\rho}}$ elements in the range of each node. Each node can have at most $O(2\log(\frac{N}{2^{\rho+1}}))$ type 1 or 3 descendants, so the total number of nodes in the recursion tree is $O(2^{\rho+1}(\log(N)-\log(2^{\rho+1})))=O(\log(N))$.

However, the worst case analysis above must consider the most pathological cases. Notice that for every type 3 query made, there must have been half of the elements in the range in a clump of distances from $r$ with less than $\alpha$ difference and the other half with distance greater than $\alpha$ . If distances from each representative are distributed more smoothly, then this is unlikely to happen too many times.

Overestimate Error

To reason about the error, notice that $f_{r}(x_{i})\in[\mathcal{D}(r,x_{i})-\alpha,\mathcal{D}(r,x_{i})]$ , as each element’s distance from $r$ is rounded down by at most $\alpha$ . Thus $f_{r}$ is an $\alpha-$ consistent underestimator (and also a threshold consistent underestimator) and the final construction of $\mathcal{D}_{r}^{\prime}$ is an $\alpha-$ submetric by Lemma 8. ∎
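The recursion analyzed in this proof can be sketched in a few lines. This is a minimal illustration rather than the paper's Algorithm 4 verbatim: it assumes the ordering is a list sorted by nondecreasing distance from $r$, it models $\mathsf{O}_{\mathsf{REAL}}$ as a hypothetical callable `o_real(x)` returning $\mathcal{D}(r,x)$, and it splits each range at its midpoint. The names `split_list` and `o_real` and the query cache are introduced here for illustration only.

```python
from typing import Callable, Dict, Hashable, List, Tuple


def split_list(
    ordering: List[Hashable],
    o_real: Callable[[Hashable], float],
    alpha: float,
) -> Tuple[Dict[Hashable, float], int]:
    """Label every element with a rounded-down distance from r.

    `ordering` is assumed sorted by nondecreasing distance from r, and
    `o_real(x)` is a hypothetical stand-in for a real-valued query D(r, x).
    Returns the labels and the number of distinct real-valued queries made.
    """
    if alpha < 0:
        raise ValueError("alpha must be nonnegative")
    labels: Dict[Hashable, float] = {}
    cache: Dict[int, float] = {}  # index in ordering -> queried distance

    def query(i: int) -> float:
        if i not in cache:
            cache[i] = o_real(ordering[i])
        return cache[i]

    def recurse(lo: int, hi: int) -> None:
        d_bottom, d_top = query(lo), query(hi)
        if d_top - d_bottom <= alpha:
            # Terminating call: round the whole range down to d_bottom,
            # which loses at most alpha for any element in the range.
            for i in range(lo, hi + 1):
                labels[ordering[i]] = d_bottom
            return
        mid = (lo + hi) // 2
        recurse(lo, mid)
        recurse(mid + 1, hi)

    if ordering:
        recurse(0, len(ordering) - 1)
    return labels, len(cache)


# Toy universe (an assumption): element i sits at distance i/100 from r.
true_dist = {i: i / 100 for i in range(101)}
ordering = sorted(true_dist, key=lambda x: true_dist[x])
labels, num_queries = split_list(ordering, lambda x: true_dist[x], alpha=0.25)
```

In this sketch every label lies in $[\mathcal{D}(r,x)-\alpha,\mathcal{D}(r,x)]$ by construction, and only the endpoints of each visited range are queried, which is where the savings over labeling all $N$ elements come from.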

The primary benefit of a sublinear number of queries to $\mathsf{O}_{\mathsf{REAL}}$ is that the human fairness arbiter needs to maintain consistency with a smaller set of previous outputs. Furthermore, human fairness arbiters may only be able to answer real-valued queries to within some minimum granularity, and stating the granularity up front may help them avoid wasting time verifying the consistency of ultimately inconsequentially small distance adjustments. (Footnote 18: See Section 7 for a more complete treatment of a model in which the arbiter has limited distinguishing power.) The following theorem, which states that a representative consistent underestimator submetric $\mathcal{D}_{r}^{\prime}$ can be constructed in a sublinear number of queries to $\mathsf{O}_{\mathsf{REAL}}$ and $O(N\log(N))$ queries to $\mathsf{O}_{\mathsf{TRIPLET}}$, is an immediate consequence of Lemmas 12 and 13.

Theorem 14.

Given access to $\mathsf{O}_{\mathsf{REAL}}$ and $\mathsf{O}_{\mathsf{TRIPLET}}$ , an $\alpha-$ submetric can be constructed from $O(\max\{\frac{1}{\alpha},\log(N)\})$ queries to $\mathsf{O}_{\mathsf{REAL}}$ and $O(N\log(N))$ queries to $\mathsf{O}_{\mathsf{TRIPLET}}$ which preserves distances (up to the additive error) from a representative $r$ .

As before, we can also expand the expressiveness of the submetric by using $\mathsf{maxmerge}$ , while still maintaining the same small additive error bound. Naively, this could be accomplished in $O(|R|\max\{\frac{1}{\alpha},\log(N)\})$ queries to $\mathsf{O}_{\mathsf{REAL}}$ given orderings for a set of representatives $R$ by applying Algorithm 4 independently on each representative’s ordering. However, the linear dependence on $|R|$ can be improved by using our third query type, quad queries.

To see this, notice that the orderings, $\{\mathcal{O}_{r}|r\in R\}$ can be merged into a single ordering by distance from representative using quad queries. To compare two elements from different lists, $\mathsf{O}_{\mathsf{QUAD}}((r_{i},x),(r_{j},y))$ will suffice to determine which is closer to its respective representative. Thus, we can use any standard sorted list merging approach to combine the sorted lists with respect to each specific representative $\{\mathcal{O}_{r}|r\in R\}$ into a single sorted list $\mathcal{O}_{R}$ of (element, representative) pairs sorted by distance of the element from its corresponding representative with $O(|R|N\log(|R|))$ queries to $\mathsf{O}_{\mathsf{QUAD}}$ . The logic of Algorithm 4 operating on this list of pairs goes through unchanged except for the representative used in the query to $\mathsf{O}_{\mathsf{REAL}}$ , and some bookkeeping to separate the labeled and rounded list of pairs back into individual representative orderings. Algorithm 5 outlines this process.
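The merge described here can be sketched as a standard multi-way merge whose comparator is a quad query. The code below is an illustration under stated assumptions, not the paper's Algorithm 5: `o_quad` is a hypothetical callable that returns `True` when $\mathcal{D}(r_{i},x)\leq\mathcal{D}(r_{j},y)$, pairs are stored as (representative, element), and the toy line metric used to exercise it is made up for the example.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Pair = Tuple[Hashable, Hashable]  # (representative, element)


def merge_orderings(
    orderings: Dict[Hashable, List[Hashable]],
    o_quad: Callable[[Pair, Pair], bool],
) -> List[Pair]:
    """Merge per-representative orderings into one sorted list of pairs.

    `orderings[r]` is assumed sorted by nondecreasing distance from r.
    `o_quad((r_i, x), (r_j, y))` is a hypothetical quad-query oracle that
    returns True when D(r_i, x) <= D(r_j, y).
    """
    merged: List[Pair] = []
    # The frontier holds the next unmerged position of each ordering, kept
    # sorted by distance, so it never has more than |R| entries.
    frontier: List[Tuple[Hashable, int]] = []

    def pair_at(entry: Tuple[Hashable, int]) -> Pair:
        rep, idx = entry
        return (rep, orderings[rep][idx])

    def binary_insert(entry: Tuple[Hashable, int]) -> None:
        lo, hi = 0, len(frontier)
        while lo < hi:
            mid = (lo + hi) // 2
            if o_quad(pair_at(frontier[mid]), pair_at(entry)):
                lo = mid + 1
            else:
                hi = mid
        frontier.insert(lo, entry)

    for rep, ordering in orderings.items():
        if ordering:
            binary_insert((rep, 0))
    while frontier:
        rep, idx = frontier.pop(0)  # closest remaining pair overall
        merged.append(pair_at((rep, idx)))
        if idx + 1 < len(orderings[rep]):
            binary_insert((rep, idx + 1))
    return merged


# Toy example (an assumption): points on a line with D(x, y) = |x - y|.
points = [0.0, 0.1, 0.35, 0.6, 0.9]
reps = [0.0, 0.9]
quad_calls = [0]  # mutable counter for the number of quad queries


def toy_quad(a: Pair, b: Pair) -> bool:
    quad_calls[0] += 1
    return abs(a[0] - a[1]) <= abs(b[0] - b[1])


toy_orderings = {r: sorted(points, key=lambda x, r=r: abs(r - x)) for r in reps}
merged_pairs = merge_orderings(toy_orderings, toy_quad)
```

Each element is inserted once into a frontier of length at most $|R|$, so the sketch uses $O(\log(|R|))$ quad queries per insertion, matching the $O(|R|N\log(|R|))$ count stated above.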

The following theorem summarizes this combined result and states that given a set of representatives $R$ , $\mathcal{D}_{R}^{\prime}$ can be constructed with $O(|R|N\log(N))$ queries to $\mathsf{O}_{\mathsf{TRIPLET}}$ , $O(|R|N\log(|R|))$ queries to $\mathsf{O}_{\mathsf{QUAD}}$ and $O(\log(|R|N))$ queries to $\mathsf{O}_{\mathsf{REAL}}$ .

Theorem 15.

Given a set of representatives $R$ and access to $\mathsf{O}_{\mathsf{REAL}},\mathsf{O}_{\mathsf{TRIPLET}},$ and $\mathsf{O}_{\mathsf{QUAD}}$, an $\alpha$-submetric which preserves distances (up to the additive error) from the set of representatives $R$ can be constructed from $O(\log(|R|N))$ queries to $\mathsf{O}_{\mathsf{REAL}}$, $O(|R|N\log(N))$ queries to $\mathsf{O}_{\mathsf{TRIPLET}}$, and $O(|R|N\log(|R|))$ queries to $\mathsf{O}_{\mathsf{QUAD}}$.

The proof of Theorem 15 follows from a straightforward analysis of list merging, detailed in Algorithm 5.

Proof.

A representative consistent ordering for each $r\in R$ can be constructed via Algorithm 3 in $O(N\log(N))$ queries to $\mathsf{O}_{\mathsf{TRIPLET}}$ for each representative, $O(|R|N\log(N))$ queries to $\mathsf{O}_{\mathsf{TRIPLET}}$ total.

Algorithm 5 given such a set of orderings constructs $\mathcal{D}_{R}^{\prime}$ , an $\alpha-$ submetric which preserves distances up to the additive error from each representative $r\in R$ . The modified version of $\mathsf{OrderingToSubmetric}$ still requires $O(\log(|R|N))$ queries to $\mathsf{O}_{\mathsf{REAL}}$ as the list length has increased to $|R|N$ . Thus all that remains to prove the theorem is to reason about the number of queries to $\mathsf{O}_{\mathsf{QUAD}}$ . Algorithm 5 makes $N|R|$ calls to $\mathsf{BinaryInsert}$ as each element in each ordering is inserted at most once into a list of length at most $|R|$ . Each $\mathsf{BinaryInsert}$ into a list of length $k$ requires $O(\log(k))$ queries to $\mathsf{O}_{\mathsf{QUAD}}$ as a comparator. Thus the total number of queries to $\mathsf{O}_{\mathsf{QUAD}}$ is $O(|R|N\log(|R|))$ . ∎

Summary

In this section, we have shown how to use $O(\log(N))$ real-valued queries and $O(N\log(N))$ triplet queries to construct a nontrivial representative submetric for a fixed universe of $N$ individuals. When learning multiple representative submetrics, we have also shown how to improve the naive linear dependence of the number of real-valued queries on the number of representatives to logarithmic by supplementing with $O(|R|N\log(|R|))$ quad queries and $O(|R|N\log(N))$ triplet queries.

In the next section (5), we will show how to construct generalizable representative submetrics, i.e., how to predict what human fairness arbiters “would have said” on unseen examples. In the following section (6), we tackle how to choose a small set of representatives to improve nontriviality guarantees.

5 Generalization

In this section, we consider the problem of learning how to predict the human fairness arbiter’s judgments on unseen samples from $\mathcal{U}$ . (We consider how to pick the set of representatives in Section 6.) In particular, we will consider the problem of generalizing a representative submetric to fresh samples from $\mathcal{U}$ . Our goal is to construct efficient learners for submetrics as in Valiant’s Probably Approximately Correct (PAC) model of learning [18]. However, we do not want to be too prescriptive about the submetric concept class, particularly about the representation of elements in the universe. Instead, we will make an assumption about the learnability of threshold functions (Definition 18) and construct learning procedures for submetrics using threshold functions as building blocks without any additional direct access to labeled or unlabeled samples from $\mathcal{U}$ .

We restate the formal definition of an efficient submetric learner below. (Footnote 19: This goal of learning a hypothesis that, with high probability, does not exceed distances on most pairs in $\mathcal{U}\times\mathcal{U}$ is almost identical to Rothblum and Yona’s notion of “approximately metric-fair” [15].)

Definition 9 (Efficient Submetric Learner - Restatement).

We say that a learning procedure is an efficient $\alpha$-submetric learner if for any error and failure probability parameters $\varepsilon,\delta\in(0,1]$, given access to labeled examples of $\mathcal{D}(r,x\sim\mathcal{U})$, with probability at least $1-\delta$ over the randomness of the sampling and the learning procedure, it produces a hypothesis $h_{r}:U\times U\rightarrow[0,1]$ such that

$$\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[h_{r}(x,y)-\mathcal{D}(x,y)\geq\alpha]\leq\varepsilon$$

in time $O(poly(\frac{1}{\varepsilon},\frac{1}{\delta}))$ .

In our formal definition, we are again purposefully vague about the type of the labeled examples, and all of our subsequent constructions will use labeled examples for the threshold functions and set $\alpha$ corresponding to the maximum difference between adjacent thresholds. Whenever we use a set of ordered thresholds, $\mathcal{T}$ , we will write $\alpha_{\mathcal{T}}=\max_{t_{i}\in\mathcal{T}}\{t_{i}-t_{i-1}\}$ to denote the maximum difference between adjacent thresholds.
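To make $\alpha_{\mathcal{T}}$ concrete, the sketch below computes it for a small threshold set and rounds a distance down to the nearest threshold, in the spirit of the threshold consistent underestimator $f_{r}^{\mathcal{T}}$. The threshold values and the helper names `alpha_of` and `round_down` are assumptions chosen for illustration; the thresholds are dyadic so that the floating-point arithmetic is exact.

```python
from bisect import bisect_right
from typing import Sequence


def alpha_of(thresholds: Sequence[float]) -> float:
    """Maximum gap between adjacent thresholds (sorted ascending, length >= 2)."""
    return max(b - a for a, b in zip(thresholds, thresholds[1:]))


def round_down(thresholds: Sequence[float], distance: float) -> float:
    """Largest threshold t_i with t_i <= distance."""
    idx = bisect_right(thresholds, distance) - 1
    if idx < 0:
        raise ValueError("distance lies below the smallest threshold")
    return thresholds[idx]


# Hypothetical ordered threshold set containing 0 and covering [0, 1].
example_thresholds = [0.0, 0.25, 0.5, 0.625, 1.0]
alpha_T = alpha_of(example_thresholds)  # largest adjacent gap, here 1.0 - 0.625
```

With these thresholds a true distance of $0.55$ is rounded down to $0.5$, so the rounding loses less than $\alpha_{\mathcal{T}}$, and $|f(a)-f(b)|$ can exceed $|a-b|$ by at most $\alpha_{\mathcal{T}}$.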

In the remainder of this section, we first formalize the relatively weak assumption that there exists a set of efficient learners for a set of binary threshold functions (Definition 18). Second, we show the construction of an efficient learner for a submetric, with additive error dependent on the set of thresholds, based on voting by the hypotheses produced by each threshold function learner. Finally, we show how to combine a set of learners for submetrics as a first step to improving nontriviality and as a warm-up for Section 6.

5.1 Learnability of threshold functions

Assumption 1 (restated below for clarity) states that for every representative, there exists a set of thresholds and a learner for each threshold in the set which, with high probability, produces an accurate hypothesis for the threshold function for each threshold in the set which generalizes to unseen samples. (Footnote 20: In the remainder of this work, when we state that a learner produces a hypothesis with high probability, we will always take the probability over the randomness of the sampling and the learning procedure.) We first formally define a threshold function, which is a binary indicator of whether a particular element $u\in U$ is within distance $t$ of $r$ for a threshold $t\in[0,1]$ and a representative $r$, and then restate the learnability assumption.

Definition 18 (threshold function).

A threshold function $T_{t}^{r}(u):U\rightarrow\{0,1\}$ is defined

[TABLE]

with respect to a representative $r$ and metric $\mathcal{D}$ .

Assumption 1 (Restatement).

Given a metric $\mathcal{D}$ and a representative $r$ , there exists a set of thresholds $\mathcal{T}$ such that

1. $t\in[0,1]$ for all $t\in\mathcal{T}$,

2. $0\in\mathcal{T}$,

3. $\alpha_{\mathcal{T}}=\max_{i\in[|\mathcal{T}|-1]}(t_{i+1}-t_{i})$,

4. $|\mathcal{T}|=O(1)$,

and for every $t\in\mathcal{T}$ there exists an efficient learner $L_{t}^{r}(\varepsilon_{t},\delta_{t})$ which for all $\varepsilon_{t},\delta_{t}\in(0,1]$ , with probability at least $1-\delta_{t}$ over the randomness of the sample and the learning procedure produces a hypothesis $h_{t}^{r}$ such that

$$\Pr_{u\sim\mathcal{U}}[h_{t}^{r}(u)\neq T_{t}^{r}(u)]\leq\varepsilon_{t}$$

in time $O(poly(\frac{1}{\varepsilon_{t}},\frac{1}{\delta_{t}}))$ with access to labeled samples of $T_{t}^{r}(u\sim\mathcal{U})$ for any distribution $\mathcal{U}$ over the universe. That is, the concept class $T_{t}^{r}$ is efficiently learnable for all $t\in\mathcal{T}$ .

As noted before, we are intentionally vague about the representation of $U$ because we tuck any issues of representation away into the assumption of learnability of threshold functions. All of our subsequent constructions will only interact with samples from $\mathcal{U}$ through the learners for the threshold functions, and as such, the representation can be completely abstracted away. In Assumption 1, the choice of $r$ is also not explicitly specified. In this work, we will take Assumption 1 to apply to every $r\in U$ .

5.2 Constructing submetric learners from threshold learners

Given Assumption 1, our next step is to determine how to combine the threshold learners into a learner for the threshold consistent underestimator for $r$ with respect to $\mathcal{T}$ which can be post-processed into an $\alpha_{\mathcal{T}}$ submetric.

We first show how to combine a set of hypotheses for threshold functions into a hypothesis for a threshold consistent underestimator. The $\mathsf{LinearVote}$ mechanism, redefined below, takes in a set of hypotheses for the thresholds and outputs the threshold that the most hypotheses agree with.

Definition 10 ( $\mathsf{LinearVote}$ - Restatement).

Given an ordered set of thresholds, $\mathcal{T}=\{t_{1},t_{2},\ldots,t_{|T|}\}$ , and a set of hypotheses $H_{\mathcal{T}}^{r}=\{h_{t_{1}}^{r},h_{t_{2}}^{r},\ldots,h_{t_{|T|}}^{r}\}$ , one corresponding to each threshold function, $\mathsf{LinearVote}$ outputs the threshold that the most hypotheses agree with.

[TABLE]

$\mathsf{LinearVote}$ is equivalent to $f_{r}^{\mathcal{T}}$ (Definition 17) when all of the $h_{t_{i}}^{r}$ output the correct value.
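A voting rule of this kind can be sketched as follows. This is an illustration, not the paper's exact formula: it assumes each hypothesis outputs 1 when it predicts that $x$ lies strictly within distance $t_{j}$ of $r$ and 0 otherwise, it scores each candidate threshold by the number of hypotheses whose output matches what a point rounded down to that threshold would produce, and it breaks ties toward the smaller threshold. The encoding, the tie-breaking rule, and the name `linear_vote` are all assumptions.

```python
from typing import Callable, Hashable, Sequence


def linear_vote(
    thresholds: Sequence[float],
    hypotheses: Sequence[Callable[[Hashable], int]],
    x: Hashable,
) -> float:
    """Return the threshold that the most hypotheses agree with.

    Assumed encoding: hypotheses[j](x) == 1 means "x is predicted to lie
    strictly within distance thresholds[j] of the representative".
    Thresholds are assumed sorted ascending.
    """
    votes = [h(x) for h in hypotheses]
    best_idx, best_agree = 0, -1
    for i in range(len(thresholds)):
        # If the rounded-down distance were thresholds[i], an error-free
        # hypothesis j would output 0 for j <= i and 1 for j > i.
        agree = sum(1 for j, v in enumerate(votes) if v == (1 if j > i else 0))
        if agree > best_agree:  # strict '>' keeps the smaller threshold on ties
            best_idx, best_agree = i, agree
    return thresholds[best_idx]


# Toy check with error-free hypotheses built from made-up distances.
vote_thresholds = [0.0, 0.25, 0.5, 0.75]
toy_distance = {"a": 0.1, "b": 0.6, "c": 0.75}
exact_hypotheses = [
    (lambda x, t=t: int(toy_distance[x] < t)) for t in vote_thresholds
]
```

When every hypothesis is correct, the vote pattern is a run of zeros followed by a run of ones, and the winning candidate is the largest threshold not exceeding the true distance, which is the rounding performed by $f_{r}^{\mathcal{T}}$.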

Algorithm 6 takes as input a set of thresholds and learners for those thresholds, (1) calls these learners with appropriately scaled parameters, and (2) combines the resulting hypotheses via $\mathsf{LinearVote}$ to produce a hypothesis $h_{r}$ for the $\alpha_{\mathcal{T}}$-submetric $\mathcal{D}_{r}^{\prime}(x,y):=|f_{r}^{\mathcal{T}}(x)-f_{r}^{\mathcal{T}}(y)|$. (Footnote 21: Algorithms 6, 7, and 8 are all invoked with error and failure parameters. To keep the parameter names clear, we refer to $\varepsilon_{t}$ and $\delta_{t}$ for the threshold function learners; $\varepsilon_{r}$ and $\delta_{r}$ for the single representative submetric learner Algorithm 6; $\varepsilon_{R}$ and $\delta_{R}$ for the combined representative set submetric learner Algorithm 7; and $\varepsilon$ and $\delta$ for the complete learning procedure Algorithm 8. Each time a learning procedure is invoked, we specify the relevant parameters using these variables. In Algorithms 6, 7, and 8, we implicitly assume access to labeled samples of $T_{t}^{r}(u\sim\mathcal{U})$. Sample complexity is explicitly analyzed in Theorem 20.)

Theorem 16 states that given a set of learners as specified in Assumption 1, Algorithm 6 will produce a hypothesis for the $\alpha_{\mathcal{T}}-$ submetric $\mathcal{D}_{r}^{\prime}(x,y):=|f_{r}^{\mathcal{T}}(x)-f_{r}^{\mathcal{T}}(y)|$ with probability at least $1-\delta_{r}$ with error at most $\varepsilon_{r}$ .

Theorem 16.

Under Assumption 1, there exists an efficient $\alpha_{\mathcal{T}}$-submetric learner. That is, given a representative $r$, a distance metric $\mathcal{D}$, a distribution $\mathcal{U}$ over the universe, and a set $\mathcal{T}$ of a constant number of thresholds, if there exists a set of efficient learners $L=\{L_{t_{i}\in\mathcal{T}}^{r}\}$ as specified in Assumption 1, then there exists an efficient learner which produces a hypothesis $h_{r}:U\times U\rightarrow[0,1]$ such that $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[|h_{r}(x,y)-\mathcal{D}_{r}^{\prime}(x,y)|\geq\alpha_{\mathcal{T}}]\leq\varepsilon_{r}$ with probability at least $1-\delta_{r}$ for all $\varepsilon_{r},\delta_{r}\in(0,1]$ in time $O(poly(\frac{1}{\varepsilon_{r}},\frac{1}{\delta_{r}},|\mathcal{T}|))$, where $\mathcal{D}_{r}^{\prime}(x,y):=|f_{r}^{\mathcal{T}}(x)-f_{r}^{\mathcal{T}}(y)|$ for all $(x,y)\in U\times U$.

Proof.

Consider the construction of $h_{r}(x)$ as specified in Algorithm 6. The failure probability of Algorithm 6 is $\delta_{r}$ by union bound, as the procedure only fails if at least one of the learners in $L$ failed to produce an $\frac{\varepsilon_{r}}{2|\mathcal{T}|}$-good hypothesis for $T_{t_{i}}^{r}$. As each learner in $L$ runs in time $O(poly(\frac{1}{\varepsilon_{t}},\frac{1}{\delta_{t}}))$ by Assumption 1, running all $|\mathcal{T}|$ learners takes time $O(poly(\frac{1}{\varepsilon_{r}},\frac{1}{\delta_{r}},|\mathcal{T}|))$, as Algorithm 6 invokes each $L_{t_{i}}^{r}$ with $\varepsilon_{t},\delta_{t}$ scaled by a factor of $\frac{1}{|\mathcal{T}|}$. Recall that to satisfy the definition of an efficient learner (Definition 9), Algorithm 6 must run in time $O(poly(\frac{1}{\varepsilon_{r}},\frac{1}{\delta_{r}}))$. Given that $|\mathcal{T}|$ is constant, this requirement is satisfied. With respect to accuracy, notice that $h_{r}(x,y)$ only outputs a value more than $\alpha_{\mathcal{T}}$ away from $\mathcal{D}_{r}^{\prime}(x,y)$ if at least one of $h_{t_{i}}^{r}(x)$ or $h_{t_{i}}^{r}(y)$ is in error. Assuming all of the $L_{t_{i}}^{r}$ output good hypotheses, the probability that at least one of $h_{t_{i}}^{r}(x)$ or $h_{t_{i}}^{r}(y)$ is in error is at most $2\sum_{t_{i}\in\mathcal{T}}\frac{\varepsilon_{r}}{2|\mathcal{T}|}=\varepsilon_{r}$ by union bound. Thus, Algorithm 6 satisfies the conditions of the theorem. ∎
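Written out, the two union bounds used in this proof take the following form. The per-learner error $\frac{\varepsilon_{r}}{2|\mathcal{T}|}$ is the value named in the proof; the per-learner failure parameter $\frac{\delta_{r}}{|\mathcal{T}|}$ is an assumption consistent with the stated scaling by $\frac{1}{|\mathcal{T}|}$.

```latex
\begin{align*}
\Pr\bigl[\text{some } L_{t_i}^{r} \text{ fails}\bigr]
  &\leq \sum_{t_i \in \mathcal{T}} \frac{\delta_r}{|\mathcal{T}|}
   = \delta_r, \\
\Pr_{x,y \sim \mathcal{U}\times\mathcal{U}}
  \bigl[\exists\, t_i \in \mathcal{T}:\
        h_{t_i}^{r}(x) \neq T_{t_i}^{r}(x)
        \ \text{or}\
        h_{t_i}^{r}(y) \neq T_{t_i}^{r}(y)\bigr]
  &\leq \sum_{t_i \in \mathcal{T}}
        \Bigl(\frac{\varepsilon_r}{2|\mathcal{T}|}
            + \frac{\varepsilon_r}{2|\mathcal{T}|}\Bigr)
   = \varepsilon_r .
\end{align*}
```

Neither bound requires the errors of different threshold learners to be independent, since the union bound holds for arbitrary events.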

Two key properties of the proof, which will be important in our consideration of query complexity to generate the labeled samples (Theorem 20), are (1) each of the threshold function learners learns on the same distribution $\mathcal{U}$ , and (2) no independence of errors between the threshold function learners is assumed.

As in the previous section, combining information from multiple representatives can improve nontriviality guarantees. Algorithm 7 takes as input a set of learners for representative submetrics for a set of representatives $R\subseteq U$ (for example, learners based on Algorithm 6) and produces a hypothesis $h_{R}$ based on the $\mathsf{maxmerge}$ of the hypotheses produced by the input learners.

Theorem 17 states that given a set of learners for threshold functions for a set of representatives (Assumption 1), Algorithm 7 produces a hypothesis $h_{R}$ with probability at least $1-\delta_{R}$ with error at most $\varepsilon_{R}$ which approximates $\mathcal{D}_{R}^{\prime}(x,y):=\mathsf{maxmerge}(\{\mathcal{D}_{r}^{\prime}|r\in R\},x,y)$, where the $\mathcal{D}_{r}^{\prime}$ are based on threshold consistent underestimators. In contrast to the statement of Theorem 16, which does not explicitly address nontriviality, Theorem 17 introduces a nontriviality guarantee which relies on the fraction of distances that exceed the contraction of the consistent underestimators. This additional requirement stems from the fact that consistent underestimators with contraction in the distances between the representative and other elements in $U$ will not entirely preserve the original distance. (Footnote 22: Note that the choice of $2\alpha_{\mathcal{T}}$ in order to preserve $\frac{1}{2}$ of the original distance is somewhat arbitrary. In Section 6 we give a parametrizable guarantee.) Roughly speaking, for the nontriviality properties to hold, we need at least a $p$-fraction of distances in the distribution to be large enough that an $\alpha_{\mathcal{T}}$ contraction of the original distance is insignificant. As we have not yet specified how representatives are chosen or how those choices preserve distances, we assume that all pairs with sufficiently large distances include at least one representative. We explicitly note the dependence on $|R|$ in the theorem statement as a placeholder until the required size for $|R|$ is established (Lemma 6).

Theorem 17.

Given a distance metric $\mathcal{D}$ and a distribution $\mathcal{U}$ over the universe, if there exists a set of thresholds $\mathcal{T}$ and efficient learners $L=\{L_{t_{i}\in\mathcal{T}}^{r}\}$ as in Assumption 1, and weight $p$ of pairs of elements in $\mathcal{U}\times\mathcal{U}$ include at least one representative $r\in R$ and have distance greater than $2\alpha_{\mathcal{T}}$, then there exists an efficient learner which produces a hypothesis $h_{R}$ with probability greater than $1-\delta_{R}$ such that

1. $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[h_{R}(x,y)>\mathcal{D}(x,y)+\alpha]\leq\varepsilon_{R}$, and

2. $h_{R}$ is $(p-\varepsilon_{R},\frac{1}{2})$-nontrivial for $\mathcal{U}$.

The learner runs in time $O(poly(|\mathcal{T}|,|R|,\frac{1}{\varepsilon_{R}},\frac{1}{\delta_{R}}))$ for all $\varepsilon_{R}$ , $\delta_{R}\in(0,1]$ .

That is, under Assumption 1, if weight $p$ of pairs in $\mathcal{U}\times\mathcal{U}$ which include at least one representative in $R$ have distance greater than $2\alpha_{\mathcal{T}}$ , then there exists an efficient $(p-\varepsilon_{R},\frac{1}{2})-$ nontrivial $\alpha_{\mathcal{T}}-$ submetric learner.

Proof.

Consider Algorithm 7 parametrized with $L=\{L_{r}\}$ constructed via Algorithm 6 operating on $L=\{L_{t_{i}\in\mathcal{T}}^{r}\}$ .

Running time. Algorithm 7 makes $|R|$ calls to Algorithm 6. Algorithm 6 runs in time $O(poly(|\mathcal{T}|,\frac{1}{\varepsilon_{r}},\frac{1}{\delta_{r}}))$ , where $\varepsilon_{r}=\frac{\varepsilon_{R}}{|R|}$ and $\delta_{r}=\frac{\delta_{R}}{|R|}$ are the error and failure probability parameters with which Algorithm 7 invokes Algorithm 6. Thus Algorithm 7 runs in time $O(poly(|R|,|\mathcal{T}|,\frac{1}{\varepsilon_{R}},\frac{1}{\delta_{R}}))$ .

Failure probability. We say that Algorithm 7 has “failed” if at least one of the $L_{r}$ fails to produce an $\frac{\varepsilon_{R}}{|R|}$-good hypothesis $h_{r}$. Since each $L_{r}$ is invoked with $\delta_{r}=\frac{\delta_{R}}{|R|}$, the failure probability of Algorithm 7 is $\leq\sum_{r\in R}\frac{\delta_{R}}{|R|}=\delta_{R}$ by union bound.

Overestimate error probability. Suppose that all of the learners in $L$ produce a good candidate $h_{r}$ with error probability $\frac{\varepsilon_{R}}{|R|}$ or less. Now, consider the probability that the result of $\mathsf{maxmerge}(H_{R},u,v)$ is an over-estimate by more than $\alpha_{\mathcal{T}}$ . This can only happen if at least one of the $h_{r}$ is in error by more than $\alpha_{\mathcal{T}}$ . Thus by union bound, the probability of over-estimate is at most $\varepsilon_{R}.$

Nontriviality. Each of the $h_{r}$ has additive and subtractive error at most $\alpha_{\mathcal{T}}$, so for any $r\in R$ and $u\in U$ such that $\mathcal{D}(r,u)\geq 2\alpha_{\mathcal{T}}$, at least half of the original distance will be preserved. Thus, making the worst case assumption that all $\varepsilon_{R}$ weight of errors result in distance underestimates on the relevant pairs, the metric learned is $(p-\varepsilon_{R},\frac{1}{2})$-nontrivial for $\mathcal{U}$. (Footnote 23: We could omit the error probability $\varepsilon_{R}$ in the statement of nontriviality and leave implicit that the nontriviality guarantees “stack” with the hypothesis error probability. However, this is somewhat misleading as we cannot assume that the errors of the hypothesis are randomly distributed.) ∎
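The combination step of Algorithm 7 can be illustrated with a small sketch. It assumes that $\mathsf{maxmerge}$ takes the pointwise maximum of the per-representative hypotheses, as its name suggests, and it replaces the learned $h_{r}$ with exact rounded-down distances on a made-up line metric so that the two guarantees can be checked directly. All function names, the toy metric, and the threshold grid are assumptions.

```python
from bisect import bisect_right
from typing import Callable, Sequence, Tuple

Hypothesis = Callable[[float, float], float]


def per_representative_budget(
    eps_R: float, delta_R: float, num_reps: int
) -> Tuple[float, float]:
    """Parameters passed to each representative learner: eps_R/|R|, delta_R/|R|."""
    return eps_R / num_reps, delta_R / num_reps


def representative_hypothesis(
    rep: float,
    thresholds: Sequence[float],
    metric: Callable[[float, float], float],
) -> Hypothesis:
    """Error-free stand-in for a learned h_r: |f(x) - f(y)| with f rounding down."""
    def f(x: float) -> float:
        return thresholds[bisect_right(thresholds, metric(rep, x)) - 1]
    return lambda x, y: abs(f(x) - f(y))


def max_merge(hypotheses: Sequence[Hypothesis]) -> Hypothesis:
    """Assumed form of maxmerge: the pointwise maximum over representatives."""
    return lambda x, y: max(h(x, y) for h in hypotheses)


# Toy setting (an assumption): points on [0, 1] with D(x, y) = |x - y|.
def line_metric(x: float, y: float) -> float:
    return abs(x - y)


grid_thresholds = [k / 8 for k in range(9)]  # adjacent gap alpha_T = 0.125
toy_alpha = 0.125
toy_reps = [0.0, 0.5, 1.0]
toy_points = [i / 20 for i in range(21)]
h_R = max_merge(
    [representative_hypothesis(r, grid_thresholds, line_metric) for r in toy_reps]
)
eps_r, delta_r = per_representative_budget(0.06, 0.03, len(toy_reps))
```

On this toy instance $h_{R}$ never exceeds the true distance by more than $\alpha_{\mathcal{T}}$, and for any pair containing a representative at distance at least $2\alpha_{\mathcal{T}}$ it retains at least half of the true distance, mirroring the two parts of the theorem for error-free hypotheses.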

Notice that in the analysis of the error and failure probability for Algorithm 7, there is no particular requirement that the learners used to produce $h_{r}$ for each representative be based on thresholds. The only requirement is that the learners produce $h_{r}$ such that $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[|h_{r}(x,y)-\mathcal{D}(x,y)|\geq\alpha_{\mathcal{T}}]\leq\varepsilon_{r}$ with probability at least $1-\delta_{r}$. Thus in settings with alternative mechanisms to produce such $h_{r}$, they can be substituted without compromising the result. We state the following corollary to formalize this intuition.

Corollary 17.1.

Given a distance metric $\mathcal{D}$ , and a distribution $\mathcal{U}$ over the universe, if there exist a set of efficient learners $L=\{L_{r\in R}\}$ such that, given access to labeled samples, $L_{r}$ produces a hypothesis $h_{r}$ such that $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[|h_{r}(x,y)-\mathcal{D}(x,y)|\geq\alpha]\leq\varepsilon_{r}$ with probability at least $1-\delta_{r}$ in time $O(poly(\frac{1}{\varepsilon_{r}},\frac{1}{\delta_{r}}))$ and weight $p$ of pairs of elements in $\mathcal{U}\times\mathcal{U}$ include at least one representative $r\in R$ and have distance greater than $2\alpha_{\mathcal{T}}$ , then there exists an efficient learner which produces a hypothesis $h_{R}$ with probability greater than $1-\delta_{R}$ such that

1. $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[h_{R}(x,y)>\mathcal{D}(x,y)+\alpha]\leq\varepsilon_{R}$, and

2. $h_{R}$ is $(p-\varepsilon_{R},\frac{1}{2})$-nontrivial for $\mathcal{U}$.

The learner runs in time $O(poly(|R|,\frac{1}{\varepsilon_{R}},\frac{1}{\delta_{R}}))$ for all $\varepsilon_{R}$ , $\delta_{R}\in(0,1]$ .

As discussed in Section 3 (Proposition 9), a submetric can be postprocessed to reduce the additive error. Corollary 17.2 below reflects the result of postprocessing, in particular the impact on the distance distribution requirements.

Corollary 17.2.

Given a distance metric $\mathcal{D}$ , and a distribution $\mathcal{U}$ over the universe, if there exist a set of efficient learners $L=\{L_{r\in R}\}$ such that, given access to labeled samples, $L_{r}$ produces a hypothesis $h_{r}$ such that $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[|h_{r}(x,y)-\mathcal{D}(x,y)|\geq\alpha]\leq\varepsilon_{r}$ with probability at least $1-\delta_{r}$ in time $O(poly(\frac{1}{\varepsilon_{r}},\frac{1}{\delta_{r}}))$ and weight $p$ of pairs of elements in $\mathcal{U}\times\mathcal{U}$ include at least one representative $r\in R$ and have distance greater than $2\alpha_{\mathcal{T}}+\alpha$ , then there exists an efficient learner which produces a hypothesis $h_{R}$ with probability greater than $1-\delta_{R}$ such that

1. $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[h_{R}(x,y)>\mathcal{D}(x,y)]\leq\varepsilon_{R}$, and

2. $h_{R}$ is $(p-\varepsilon_{R},\frac{1}{2})$-nontrivial for $\mathcal{U}$.

The learner runs in time $O(poly(|R|,\frac{1}{\varepsilon_{R}},\frac{1}{\delta_{R}}))$ for all $\varepsilon_{R}$ , $\delta_{R}\in(0,1]$ .

Theorem 17 is the first step to learning submetrics which generalize to unseen samples, but the limited nontriviality guarantee is potentially problematic. The next section considers how the choice of representatives and the properties of the metric on the distribution $\mathcal{U}$ impact nontriviality.

6 Choosing Representatives

There are two approaches one might take to improve the nontriviality guarantee of Theorem 17: (1) develop specialized strategies for combining representative submetrics which depend on the structure of the metric, or (2) characterize generic randomized strategies. We briefly consider the first approach below, and then devote the remainder of the section to the second approach.

6.1 Metric structure dependent strategies.

First, one could propose a representative selection mechanism tailored to a particular problem setting. This is a very reasonable strategy if some structure of the metric is known which can be exploited to better combine the representative submetrics, or there are specific distance preservation properties other than nontriviality which are deemed desirable.

For example, suppose that we had some understanding that the underlying metric we wish to learn is Euclidean distance in two dimensions. Even without knowing the features relevant to each dimension, we can propose a generic “representative GPS” submetric combination procedure. We could choose $3$ representatives (with some additional conditions to ensure they form a basis) and use Algorithm 6 to learn a representative submetric $\mathcal{D}_{r}$ for each representative with reasonably small contraction which generalizes to unseen samples. These distances can be used to build up a $2$-dimensional embedding of the representative points and any new points observed. Notice that each new point can have at most one valid position in the embedding depending on its distance from the $3$ representatives. (Footnote 24: This assumes that all distances are exact; there is some slack when the distance from each representative is an underestimate. We omit a full treatment of this problem in this work, both in terms of number of dimensions and approximation of representative distance, as it is not inherently important to understanding the motivation for setting specific strategies. Briefly, when distances from each representative are not exact it is possible that the region of possible locations is not contiguous for a new point. In terms of computing distances between two points of uncertain location, this can be “fixed” from an overestimate perspective by taking the minimum of the distances between all possible locations, but at the cost of weaker nontriviality guarantees.) Thus for any pair $u,v\in U\times U$ we can compute their distance based on their relative positions from the set of representatives with error probability proportional to the error of our hypotheses for $\{\mathcal{D}_{r}\}$. Essentially, with a strong assumption on the form of the metric, we may be able to propose a representative submetric combination strategy which gives very good nontriviality guarantees.
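The “representative GPS” idea can be sketched for the idealized case in which all distances are exact and the metric really is Euclidean in two dimensions. The code below is a hypothetical illustration, not a procedure from the paper: it places three non-collinear representatives in the plane from their pairwise distances and then locates a new point from its three distances by solving the linear system obtained from subtracting circle equations. The function names and the coordinates in the toy example are assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def embed_representatives(d12: float, d13: float, d23: float) -> Tuple[Point, Point, Point]:
    """Place three representatives in the plane from pairwise distances.

    r1 goes to the origin, r2 onto the positive x-axis, and r3 into the upper
    half-plane. The embedding is unique only up to rotation and reflection.
    """
    x3 = (d12 ** 2 + d13 ** 2 - d23 ** 2) / (2 * d12)
    y3 = math.sqrt(max(d13 ** 2 - x3 ** 2, 0.0))
    return (0.0, 0.0), (d12, 0.0), (x3, y3)


def locate(reps: Sequence[Point], dists: Sequence[float]) -> Point:
    """Find the unique point at the given exact distances from three reps."""
    (x1, y1), (x2, y2), (x3, y3) = reps
    d1, d2, d3 = dists
    # Subtracting the circle equation for r1 from those for r2 and r3
    # leaves two linear equations in the unknown coordinates.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1 ** 2 - d2 ** 2 + x2 ** 2 - x1 ** 2 + y2 ** 2 - y1 ** 2
    b2 = d1 ** 2 - d3 ** 2 + x3 ** 2 - x1 ** 2 + y3 ** 2 - y1 ** 2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("representatives are collinear")
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det)


# Toy example: hidden "true" coordinates that the procedure never sees directly.
true_reps = [(0.1, 0.2), (0.7, 0.3), (0.4, 0.9)]
p, q = (0.5, 0.5), (0.2, 0.8)
embedded = embed_representatives(
    math.dist(true_reps[0], true_reps[1]),
    math.dist(true_reps[0], true_reps[2]),
    math.dist(true_reps[1], true_reps[2]),
)
p_hat = locate(embedded, [math.dist(r, p) for r in true_reps])
q_hat = locate(embedded, [math.dist(r, q) for r in true_reps])
recovered = math.dist(p_hat, q_hat)
```

Because the embedding differs from the hidden coordinates only by a rigid motion, the distance between the two located points matches the true distance between $p$ and $q$; with underestimated distances this exactness is lost, as the footnote above discusses.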

6.2 Random representatives

When little or no information is known about the structure of the metric, or the known structure does not lend itself to a simple representative selection strategy, choosing a set of representatives at random is a reasonable alternative. When a set of representatives is chosen at random, a key component of the argument for how well the set will preserve distances is how distances between pairs are distributed in $\mathcal{U}\times\mathcal{U}$. For instance, if most of the weight in $\mathcal{U}\times\mathcal{U}$ is concentrated on pairs which are maximally distant, it may be more difficult to generate a set of good representatives compared with an alternative distribution over $U$ which results in a broader range of distances. A set of randomly chosen representatives will have certain nontriviality properties which depend on the more generic “density” properties of the metric and distribution $\mathcal{U}$, which we define below. In contrast to a setting-specific strategy, we don’t make any assumptions about how submetrics based on different representatives can be combined other than the universally applicable merges specified in Lemma 10.

We devote the remainder of this section to understanding the generalization properties of a random set of representatives. First, we formalize the definition of a $\gamma-$ net to capture the notion of a set of representatives “covering” a fraction of the distribution (subset of the universe) and prove several useful lemmas relating the size of $\gamma$ to the nontriviality properties of the submetric. Next, we formally define the density and diffusion parameters for a metric and distribution over the universe, and show how the nontriviality properties of $\gamma-$ nets relate to these parameters. Roughly speaking, density describes how closely packed elements are and characterizes how easy it is to construct a $\gamma$ -net, whereas diffusion describes how many distances are large enough to tolerate a contraction. Intuitively, more closely packed points (high density) will make it easier to find a representative closer to those points, but the tradeoff is additional small absolute distances between points (lower diffusion), which will be more impacted by the underestimate error of the net.[25] Finally, we characterize the number of randomly sampled representatives needed to form a $\gamma-$ net, given the density and diffusion characteristics of the metric and distribution, and use this to prove our main generalization result.

[25] For example, a universe with all points except one clustered together with distances less than $\alpha$ will be easy to cover with representatives at distance at most $\alpha$ from each element, but any contraction of size approximately $\alpha$ will destroy any distinguishing power between the clustered points.

6.3 Distance preservation via $\gamma-$ nets

The crux of the argument for nontriviality with random representatives is (1) a random sample of representatives is likely to be “close to” a significant portion of $\mathcal{U}$ , and (2) we can bound the magnitude of underestimates based on the distance from a representative for arbitrary metrics. Recall the definition of a $\gamma-$ net, which captures the notion of being “close to” or “covering” a set of elements.

Definition 6 (Restatement).

A set $R\subseteq U$ is said to form a $\gamma-$ net for a subset $V\subseteq U$ under $\mathcal{D}$ if for all balls of radius $\gamma$ (determined by $\mathcal{D}$ ) containing at least one element $v\in V$ , the ball also contains $r\in R$ .

To reason about nontriviality of a set of representatives which form a $\gamma-$ net, we derive a bound on the contraction of distances between pairs based on their distances to a representative. Intuitively, the distance between a representative and another element in the universe will be nearly identical to the distance between a close neighbor of the representative and that element. Lemma 3 (restated below) states that, given a representative $r$ , $\mathcal{D}_{r}$ underestimates $\mathcal{D}(u,v)$ by at most $\min\{2\mathcal{D}(r,u),2\mathcal{D}(r,v)\}$ .

Lemma 3 (Restatement).

For all $u,v\in U\backslash\{r\}$ , $\mathcal{D}(u,v)-\mathcal{D}_{r}(u,v)\leq\min\{2\mathcal{D}(r,u),2\mathcal{D}(r,v)\}$ , where $\mathcal{D}_{r}$ is the representative submetric for $r\in U$ .

Proof.

By construction, $\mathcal{D}_{r}(u,v)=|\mathcal{D}(r,v)-\mathcal{D}(r,u)|.$ Without loss of generality, assume $\mathcal{D}(r,u)\leq\mathcal{D}(r,v).$ By triangle inequality, $\mathcal{D}(r,v)\geq\mathcal{D}(u,v)-\mathcal{D}(r,u)$ , so $\mathcal{D}(u,v)-\mathcal{D}_{r}(u,v)\leq 2\mathcal{D}(r,u)$ . ∎
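The bound of Lemma 3 can be checked numerically. The following is a minimal sketch on a hypothetical universe of random points in the unit square, with Euclidean distance rescaled into $[0,1]$; the helper names `rep_submetric` and `lemma3_holds` are illustrative and do not appear in the paper.

```python
import math
import random

def rep_submetric(D, r, u, v):
    """Representative submetric D_r(u, v) = |D(r, u) - D(r, v)|."""
    return abs(D(r, u) - D(r, v))

def lemma3_holds(D, r, points, tol=1e-12):
    """Check 0 <= D(u,v) - D_r(u,v) <= min(2 D(r,u), 2 D(r,v)) for all pairs."""
    for u in points:
        for v in points:
            gap = D(u, v) - rep_submetric(D, r, u, v)
            bound = min(2 * D(r, u), 2 * D(r, v))
            if gap < -tol or gap > bound + tol:
                return False
    return True

def D(p, q):
    # Hypothetical metric: Euclidean distance scaled so the unit square maps into [0, 1].
    return math.dist(p, q) / math.sqrt(2)

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(40)]
r = (0.5, 0.5)
print(lemma3_holds(D, r, points))
```

The bound is tight in spirit: two points at the same distance from $r$ have $\mathcal{D}_{r}(u,v)=0$ no matter how far apart they are, so a single representative can lose the entire distance between them.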

Corollary 17.3.

For all $u,v\in U\backslash\{r\}$ , $\mathcal{D}(u,v)-\mathcal{D}_{r}^{\prime}(u,v)\leq\min\{2\mathcal{D}(r,u),2\mathcal{D}(r,v)\}+\alpha$ , where $\mathcal{D}_{r}^{\prime}$ is the consistent underestimator representative submetric for $r\in U$ with maximum contraction $\alpha$ .

Lemma 3 is very useful for understanding the distance contractions for sets of representatives which form $\gamma-$ nets for $U$ , as every pair is close to at least one representative. Of course, forming a $\gamma-$ net for an arbitrary $\gamma$ isn’t enough on its own to give a good nontriviality guarantee.[26]

[26] For example, if all of the elements in $U$ are contained in two well separated balls of radius $\gamma$ , a $\gamma-$ net will preserve distances between pairs with one element in each ball well, but distances between pairs within the same ball may not be preserved. This issue is a significant motivation for defining nontriviality as a relative distance preservation guarantee, rather than an absolute maximum contraction. Notice that the absolute contraction in this case is potentially very small, only $2\gamma$ , but the relative contraction may be significantly higher for pairs contained in the same ball. Later applications seeking to use the submetric as constraints on a classifier will not be able to make nuanced decisions between elements in the same ball, which may be problematic for some settings.

6.4 Density and diffusion

To understand how representatives which form a $\gamma-$ net will preserve distances, we recall the definitions of density and diffusion below to characterize the relevant properties of the metric and distribution. The notion of $(\gamma,a,b)-$ dense is intended to capture the weight ( $a$ ) of elements that have a significant weight ( $b$ ) on their close (distance $\gamma$ ) neighbors under $\mathcal{U}$ as a way to characterize how likely it is that a randomly chosen representative will be $\gamma$ -close to a significant fraction of elements.

Definition 7 ( $(\gamma,a,b)-$ dense - Restatement).

Given a distribution $\mathcal{U}$ over the universe $U$ , a metric $\mathcal{D}:U\times U\rightarrow[0,1]$ is said to be $(\gamma,a,b)-$ dense for $\mathcal{U}$ if there exists a subset $A\subseteq U$ with weight $a$ under $\mathcal{U}$ such that for all $u\in A$

$\Pr_{v\sim\mathcal{U}}[\mathcal{D}(u,v)\leq\gamma]\geq b.$
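As a rough illustration of the density parameters, the sketch below estimates $a$ for given $\gamma$ and $b$ on a finite sample, treating the sample as a uniform distribution. It reads the definition as requiring that every $u\in A$ has at least weight $b$ of $\mathcal{U}$ within distance $\gamma$, which is an interpretation of the description above; the function name `dense_weight` is hypothetical.

```python
def dense_weight(points, D, gamma, b):
    """Fraction a of sample points u for which at least a b-fraction of the
    sample (u itself included) lies within distance gamma of u.
    Assumption: the sample stands in for a uniform distribution."""
    n = len(points)
    qualifying = 0
    for u in points:
        close = sum(1 for v in points if D(u, v) <= gamma)
        if close / n >= b:
            qualifying += 1
    return qualifying / n

# Hypothetical one-dimensional universe: three clustered points and one outlier.
pts = [0.0, 0.05, 0.1, 0.9]
dist = lambda x, y: abs(x - y)
print(dense_weight(pts, dist, gamma=0.15, b=0.5))  # the three clustered points qualify
print(dense_weight(pts, dist, gamma=0.15, b=0.8))  # no point has enough close neighbors
```

Raising $b$ for a fixed $\gamma$ can only shrink the qualifying set, which is the tradeoff between $a$ and $b$ discussed in the text.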

Figure 2 illustrates the tradeoff between $a$ and $b$ for a particular choice of $\gamma$ for $\mathcal{U}^{*}$ on an example universe in $\mathbb{R}^{2}$ .

In addition to density, we will also frequently consider the fraction of distances larger than a given constant. This allows us to reason about how much the contraction in the submetric will affect the distances preserved, as in the statement of Theorem 17. This notion is formalized as diffusion.

Definition 8 ( $(p,\zeta)-$ diffuse - Restatement).

Given a distribution $\mathcal{U}$ , a metric $\mathcal{D}$ is $(p,\zeta)-$ diffuse if the fraction of distances between pairs of elements in $\mathcal{U}\times\mathcal{U}$ greater than $\zeta$ is $p$ , i.e.

$\Pr_{u,v\sim\mathcal{U}\times\mathcal{U}}[\mathcal{D}(u,v)>\zeta]\geq p.$

Definition 8 is highly reminiscent of nontriviality (Definition 5) and we formally relate diffusion to nontriviality in Lemma 18. Notice that, although there are five parameters describing a metric and distribution across the two definitions, these parameters are highly related. We will generally consider distributions which are $(\gamma,a,b)-$ dense and $(p,\frac{2\gamma}{1-c})-$ diffuse. Although $\frac{2\gamma}{1-c}$ initially seems an arbitrary quantity, it indicates that a $p-$ fraction of pairs will have distances preserved by a factor of $c$ if the maximum contraction for those pairs is no more than $2\gamma$ . Thus the values of $\gamma$ and $c$ , which in turn dictate $p$ , $a$ , and $b$ , (assuming $\zeta=\frac{2\gamma}{1-c}$ ) can loosely be seen as a tradeoff between how many pairs will have distance preservation guarantees and how large the guarantees will be. In the case of the example in Figure 2, we could describe the uniform distribution as $(.88,.4)-$ diffuse, or $(.88,\frac{2\gamma}{1-c})-$ diffuse, where $c=\frac{1}{2}$ and $\gamma=0.1$ . That is, with contraction $2\gamma$ at least 88% of the pairs in $\mathcal{U}^{*}\times\mathcal{U}^{*}$ will have at least half of their distances preserved.
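The relationship between $\gamma$, $c$ and the diffusion threshold can be made concrete with a short sketch. It computes $\zeta=\frac{2\gamma}{1-c}$ and estimates the fraction of pairs with distance greater than $\zeta$ on a finite sample, assuming the sample stands in for a uniform distribution and that pairs are drawn independently (so a point may be paired with itself). The names `zeta_for` and `diffuse_fraction` are hypothetical.

```python
def zeta_for(gamma, c):
    """Distance threshold 2*gamma / (1 - c) above which an absolute
    contraction of 2*gamma still preserves a c fraction of the distance."""
    return 2 * gamma / (1 - c)

def diffuse_fraction(points, D, zeta):
    """Fraction of ordered pairs (u, v) in the sample with D(u, v) > zeta."""
    n = len(points)
    far = sum(1 for u in points for v in points if D(u, v) > zeta)
    return far / (n * n)

# Parameters from the example in the text: gamma = 0.1 and c = 1/2 give zeta = 0.4.
zeta = zeta_for(0.1, 0.5)

# Hypothetical one-dimensional universe, not the universe of Figure 2.
pts = [0.0, 0.05, 0.1, 0.9]
dist = lambda x, y: abs(x - y)
print(zeta, diffuse_fraction(pts, dist, zeta))
```

In this toy universe only pairs involving the outlier exceed the threshold, so the sample is far less diffuse than the Figure 2 example.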

6.4.1 Nontriviality properties of $\gamma-$ nets

Given the formalization of diffusion, we can now relate the magnitude of $\gamma$ to the nontriviality properties of the merged representative set submetric. Lemma 18 states that a set of representatives which form a $\gamma-$ net for $U$ will have nontriviality properties related to the diffusion properties of $\mathcal{D}$ .

Lemma 18.

If a set of representatives $R\subseteq U$ form a $\gamma-$ net for a universe $U$ and $\mathcal{D}$ is $(p,\frac{2\gamma}{1-c})-$ diffuse on $\mathcal{U}$ , then $\mathcal{D}_{R}$ is $(p,c)-$ nontrivial on $\mathcal{U}$ .

Proof.

Recall from the proof of Lemma 3 that the distance between a pair $\mathcal{D}_{r}(u,v)$ has contraction at most $\min\{2\mathcal{D}(r,v),2\mathcal{D}(r,u)\}$ . Thus, the distance between any pair of elements is contracted by at most $2\gamma$ . A $p$ fraction of distances between pairs are greater than $\frac{2\gamma}{1-c}$ , so an absolute contraction of $2\gamma$ for these elements yields a ratio of at least $\frac{2\gamma/(1-c)-2\gamma}{\frac{2\gamma}{1-c}}=1-\frac{2\gamma}{\frac{2\gamma}{1-c}}=c$ and thus a $2\gamma$ absolute contraction is at most a $c$ relative contraction for this set of elements. So we conclude that the max-merge of $\mathcal{D}_{r}$ for $r\in R$ is $(p,c)-$ nontrivial for $\mathcal{U}$ . ∎
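The argument of Lemma 18 can be exercised on a small example. The sketch below takes the max-merge of representative submetrics to be the pointwise maximum over representatives, and uses a simplified reading of a $\gamma-$ net in which every point is within distance $\gamma$ of some representative (both are assumptions made for illustration). It then checks that every pair with distance greater than $\frac{2\gamma}{1-c}$ keeps at least a $c$ fraction of its distance. All names and the sample universe are hypothetical.

```python
import math
import random

def max_merge(D, reps, u, v):
    """Max-merge of representative submetrics: max over r of |D(r,u) - D(r,v)|."""
    return max(abs(D(r, u) - D(r, v)) for r in reps)

def covering_radius(D, reps, points):
    """Smallest gamma such that every point is within gamma of some representative."""
    return max(min(D(r, u) for r in reps) for u in points)

def check_lemma18(D, reps, points, c, tol=1e-12):
    """Every pair farther apart than 2*gamma/(1-c) should satisfy D_R >= c * D."""
    gamma = covering_radius(D, reps, points)
    zeta = 2 * gamma / (1 - c)
    for u in points:
        for v in points:
            d = D(u, v)
            if d > zeta and max_merge(D, reps, u, v) < c * d - tol:
                return False
    return True

def D(p, q):
    # Hypothetical metric: Euclidean distance scaled into [0, 1] on the unit square.
    return math.dist(p, q) / math.sqrt(2)

rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(50)]
reps = [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]
print(covering_radius(D, reps, points), check_lemma18(D, reps, points, c=0.5))
```

The check says nothing about pairs below the threshold, which is why the guarantee is stated for a $p$ fraction of pairs rather than for all of them.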

Corollary 18.1 states that, in the case of consistent underestimators with $c_{max}=\alpha^{\prime}$ , accounting for the potential underestimate error in the diffusion parameter is sufficient to yield the same nontriviality guarantees as in Lemma 18. Corollary 18.1 follows from observing the maximum possible contraction due to the underestimation from the $\gamma-$ net placement and the underestimation of the consistent underestimators.

Corollary 18.1.

If a set of representatives $R\subseteq U$ form a $\gamma-$ net for a universe $U$ and $\mathcal{D}$ is $(p,\frac{2\gamma+\alpha^{\prime}}{1-c})-$ diffuse on $\mathcal{U}$ then $\mathcal{D}_{r}^{\prime}$ , produced from $\alpha-$ consistent underestimators with maximum contraction $\alpha^{\prime}$ , is $(p,c)-$ nontrivial for $\mathcal{U}$ .

Returning to the example universe from Figure 2, Lemma 18 implies that if we selected a set of representatives $R$ which formed a $0.1-$ net for the whole universe, then $\mathcal{D}_{R}$ produced from exact evaluations of $\mathcal{D}(r,u)$ for all $u\in U$ and $r\in R$ would be $(0.88,\frac{1}{2})-$ nontrivial for $\mathcal{U}^{*}$ . That is, $\mathcal{D}_{R}$ would preserve half of the original distance for almost $90\%$ of pairs in $\mathcal{U}^{*}\times\mathcal{U}^{*}$ .

We now recall and present the proof for Lemma 5, the weighted subset analog of Lemma 18, which states that if a set of representatives form a $\gamma-$ net for a subset of $U$ , then the nontriviality properties depend on the weight of that subset in $\mathcal{U}$ .

Lemma 5 (Restatement).

If a set of representatives $R\subseteq U$ form a $\gamma-$ net for weight $w$ of $\mathcal{U}$ and $\mathcal{D}$ is $(p,\frac{2\gamma}{1-c})-$ diffuse on $\mathcal{U}$ , then the submetric $\mathcal{D}_{R}$ is $(p^{\prime},c)-$ nontrivial for $\mathcal{U}$ , where $p^{\prime}\geq p-(1-w)^{2}$ .

Proof.

Consider the pairs in $\mathcal{U}\times\mathcal{U}$ which have distance at least $\frac{2\gamma}{1-c}$ . The total weight of such pairs in $\mathcal{U}\times\mathcal{U}$ is $p$ . Pairs with neither element in the net can have weight at most $(1-w)^{2}$ . Assuming the worst case scenario that all $(1-w)^{2}$ weight of pairs with neither element in the net are also pairs with distance at least $\frac{2\gamma}{1-c}$ , at least a $p^{\prime}\geq p-(1-w)^{2}$ weight in $\mathcal{U}\times\mathcal{U}$ have at least one element in the net and a distance of at least $\frac{2\gamma}{1-c}.$

By the same logic as in the proof of Lemma 18, pairs with distance at least $\frac{2\gamma}{1-c}$ have relative contraction at most $c$ if at least one member is in the $\gamma-$ net. Thus the $\mathsf{maxmerge}$ of the submetrics from representatives in $R$ is $(p^{\prime},c)-$ nontrivial for $p^{\prime}\geq p-(1-w)^{2}$ . ∎

Corollary 5.1 restates Lemma 5 in terms of consistent underestimators, accounting for the maximum contraction in the diffusion parameters.

Corollary 5.1.

If a set of representatives $R\subseteq U$ form a $\gamma-$ net for weight $w$ of $\mathcal{U}$ and $\mathcal{D}$ is $(p,\frac{2\gamma+\alpha^{\prime}}{1-c})-$ diffuse on $\mathcal{U}$ , then the $\alpha-$ submetric $\mathcal{D}_{r}^{\prime}$ , formed from $\alpha-$ consistent underestimators with maximum contraction $\alpha^{\prime}$ , is $(p^{\prime},c)-$ nontrivial for $\mathcal{U}$ , where $p^{\prime}\geq p-(1-w)^{2}$ .

The nontriviality guarantees of Lemmas 18 and 5 are conservative. They incorporate a worst-case assumption on the distribution of large distances in Lemma 5, and entirely ignore the exact distance preservation from the representatives in both Lemmas. Again, we stress that our goal in this section is to show the possibility of positive results, and we do not attempt to achieve optimal performance or guarantees.

Corollary 5.2 restates the Lemma directly in terms of the probability that at least one element in the pair sampled is covered by the $\gamma-$ net and the distance is greater than $\frac{2\gamma+\alpha^{\prime}}{1-c}$ in order to get a tighter characterization of nontriviality.

Corollary 5.2.

If a set of representatives $R\subseteq U$ form a $\gamma-$ net for a subset $V\subseteq U$ , and $\Pr_{u,v\sim\mathcal{U}\times\mathcal{U}}[(u\in V\lor v\in V)\wedge(\mathcal{D}(u,v)>\frac{2\gamma+\alpha^{\prime}}{1-c})]\geq p$ , then the $\alpha-$ submetric $\mathcal{D}_{r}^{\prime}$ , formed from $\alpha-$ consistent underestimators with maximum contraction $\alpha^{\prime}$ , is $(p,c)-$ nontrivial for $\mathcal{U}$ .

Given a set of representatives, it is possible to empirically measure $p$ on a sample to improve the bounds given by Lemma 5 or Corollary 5.2. For maximum generality, we will rely only on the density and diffusion properties of the metric and distribution, but we include Corollary 5.2 as a reminder that the bounds given are by no means tight.

6.4.2 Representative set size

We now consider how likely it is that a set of random representatives drawn from $\mathcal{U}$ will form a $\gamma-$ net for $\mathcal{U}$ given the density properties of $\mathcal{D}$ on $\mathcal{U}$ . Lemma 6 (restated below) states that a set of random representatives $R$ of size $O(\frac{1}{b}\ln(\frac{1}{b\delta}))$ will be sufficient to guarantee with high probability that the submetric $\mathcal{D}_{R}$ constructed from exact evaluations of $\mathcal{D}_{r}$ via queries to the human fairness arbiter on new samples from $\mathcal{U}\times\mathcal{U}$ will have nontriviality properties related to the density and diffusion of $\mathcal{D}$ for $\mathcal{U}$ .

Lemma 6 (Restatement).

If a metric $\mathcal{D}$ is $(\gamma,a,b)-$ dense and $(p,\frac{6\gamma}{1-c})-$ diffuse on $\mathcal{U}$ , then a random set of representatives $R$ of size at least $\frac{1}{b}\ln(\frac{1}{b\delta})$ will produce a $(p-(1-a)^{2},c)$ -nontrivial submetric $\mathcal{D}_{R}$ for $\mathcal{U}$ with probability at least $1-\delta$ , where $\mathcal{D}_{R}$ is constructed from exact evaluations of $\mathcal{D}_{r}$ via queries to the human fairness arbiter.

Proof.

Notice that if a set of representatives $R\subseteq U$ forms a $3\gamma-$ net for an $a$ fraction of $\mathcal{U}$ , then by Lemma 5 the submetric $\mathcal{D}_{R}$ will be $(p^{\prime},c)-$ nontrivial for $p^{\prime}\geq p-(1-a)^{2}$ .

Suppose that a metric is $(\gamma,a,b)-$ dense. Denote the weight $a$ subset of $U$ (with associated weight $b$ $\gamma-$ close subsets) as $A$ . Suppose that a random sample $R\sim\mathcal{U}$ of size $m$ does not form a $\gamma-$ net for $A$ . Then it must be the case that there is at least weight $b$ of $\mathcal{U}$ not included in $R$ . That is, the associated weight $b$ subset of at least one element in $A$ is not “hit” by any representative. Thus, it is sufficient to bound the probability that weight $b$ of $\mathcal{U}$ corresponding to an element in $A$ is not hit by a sample of size $m$ to determine if our sample forms a $\gamma-$ net for $A$ , satisfying the conditions of the lemma.

As a warm-up, suppose that all of the weight $b$ subsets corresponding to elements in $A$ are disjoint. The probability that all $m$ samples do not fall into a particular weight $b$ subset of $\mathcal{U}$ is $(1-b)^{m}$ . Notice that if all elements in $A$ have disjoint associated weight $b$ subsets, then by a union bound the probability that at least one of the disjoint weight $b$ subsets of $\mathcal{U}$ is missed by all $m$ samples is at most $\frac{1}{b}(1-b)^{m}$ . (Notice that there are at most $\frac{1}{b}$ disjoint weight $b$ subsets in total weight $1$ .) Rearranging and substituting $1-b\leq e^{-b}$ , any $m$ which satisfies:

$m\geq\frac{1}{b}\ln(\frac{1}{b\delta})$

will fail to hit any subset of weight at least $b$ with probability at most $\delta$ . Thus, if the associated weight $b$ subsets for $A$ are disjoint, a set of representatives of size $\frac{1}{b}\ln(\frac{1}{b\delta})$ is sufficient to produce a $\gamma-$ net for $A$ with probability at least $1-\delta$ .

Now, consider the (more likely) case that the weight $b$ subsets for elements in $A$ are not disjoint. We will show that there is a set of disjoint weight $b$ subsets, $B_{remain}$ , such that if every disjoint subset in $B_{remain}$ is “hit” by at least one element in $R$ , then every element in $A$ is at most distance $3\gamma$ from a representative, i.e. $R$ forms a $3\gamma-$ net for $A$ .

Consider the entire set of weight $b$ subsets associated with elements in $A$ . Now, suppose that we removed the minimum number of subsets such that the remaining weight $b$ subsets were all disjoint. Call the minimal set of removed subsets $B_{remove}$ , and the set of remaining disjoint weight $b$ subsets $B_{remain}$ . Consider removing each subset in $B_{remove}$ one at a time. The last subset removed must have overlap with at least one subset in $B_{remain}$ , or there would be a smaller minimum set we could have removed which does not contain the last subset. Notice that we may remove the subsets in $B_{remove}$ in any order, and yet this observation still holds for the final subset removed. Thus, each subset in $B_{remove}$ must have overlap with a subset in $B_{remain}$ , so the furthest any element in a subset in $B_{remove}$ could be from a representative that “hits” a set in $B_{remain}$ is $4\gamma$ . However, an element in $A$ associated with a weight $b$ subset in $B_{remove}$ can only be distance $3\gamma$ from the hitting representative, as it is at most distance $\gamma$ from at least one of the element(s) overlapping with $B_{remain}$ , which are in turn at most distance $2\gamma$ from the hitting representative.

As in the disjoint case above, the size of $B_{remain}$ is bounded by $1/b$ , and the same logic applies, but forming a $3\gamma-$ net. Thus for a set of randomly sampled representatives of size $m\geq\frac{1}{b}\ln(\frac{1}{b\delta})$ , the probability of the representatives chosen not forming a $3\gamma-$ net for weight $a$ of $\mathcal{U}$ is at most $\delta$ . ∎
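The sample size in Lemma 6 can be sanity-checked on the disjoint warm-up case. This is a minimal sketch, assuming the universe splits into $1/b$ disjoint subsets of weight exactly $b$ (modeled as equally likely bins); the function names are hypothetical and the simulation is only an illustration of the union bound, not part of the proof.

```python
import math
import random

def rep_set_size(b, delta):
    """Number of random representatives (1/b) * ln(1/(b*delta)), rounded up."""
    return math.ceil((1.0 / b) * math.log(1.0 / (b * delta)))

def union_bound(b, m):
    """Upper bound (1/b) * (1 - b)^m on the chance that some weight-b subset is missed."""
    return (1.0 / b) * (1.0 - b) ** m

def miss_rate(b, m, trials=2000, seed=0):
    """Empirical frequency with which m uniform samples miss at least one of
    the 1/b disjoint bins (assumption: bins are equally likely)."""
    rng = random.Random(seed)
    bins = round(1.0 / b)
    misses = 0
    for _ in range(trials):
        hit = {rng.randrange(bins) for _ in range(m)}
        if len(hit) < bins:
            misses += 1
    return misses / trials

b, delta = 0.1, 0.05
m = rep_set_size(b, delta)
print(m, union_bound(b, m), miss_rate(b, m))
```

Note that the required size depends only on $b$ and $\delta$; the parameters $\gamma$ and $a$ affect the quality of the resulting net, not the number of representatives.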

Corollary 6.1 is the consistent underestimator analog of Lemma 6.

Corollary 6.1.

If a metric $\mathcal{D}$ is $(\gamma,a,b)-$ dense and $(p,\frac{6\gamma+\alpha^{\prime}}{1-c})-$ diffuse on $\mathcal{U}$ , then a random set of representatives $R$ of size at least $\frac{1}{b}\ln(\frac{1}{b\delta})$ will produce a $(p-(1-a)^{2},c)$ -nontrivial $\alpha-$ submetric $\mathcal{D}_{R}^{\prime}$ , constructed from exact evaluations of $\alpha-$ consistent underestimators, for each representative with maximum contraction $\alpha^{\prime}$ for $\mathcal{U}$ with probability at least $1-\delta$ .

Our strategy of using random representatives is motivated by a desire for as much generality as possible with respect to the form of the metric. However, random sampling is not the only method to construct a $\gamma-$ net.

Remark 1.

Choosing a set at random to form a $\gamma-$ net ignores the information provided by each of the representatives. A $\gamma-$ net for a fixed sample, or some weight of a fixed sample, can be constructed via a greedy algorithm rather than random sampling. The key obstacle to analyzing the effectiveness of a greedy procedure is that the choice of the next representative, based on the weight of elements it may add to the net, can be based only on the existing incomplete distance information. In some cases, this incomplete information may lead to very sub-optimal choices. However, there may be procedures which take advantage of quad queries and rough ordering information to reduce the number of mistakes made, at the cost of additional queries to the human fairness arbiter. For example, quad queries can be used to check a small sample of the elements a candidate representative $r_{c}$ is expected to add against a known distance pair of approximately distance $\gamma$ in order to better estimate the expected contribution to the net. We anticipate that such strategies may be useful in practice, even if a rigorous theoretical analysis for arbitrary metrics is pessimistic. It may also be useful to characterize the set of metrics which have bounded error in this incomplete information scenario, and we pose this as an open question for future work.

6.5 Generalization with random representative sets

Thus far we have shown that a random set of representatives can have good properties for new samples drawn from the distribution, assuming we construct the submetric from exact evaluations of $\mathcal{D}_{r}$ or $\mathcal{D}_{r}^{\prime}$ , i.e. with unlimited access to the human fairness arbiter. We now combine the results of Theorem 17 and Lemma 6 to show how to construct an efficient submetric learner which produces submetrics with good nontriviality properties, given limited query access to the arbiter for training data generation.

Algorithm 8 picks a set of representatives which will form a $\gamma-$ net for weight $a$ of a $(\gamma,a,b)-$ dense metric with probability at least $1-\delta/2$ . (Recall from Lemma 6 that the number of representatives required depends only on $b$ for a $(\gamma,a,b)-$ dense metric.) These representatives are then used to specify a set of $\alpha_{\mathcal{T}}-$ submetric learners (via Algorithm 6) which are passed to Algorithm 7 to construct a good final combined submetric with probability at least $1-\delta/2$ . (That is, Algorithm 8 splits its failure probability “budget” evenly between the choice of representatives and the learners for each representative.)

Theorem 19.

Given a distance metric $\mathcal{D}$ , and a distribution $\mathcal{U}$ over the universe if

1. There exists a set of thresholds $\mathcal{T}$ and efficient learners $\{L_{t_{i}\in\mathcal{T}}^{r}\}$ as in Assumption 1, and

2. $\mathcal{D}$ is $(\gamma,a,b)-$ dense and $(p,\frac{6\gamma+\alpha_{\mathcal{T}}}{1-c})-$ diffuse on $\mathcal{U}$ ,

then there exists an efficient submetric learner which produces a hypothesis $h_{R}$ with probability greater than $1-\delta$ such that

1. $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[h_{R}(x,y)\geq\mathcal{D}(x,y)+\alpha_{\mathcal{T}}]\leq\varepsilon.$

2. $h_{R}$ is $(p-(1-a)^{2}-\varepsilon,c)-$ nontrivial for $\mathcal{U}$ .

which runs in time $O(poly(\frac{1}{b}\ln(\frac{1}{b\delta}),|\mathcal{T}|,\frac{1}{\varepsilon},\frac{1}{\delta}))$ for all $\varepsilon,\delta\in(0,1]$ .

Proof.

Claim: Algorithm 8 parametrized with a set of thresholds and learners as specified in Assumption 1 and $b$ for a $(\gamma,a,b)-$ dense metric is an efficient submetric learner as specified in the Theorem statement. We prove the claim with respect to each aspect of the theorem separately for clarity.

Running time. Algorithm 8 makes a single call to Algorithm 7 which runs in time $O(poly(|R|,|\mathcal{T}|,\frac{1}{\varepsilon_{R}},\frac{1}{\delta_{R}}))$ (per Theorem 17). The parameters are set such that $|R|=\frac{1}{b}\ln(\frac{2}{b\delta})$ , $\varepsilon_{R}=\varepsilon$ and $\delta_{R}=\delta/2$ . Thus Algorithm 8 runs in time $O(poly(\frac{1}{b}\ln(\frac{1}{b\delta}),|\mathcal{T}|,\frac{1}{\varepsilon_{R}},\frac{1}{\delta_{R}}))$ .

Failure probability. The failure probability $\delta$ is split evenly between the failure to produce a good set of representatives (per Lemma 6) and the failure probability of Algorithm 7.

Overestimate error probability. Algorithm 7 is invoked directly with $\varepsilon$ , so the overestimate error probability is $\varepsilon$ .

Nontriviality. The probability that the set of randomly chosen representatives in Algorithm 8 does not form a $3\gamma$ -net for at least $a$ weight of $\mathcal{U}$ is less than or equal to $\delta/2$ per Lemma 6. Given that the randomly chosen representatives do form a $3\gamma-$ net, notice that, as in Corollary 5.1, if a $p-$ fraction of distances in $\mathcal{U}\times\mathcal{U}$ have distance greater than $\frac{6\gamma+\alpha_{\mathcal{T}}}{1-c}$ , then $h_{R}$ is $(p-(1-a)^{2}-\varepsilon,c)-$ nontrivial. ∎

In the spirit of Corollary 17.1, we can also re-state Theorem 19 in terms of arbitrary learners for $h_{r}$ , rather than constructing directly from threshold function learners.

Corollary 19.1.

Given a distance metric $\mathcal{D}$ and a distribution $\mathcal{U}$ over the universe if

1. There exists a set of efficient learners $L=\{L_{r}\}$ such that given access to labeled samples, each $L_{r}$ produces a hypothesis $h_{r}$ such that $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[|h_{r}(x,y)-\mathcal{D}(x,y)|\geq\alpha]\leq\varepsilon_{r}$ with probability at least $1-\delta_{r}$ , and

2. $\mathcal{D}$ is $(\gamma,a,b)-$ dense and $(p,\frac{6\gamma+\alpha}{1-c})-$ diffuse on $\mathcal{U}$ ,

then there exists an efficient submetric learner which produces a hypothesis $h_{R}$ with probability greater than $1-\delta$ such that

1. $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[h_{R}(x,y)\geq\mathcal{D}(x,y)+\alpha]\leq\varepsilon.$

2. $h_{R}$ is $(p-(1-a)^{2}-\varepsilon,c)-$ nontrivial for $\mathcal{U}$ .

which runs in time $O(poly(\frac{1}{b}\ln(\frac{1}{b\delta}),\frac{1}{\varepsilon},\frac{1}{\delta}))$ for all $\varepsilon,\delta\in(0,1]$ .

Theorem 19 can also be restated to take into account postprocessing to reduce the additive error. In particular, Corollary 19.2 trades an increase of $\alpha$ in diffusion for a reduction of $\alpha$ in the error guarantee.

Corollary 19.2.

Given a distance metric $\mathcal{D}$ and a distribution $\mathcal{U}$ over the universe if

1. There exists a set of efficient learners $L=\{L_{r}\}$ such that given access to labeled samples, each $L_{r}$ produces a hypothesis $h_{r}$ such that $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[|h_{r}(x,y)-\mathcal{D}(x,y)|\geq\alpha]\leq\varepsilon_{r}$ with probability at least $1-\delta_{r}$ , and

2. $\mathcal{D}$ is $(\gamma,a,b)-$ dense and $(p,\frac{6\gamma+2\alpha}{1-c})-$ diffuse on $\mathcal{U}$ ,

then there exists an efficient submetric learner which produces a hypothesis $h_{R}$ with probability greater than $1-\delta$ such that

1. $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[h_{R}(x,y)\geq\mathcal{D}(x,y)]\leq\varepsilon.$

2. $h_{R}$ is $(p-(1-a)^{2}-\varepsilon,c)-$ nontrivial for $\mathcal{U}$ .

which runs in time $O(poly(\frac{1}{b}\ln(\frac{1}{b\delta}),\frac{1}{\varepsilon},\frac{1}{\delta}))$ for all $\varepsilon,\delta\in(0,1]$ .

6.5.1 Human fairness arbiter query complexity

We now formally reason about the query complexity to the human fairness arbiter to generate training data for Algorithm 8.

Theorem 20.

Assuming the minimum gap between any pair of thresholds in $\mathcal{T}$ is at most a constant $\alpha$ , sufficient labeled training data for Algorithm 8 can be produced from $O(\log(\hat{N}\frac{1}{b}\ln(\frac{1}{b\delta})))$ queries to $\mathsf{O}_{\mathsf{REAL}}$ , $O(\frac{1}{b}\ln(\frac{1}{b\delta})\hat{N}\log(\hat{N}))$ queries to $\mathsf{O}_{\mathsf{TRIPLET}}$ , and $O(\frac{1}{b}\ln(\frac{1}{b\delta})\hat{N}\log(\frac{1}{b}\ln(\frac{1}{b\delta})))$ queries to $\mathsf{O}_{\mathsf{QUAD}}$ , where $\hat{N}=O(poly(\frac{1}{b}\ln(\frac{1}{b\delta}),\frac{1}{\alpha},\frac{1}{\varepsilon},\frac{1}{\delta}))$ is the number of samples required to train a single threshold learner.

Proof.

First, notice that no assumption is made in the proofs of error or failure probability which requires independence of error or failure probability of the threshold function hypotheses. Thus, labeling a single set of training data at granularity $\alpha$ is sufficient for all of the threshold function learners for a single representative, i.e. we do not need to label a new sample for each threshold function.

Call the number of samples needed to train a threshold function $\hat{N}$ . Recall from Assumption 1 that the threshold function learners run in time $O(poly(\frac{1}{\varepsilon_{t}},\frac{1}{\delta_{t}}))$ . The choice of parameters in Algorithms 8, 7 and 6 results in

[TABLE]

and

[TABLE]

Thus the threshold function learners run in time $O(poly(\frac{1}{\varepsilon},\frac{1}{\delta},\frac{1}{\alpha},\frac{1}{b}\ln(\frac{1}{b\delta})))$ , and can use no more than that many samples.

Recall from Theorem 15 that to produce labels for a set of elements of size $N$ to granularity $\alpha$ for a set of representatives $R$ , we require at most $O(\log(|R|N))$ queries to $\mathsf{O}_{\mathsf{REAL}}$ , $O(|R|N\log(N))$ queries to $\mathsf{O}_{\mathsf{TRIPLET}}$ and $O(|R|N\log(|R|))$ queries to $\mathsf{O}_{\mathsf{QUAD}}$ . Therefore to produce labels for a universe of size $\hat{N}=O(poly(\frac{1}{\varepsilon},\frac{1}{\delta},\frac{1}{\alpha},\frac{1}{b}\ln(\frac{1}{b\delta})))$ , for a set of representatives of size $|R|=O(\frac{1}{b}\ln(\frac{1}{b\delta}))$ , we will require $O(\log(\hat{N}\frac{1}{b}\ln(\frac{1}{b\delta})))$ queries to $\mathsf{O}_{\mathsf{REAL}}$ , $O(\frac{1}{b}\ln(\frac{1}{b\delta})\hat{N}\log(\hat{N}))$ queries to $\mathsf{O}_{\mathsf{TRIPLET}}$ and $O(\frac{1}{b}\ln(\frac{1}{b\delta})\hat{N}\log(\frac{1}{b}\ln(\frac{1}{b\delta})))$ queries to $\mathsf{O}_{\mathsf{QUAD}}$ . ∎
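To get a feel for how the three query types scale, the sketch below evaluates the leading terms of the bounds for given $|R|$ and $\hat{N}$. It ignores all hidden constants and assumes base-2 logarithms, so the numbers indicate relative growth only and are not actual query counts; `query_counts` is a hypothetical helper.

```python
import math

def query_counts(n_hat, num_reps):
    """Leading terms of the query bounds, with constants dropped and
    logarithms taken base 2 (both are illustrative assumptions)."""
    return {
        "real": math.log2(n_hat * num_reps),
        "triplet": num_reps * n_hat * math.log2(n_hat),
        "quad": num_reps * n_hat * math.log2(num_reps),
    }

# Hypothetical sizes: 53 representatives and 1000 training samples per learner.
counts = query_counts(n_hat=1000, num_reps=53)
for name, value in counts.items():
    print(name, round(value, 1))
```

The real-valued term grows only logarithmically, while the relative (triplet and quad) terms grow roughly linearly in $|R|\hat{N}$, which reflects the design goal of asking the arbiter for few real-valued judgments.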

7 Relaxed query model

In this section, we extend our results to a second, relaxed arbiter model, in which arbiters are not expected to make arbitrarily small distinctions between distances or individuals or to provide arbitrarily precise real-valued distances. The relaxed model assumes that there are two fixed constants, $\alpha_{L}$ , the minimum precision with which the arbiter can distinguish elements or distances, and $\alpha_{H}$ , a bound on the magnitude of the (potentially biased) noise in the arbiter’s real-valued responses. For any comparisons with difference smaller than $\alpha_{L}$ , the arbiter declares the elements indistinguishable or the difference “too close to call.” The model allows for a “gray area” between $\alpha_{L}$ and $\alpha_{H}$ in which the arbiter may either respond with the true answer or “too close to call.” For any differences larger than $\alpha_{H}$ , the arbiter responds with the true answer. We assume that $\alpha_{L}$ and $\alpha_{H}$ are fixed constants for each task and cannot be manipulated.[27]

[27] I.e., we cannot parametrize $\alpha_{L}$ or $\alpha_{H}$ to use the arbiter as an arbitrary threshold distinguisher to improve query complexity and accuracy tradeoffs.

Definition 19 (Real-valued query (too close to call model)).

$\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}(u,v):=\mathcal{D}(u,v)\pm\eta$ , where $|\eta|<\alpha_{H}$ i.e. the human fairness arbiter provides a real-valued distance between $u$ and $v$ with error at most $\alpha_{H}$ .

Definition 20 (Triplet query (too close to call model)).

Given $a,b,c\in U$ , define $\mathsf{diff}:=|\mathcal{D}(a,b)-\mathcal{D}(a,c)|$ .

[TABLE]

Definition 21 (Quad query (too close to call model)).

Given $a,b,x,y\in U$ , define $\mathsf{diff}:=|\mathcal{D}(a,b)-\mathcal{D}(x,y)|$ .

[TABLE]

In addition to the arbiter consistency assumptions enumerated in Section 3 (all arbiters agree, query responses do not change over the learning period, real-valued and relative query responses are consistent), we also assume that if the arbiter answers that the distances between ( $a$ and $b$ ) and ( $x$ and $y$ ) are indistinguishable, then the real-valued distances will also be at most $\alpha_{H}$ apart, and analogously that if the distances are distinguishable, then the real-valued distances will be at least $\alpha_{L}$ apart.
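The behavior of the relaxed arbiter can be simulated with a small sketch. It follows the prose description of the model: differences below $\alpha_{L}$ are too close to call, differences above $\alpha_{H}$ get the true answer, and the gray area in between may go either way. The return encoding (`"b"` or `"c"` for the element closer to `a`), the coin flip in the gray area, and the uniform noise are all assumptions made for illustration, and the sketch does not enforce the consistency assumptions between real-valued and relative responses.

```python
import random

TCTC = "too close to call"

def make_tctc_oracles(D, alpha_L, alpha_H, seed=0):
    """Build simulated real-valued and triplet oracles for a known metric D.
    Hypothetical helper: a real arbiter is a human, not a function of D."""
    rng = random.Random(seed)

    def real(u, v):
        # Noise strictly smaller than alpha_H in magnitude (uniform is an assumption).
        eta = rng.uniform(-1.0, 1.0) * 0.99 * alpha_H
        return D(u, v) + eta

    def triplet(a, b, c):
        dab, dac = D(a, b), D(a, c)
        diff = abs(dab - dac)
        if diff < alpha_L:
            return TCTC  # indistinguishable
        if diff <= alpha_H and rng.random() < 0.5:
            return TCTC  # gray area: either response is allowed
        return "b" if dab < dac else "c"  # element closer to a

    return real, triplet

real, triplet = make_tctc_oracles(lambda x, y: abs(x - y), alpha_L=0.05, alpha_H=0.1)
print(triplet(0.0, 0.01, 0.02))  # difference 0.01 is below alpha_L
print(triplet(0.0, 0.1, 0.5))    # difference 0.4 is above alpha_H
print(real(0.2, 0.7))            # within alpha_H of the true distance 0.5
```

A quad oracle would follow the same pattern with $\mathsf{diff}:=|\mathcal{D}(a,b)-\mathcal{D}(x,y)|$ in place of the triplet difference.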

In the remainder of this section, we extend our results from the original “exact” model to this “too close to call” (TCTC) model. We show that the too close to call model allows for a significant reduction in real-valued query complexity (from logarithmic to constant) but at the cost of always having perceivable additive error in the submetrics produced, i.e. no $\alpha-$ submetric for $\alpha<2\alpha_{H}$ can be achieved without postprocessing and a corresponding trade-off in non-triviality parameters.

7.1 Submetrics from human judgements in the too close to call model

We first extend the results of Section 4 to the too close to call model. Algorithm 9 is the too close to call analog of Algorithm 4. (As before, we use $\mathsf{MidpointOf}$ to specify the midpoint function, which chooses the midpoint for odd length lists and rounds down for even length lists.) Algorithm 9 follows the same basic recipe of sorting and then labeling, but the sorting step produces sorted sets whose distances from the representative are indistinguishable. The elements in each set are then labeled with the distance between a distinguished element in the set and the representative. Thus the error of a distance label is a combination of the error of the real-valued query for the distinguished element and the difference with the distinguished element’s distance to the representative.

Theorem 21 is the too close to call analog of Theorem 14, and states that Algorithm 9 requires only $O(\frac{1}{\alpha_{L}})$ real-valued queries and $O(N\log(\frac{1}{\alpha_{L}}))$ triplet queries to produce a $4\alpha_{H}$ -submetric. The proof of the theorem follows from observing that there are at most $O(\frac{1}{\alpha_{L}})$ sets of elements with indistinguishable differences in distance (i.e. differences of less than $\alpha_{L}$ ) in $[0,1]$ , and that order $\alpha_{H}$ errors in the distance labels accrue in the mapping to distinguished elements and in the real-valued queries for the distinguished element. The primary difference between Algorithms 9 and 4 is that in Algorithm 9 the arbiter identifies when elements or distances are indistinguishable (difference less than $\alpha_{L}$ ) from relative queries alone. Thus Algorithm 9 groups indistinguishable elements together and acts only on the subset of distinguishable elements, which has size bounded by $O(\frac{1}{\alpha_{L}})$ . The removal of the $\log(N)$ dependency in the number of real-valued queries compared with Lemma 13 follows from the query model explicitly preventing any recursive calls on ranges of size less than $\alpha_{L}$ .

Theorem 21.

Given access to $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}$ and $\mathsf{O}_{\mathsf{TRIPLET}}^{\mathsf{TCTC}}$ , Algorithm 9 constructs a $4\alpha_{H}-$ submetric from $O(\frac{1}{\alpha_{L}})$ queries to $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}$ and $O(N\log(\frac{1}{\alpha_{L}}))$ queries to $\mathsf{O}_{\mathsf{TRIPLET}}^{\mathsf{TCTC}}$ which preserves distances (up to the additive error) from a representative $r$ .

Proof.

Query complexity. Queries to $\mathsf{O}_{\mathsf{TRIPLET}}^{\mathsf{TCTC}}$ are made only by $\mathsf{BinaryInsertTCTC}$ . Notice that if an element in $\mathsf{BinaryInsertTCTC}$ is ever within distance $\alpha_{L}$ of an existing item in the list, it is added to $\mathsf{NearCollisionList}$ rather than the working ordering, $L$ . Thus, although $\mathsf{BinaryInsertTCTC}$ is called for each element, it operates on a list of size at most $O(\frac{1}{\alpha_{L}})$ , as there are at most $O(\frac{1}{\alpha_{L}})$ elements with distances from $r$ which are different by at least $\alpha_{L}$ , and $L$ contains at most one element from each indistinguishable set. Thus, it makes $O(\log(\frac{1}{\alpha_{L}}))$ recursive calls for each element, yielding the desired bound of $O(N\log(\frac{1}{\alpha_{L}}))=O(N)$ queries to $\mathsf{O}_{\mathsf{TRIPLET}}^{\mathsf{TCTC}}$ . The desired bound on real-valued queries follows from observing that the ordering labeled by $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}$ (Lines 3-5) has at most $O(\frac{1}{\alpha_{L}})$ elements.

Correctness. Each element is considered by the algorithm, and is either sorted into the correct position in $L$ via binary search or, if its distance is indistinguishable from that of another element in $L$ , it is added to an associated set of indistinguishable elements. Once the elements are sorted, each element $x$ is either labeled with its true distance with less than $\alpha_{H}$ error, i.e. $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}(r,x)$ , or the distance of a distinguished element whose distance from $r$ is within $\alpha_{H}$ of its distance from $r$ . Thus $|f_{r}(y)-\mathcal{D}(r,y)|<2\alpha_{H}$ , accounting for the additional error in evaluations of $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}$ . Therefore $|\mathcal{D}_{r}^{\prime}(x,y)-|\mathcal{D}(r,x)-\mathcal{D}(r,y)||\leq 4\alpha_{H}$ . ∎
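The sort-then-label recipe analyzed in this proof can be sketched in a few lines. Algorithm 9 itself is not reproduced in this section, so the following is only an illustrative reconstruction under stated assumptions: the names `make_tctc_arbiter` and `label_distances`, the encoding of triplet answers as -1, 0 and +1, and the simulated arbiter (which flips a coin in the gray area) are all hypothetical and are not the author's specification.

```python
import random


def make_tctc_arbiter(dist, alpha_l, alpha_h, seed=0):
    """Simulated arbiter; dist[x] is the hidden true distance D(r, x)."""
    rng = random.Random(seed)
    cache = {}  # keeps relative answers consistent across repeated queries

    def real_query(x):
        # Noise magnitude is strictly smaller than alpha_h.
        return dist[x] + 0.999 * rng.uniform(-alpha_h, alpha_h)

    def triplet_query(x, y):
        # -1: x is closer to r, +1: y is closer, 0: "too close to call".
        if (x, y) not in cache:
            diff = dist[x] - dist[y]
            if abs(diff) < alpha_l:
                answer = 0
            elif abs(diff) <= alpha_h and rng.random() < 0.5:
                answer = 0  # gray area: the arbiter may decline to answer
            else:
                answer = -1 if diff < 0 else 1
            cache[(x, y)] = answer
            cache[(y, x)] = -answer
        return cache[(x, y)]

    return real_query, triplet_query


def label_distances(elements, real_query, triplet_query):
    """Binary-insert each element, bucketing near-collisions, then label."""
    ordering = []  # one distinguished element per indistinguishable set
    buckets = {}   # distinguished element -> members of its set
    for x in elements:
        lo, hi = 0, len(ordering)
        while lo < hi:
            mid = (lo + hi) // 2
            answer = triplet_query(x, ordering[mid])
            if answer == 0:
                buckets[ordering[mid]].append(x)  # near-collision
                break
            if answer < 0:
                hi = mid
            else:
                lo = mid + 1
        else:
            # No near-collision: x becomes a new distinguished element.
            ordering.insert(lo, x)
            buckets[x] = [x]
    labels = {}
    for d in ordering:
        value = real_query(d)  # one real-valued query per set
        for x in buckets[d]:
            labels[x] = value
    return labels, len(ordering)
```

In this sketch every directional answer implies a true difference of at least $\alpha_{L}$ , so the working ordering stays sorted with gaps of at least $\alpha_{L}$ and the number of real-valued queries is bounded independently of $N$ , mirroring the query complexity argument above.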

Theorem 22 likewise extends the results of Theorem 15 to the too close to call model using the same observations as in the proof of Theorem 21. As in Algorithm 5, Algorithm 10 uses quad queries to sort (element, representative) pairs, and then labels the resulting list at the specified granularity. As in Algorithm 9, the sorting step produces a sorted list of indistinguishable (element, representative) pair sets of size bounded by $O(\frac{1}{\alpha_{L}})$ .

Theorem 22.

Given a set of representatives $R$ and access to $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}$ and $\mathsf{O}_{\mathsf{QUAD}}^{\mathsf{TCTC}}$ , a $4\alpha_{H}-$ submetric can be constructed from $O(\frac{1}{\alpha_{L}})$ queries to $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}$ and $O(|R|N\log(\frac{1}{\alpha_{L}}))$ queries to $\mathsf{O}_{\mathsf{QUAD}}^{\mathsf{TCTC}}$ which preserves distances (up to the additive error) from the set of representatives $R$ .

Proof.

Consider Algorithm 10.

Query complexity. Queries to $\mathsf{O}_{\mathsf{QUAD}}^{\mathsf{TCTC}}$ are made only by $\mathsf{BinaryInsertPairTCTC}$ . As in the proof of Theorem 21, if an element in $\mathsf{BinaryInsertPairTCTC}$ is ever within distance $\alpha_{L}$ of an existing item in the list, it is added to $\mathsf{NearCollisionList}$ rather than the working ordering, $L$ . Thus, although $\mathsf{BinaryInsertPairTCTC}$ is called for each element $|R|$ times, it operates on a list of size at most $O(\frac{1}{\alpha_{L}})$ , so it makes $O(\log(\frac{1}{\alpha_{L}}))$ recursive calls for each element, yielding the desired bound of $O(|R|N\log(\frac{1}{\alpha_{L}}))$ queries to $\mathsf{O}_{\mathsf{QUAD}}^{\mathsf{TCTC}}$ . The desired bound on real-valued queries follows from observing that the ordering labeled by $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}$ (Lines 3-5) has at most $O(\frac{1}{\alpha_{L}})$ elements, as in the proof of Theorem 21.

Correctness. By the same logic as in the proof of Theorem 21, each element $x$ is either labeled with its true distance with at most $\alpha_{H}$ error, $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}(r,x)$ or the distance of a distinguished element whose distance from $r$ is within $\alpha_{H}$ of its distance. Thus $|f_{r}(y)-\mathcal{D}(r,y)|<2\alpha_{H}$ , accounting for the additional error in evaluations of $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}$ . Therefore $|\mathcal{D}_{r}^{\prime}(x,y)-|\mathcal{D}(r,x)-\mathcal{D}(r,y)||\leq 4\alpha_{H}$ . ∎

Bounds on perceivable error.

Unlike in the exact arbiter model, in which the additive error of a submetric can be made an arbitrarily small constant, Algorithms 9 and 10 result in additive error of up to $4\alpha_{H}$ . A reasonable question is whether any query procedure in the too close to call model can produce a submetric with no perceivable additive error without some additional post-processing. Indeed, even with the naive construction of asking $O(N)$ real-valued queries, the submetric produced can have additive error strictly greater than $\alpha_{H}$ without further post-processing.

Proposition 23.

The representative submetric $\mathcal{D}_{r}(x,y):=|\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}(r,x)-\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}(r,y)|$ can have additive error greater than $\alpha_{H}$ .

The proof of the proposition follows from the observation that each query to $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}$ has error of $\eta$ such that $|\eta|<\alpha_{H}$ . Thus, in the worst case, when $\eta>\alpha_{H}/2$ , $(\mathcal{D}(r,x)+\eta)-(\mathcal{D}(r,y)-\eta)=\mathcal{D}(r,x)-\mathcal{D}(r,y)+2\eta>\mathcal{D}(r,x)-\mathcal{D}(r,y)+\alpha_{H}$ . So we cannot expect to produce submetrics without perceptible error without some additional post-processing on the values queried from the arbiter.
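A small numeric instance may make the worst case concrete. The helper name `representative_submetric_error` and the specific values below are illustrative assumptions, chosen only so that the two noise terms have opposite signs and magnitude above $\alpha_{H}/2$ .

```python
def representative_submetric_error(d_rx, d_ry, eta_x, eta_y):
    """Additive error of |O(r,x) - O(r,y)| relative to |D(r,x) - D(r,y)|."""
    reported = abs((d_rx + eta_x) - (d_ry + eta_y))
    return reported - abs(d_rx - d_ry)


alpha_h = 0.1
# Both responses are within alpha_h of the truth, but err in opposite directions.
error = representative_submetric_error(0.5, 0.3, 0.08, -0.08)
```

Here each individual response is off by less than $\alpha_{H}$ , yet the reported difference exceeds the true difference by $2\eta$ , which is more than $\alpha_{H}$ .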

7.2 Generalization

We now turn our attention to extending the generalization results of Sections 5 and 6. Notice that unlike the exact model we won’t necessarily be able to label a sample with 100% accuracy for every threshold function for every representative. The key problem is that in the too close to call model each element’s distance from a representative has bi-directional error, i.e. it can have either over or under-estimated distance from the representative. This bi-directional error prevents us from using the nice properties of a consistent underestimator, so translating to the desired binary labels for a given threshold is not straightforward. To get around this labeling problem, we modify the distribution of samples presented to each learner, in particular eliminating samples whose labels are ambiguous. We then reason about the error of the combination of hypotheses for distributions with disjoint sets of ambiguous points removed.

Recall that Algorithm 10 assigns distances $f_{r}(x)$ such that $|f_{r}(x)-\mathcal{D}(r,x)|\leq 2\alpha_{H}$ . Thus, any element $x$ such that $f_{r}(x)>t_{i}+2\alpha_{H}$ is truly greater than distance $t_{i}$ from $r$ (and analogously less than $t_{i}$ if $f_{r}(x)<t_{i}-2\alpha_{H}$ ). Intuitively, this means that we can generate accurate threshold function labels for points sufficiently far (more than $2\alpha_{H}$ ) from the threshold. We formally define the unambiguous threshold distribution below to capture only the elements whose relative distances are unambiguous.

Definition 22 (Unambiguous threshold distribution).

Given a distribution $\mathcal{U}$ over a universe of individuals $U$ , a representative $r$ , a labeling procedure which produces (noisy) distance labels with bi-directional additive error of at most $2\alpha_{H}$ from the representative $r$ , and a threshold $t$ , the unambiguous threshold distribution $\mathcal{U}_{t}^{r}$ is the re-normalized distribution $\mathcal{U}$ with all weights on elements labeled with distances within $2\alpha_{H}$ of $t$ set to 0.

Notice that the unambiguous threshold distribution is well-defined without knowledge of the exact distances from the representative, as it is specified based on the labeling procedure, rather than exact distances from the representative itself. Thus, we can reason about learning on the distribution $\mathcal{U}_{t}^{r}$ without worrying about whether any elements are ambiguously labeled. Algorithm 11 specifies a labeling procedure for training data for each of the threshold functions for the distributions $\mathcal{U}_{t}^{r}$ for $t\in\mathcal{T}$ , i.e., only samples with unambiguous labels.

In order for the threshold learners to succeed, we need sufficient labeled training data for each threshold function for each representative. However, Algorithm 11 tosses out any examples for a given learner which are too close to the threshold value. Thus, there is some risk that there are too few samples produced for a given threshold. Lemma 24 below states that Algorithm 11 generates a sufficient number of labeled samples of $\mathcal{U}_{t_{i}}^{r}$ for all but one $t_{i}\in\mathcal{T}$ for each representative given an initial sample of size $|S|=3\hat{m}$ , where $\hat{m}$ is the number of labeled samples required for each threshold function learner.

Lemma 24.

Given a set of samples $S\sim\mathcal{U}$ of size $3\hat{m}$ and a set of thresholds $\mathcal{T}$ such that $|t_{i}-t_{j}|>2\alpha_{H}$ for all $i\neq j$ , Algorithm 11 produces at least $\hat{m}$ labeled examples from $\mathcal{U}_{t_{i}}^{r}$ for at least $|\mathcal{T}|-1$ of the thresholds in $\mathcal{T}$ .

Proof.

Correctness of labels. Recall from the proof of Theorem 22 that each element $x\in U$ is labeled with a distance $f_{r}(x)$ such that $|f_{r}(x)-\mathcal{D}(r,x)|\leq 2\alpha_{H}$ for each representative $r\in R$ . Thus, each element labeled above or below $t_{i}$ which is at least $2\alpha_{H}$ distant from $t_{i}$ is correctly labeled for the representative $r$ .

Quantity of labeled examples. Since each threshold is at least $2\alpha_{H}$ away from its neighboring thresholds, any element which is discarded for $t_{i}$ is included as a labeled example for every other threshold $t_{j\neq i}$ for a representative. Suppose that at least one threshold $t_{k}$ has fewer than $\hat{m}$ labeled examples for a representative. Then at least $2\hat{m}$ examples were discarded for $t_{k}$ for the representative. However, this leaves at most $\hat{m}$ samples which could be discarded for any other threshold, so all of the other thresholds must have at least $2\hat{m}$ labeled samples for this representative. Thus, at most one threshold will have fewer than $\hat{m}$ labeled samples for each representative.

Correctness of distribution. $\mathcal{U}_{t_{i}}^{r}$ is defined with respect to the labeling procedure, and thus it is possible to simulate $\mathcal{U}_{t_{i}}^{r}$ by labeling elements and discarding any elements whose labels are ambiguous. Thus Algorithm 11 simulates $\mathcal{U}_{t_{i}}^{r}$ , and the sets of labeled data produced are indistinguishable from a set drawn from $\mathcal{U}_{t_{i}}^{r}$ directly.

∎
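The discard-and-label step that this lemma reasons about can be sketched as follows. Algorithm 11 is not reproduced in this section, so the function name `unambiguous_labels` and its data layout are hypothetical. The label convention (1 when the element lies below the threshold, 0 otherwise) is an assumption read off from the use of $T_{t}^{r}$ in the proof of Lemma 25.

```python
def unambiguous_labels(noisy_dist, thresholds, alpha_h):
    """Return {t: [(x, label), ...]}, keeping only unambiguous samples.

    noisy_dist[x] is a distance label f_r(x) assumed to satisfy
    |f_r(x) - D(r, x)| <= 2 * alpha_h.
    """
    margin = 2 * alpha_h
    labeled = {t: [] for t in thresholds}
    for x, f in noisy_dist.items():
        for t in thresholds:
            # Discard x for threshold t if its label is within 2*alpha_h of t.
            if abs(f - t) > margin:
                # Assumed convention: label 1 when x is closer to r than t.
                labeled[t].append((x, 1 if f < t else 0))
    return labeled
```

Because a kept sample has a noisy label more than $2\alpha_{H}$ from the threshold and the noise is at most $2\alpha_{H}$ , the noisy and true distances fall on the same side of the threshold, which is the correctness argument above.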

With the labeling procedure in place, we now introduce Algorithm 12, the too close to call analog of Algorithm 8. As in Algorithm 8, Algorithm 12 first samples a set of representatives, according to the size requirements of Lemma 6, generates a set of labeled samples for each threshold function via Algorithm 11 and then calls the threshold learners on the appropriate modified distributions with appropriately scaled parameters and combines their resulting hypotheses into a single hypothesis for the combined submetric.

To make the sample generation book-keeping clearer, we slightly modify the specification of the threshold learners (but not the core assumption) so that we can more clearly specify the sample distributions passed to each learner. We also introduce a minimum granularity for the thresholds determined by $\alpha_{H}$ .

Assumption 2.

Given a metric $\mathcal{D}$ and a representative $r$ , there exists a set of thresholds $\mathcal{T}$ such that

1. $t\in[0,1]$ for all $t\in\mathcal{T}$ ,

2. $0\in\mathcal{T}$ ,

3. $2\alpha_{H}<\alpha_{\mathcal{T}}=\max_{i\in[|\mathcal{T}|-1]}t_{i+1}-t_{i}$ ,

4. $|\mathcal{T}|=O(1)$ ,

and for every $t\in\mathcal{T}$ there exists an efficient learner $L_{t}^{r}(\varepsilon_{t},\delta_{t},\mathcal{M}_{t}^{r})$ which for all $\varepsilon_{t},\delta_{t}\in(0,1]$ , with probability at least $1-\delta_{t}$ produces a hypothesis $h_{t}^{r}$ such that

[TABLE]

in time $O(poly(\frac{1}{\varepsilon_{t}},\frac{1}{\delta_{t}}))$ with access to labeled samples of $T_{t}^{r}(u\sim\mathcal{M}_{t}^{r})$ for any distribution $\mathcal{M}_{t}^{r}$ over the universe. That is, the concept class $T_{t}^{r}$ is efficiently learnable for all $t\in\mathcal{T}$ .

In order to prove the analog of the combined exact arbiter generalization (Theorem 19) we split the analysis into two steps. First, we analyze the error of $\mathsf{ThresholdCombinerTCTC}$ running on the modified distributions and threshold sets and adjust the density and diffusion requirements accordingly. We then complete the argument by analyzing the full $\mathsf{SubmetricLearnerTCTC}$ procedure parameter choices to derive the desired error and query complexity bounds.

Lemma 25 states that $\mathsf{ThresholdCombinerTCTC}$ , when parametrized with learners and samples from $\mathcal{U}_{t\in\mathcal{T}}^{r}$ where $\alpha_{\mathcal{T}}>2\alpha_{H}$ , with high probability results in a hypothesis which overestimates or underestimates distances by more than $4\alpha_{\mathcal{T}}$ with probability at most $\varepsilon$ .

Lemma 25.

Given a set of thresholds such that for all $t_{i},t_{j}\in\mathcal{T}$ , $|t_{i}-t_{j}|=\alpha_{\mathcal{T}}>2\alpha_{H}$ , a set of learners as in Assumption 2, and access to sufficient labeled examples of $L_{t_{i}\in T}^{r}$ from $\mathcal{U}_{t_{i}}^{r}$ for all $t_{i}\in\mathcal{T}$ , with probability at least $1-\delta_{r}$ $\mathsf{ThresholdCombinerTCTC}$ produces a hypothesis $h_{r}$ such that

[TABLE]

i.e., the representative submetric is efficiently learnable.

Proof.

The essence of the proof is to show that in order for $h_{r}(x,y)$ to differ from $|f_{r}^{\mathcal{T}}(x)-f_{r}^{\mathcal{T}}(y)|$ by more than $4\alpha_{\mathcal{T}}$ , then at least one threshold function other than the true thresholds for $x$ and $y$ must be in error.

Labeling samples. Lemma 24 states that sufficient labeled samples can be produced for all but one of the thresholds for each representative. Given that we must assume that we fail to produce labeled training data for one of the thresholds, the maximum gap between any pair of thresholds with sufficient training data is $2\alpha_{\mathcal{T}}$ .

Producing a sufficiently large error in $\mathsf{LinearVote}$ . Consider an element $u$ such that its true distance from the representative $r$ is between $t_{i}$ and $t_{i+1}$ . That is, $T_{t_{>i}}^{r}(u)=1$ and $T_{t_{\leq i}}^{r}(u)=0$ . First, notice that for $\mathsf{LinearVote}(\mathcal{T},H_{\mathcal{T}}^{r},u)$ to diverge from the true threshold, $t_{i}$ , by more than two threshold values, at least one threshold function hypothesis other than $h_{t_{i}}$ must be in error. Thus, even if $h_{t_{i}}$ is in error, at least one other $h_{t_{j}}$ must also be in error to produce $\mathsf{LinearVote}(\mathcal{T},H_{\mathcal{T}}^{r},u)\in\{t_{j}\mid j>i+2\text{ or }j<i-2\}$ . Thus, it is sufficient to reason about the probability that at least one threshold hypothesis, other than the correct threshold, is in error.

Error probability. Our analysis of error probability takes the worst-case assumption that for every element in $\mathcal{U}\backslash\mathcal{U}_{t_{i}}^{r}$ , the hypothesis $h_{t_{i}}^{r}$ is in error. (For simplicity of notation, we use $\mathcal{U}\backslash\mathcal{U}_{t_{i}}^{r}$ to denote the set of items with positive weight in $\mathcal{U}$ but weight $0$ in $\mathcal{U}_{t_{i}}^{r}$ .) For each threshold $t_{i}\in\mathcal{T}$ , we use $w_{i}$ to represent the weight of $\mathcal{U}\backslash\mathcal{U}_{t_{i}}^{r}$ under $\mathcal{U}$ . Recall from the proof of Lemma 24 that all of the $\mathcal{U}\backslash\mathcal{U}_{t_{i}}^{r}$ are disjoint, so $\sum_{i}w_{i}\leq 1$ . Notice that if each hypothesis is learned successfully, then it will have error probability at most $(1-w_{i})\varepsilon_{t}$ to distribute on $\mathcal{U}_{t_{i}}^{r}$ , and we assume that it always behaves badly on the weight $w_{i}$ region $\mathcal{U}\backslash\mathcal{U}_{t_{i}}^{r}$ . In the worst case, a threshold function can be in error outside of its “bad” region resulting in a mistake of more than $4\alpha_{\mathcal{T}}$ with probability at most $|\mathcal{T}|\varepsilon_{t}$ . (Recall that the maximum gap between thresholds, accounting for the label generation, is $2\alpha_{\mathcal{T}}$ .) Thus, the probability that $\mathsf{LinearVote}$ produces a value at least $4\alpha_{\mathcal{T}}$ distant from the true threshold value for either element in a pair drawn from $\mathcal{U}$ is at most $2\sum_{t_{i}\in\mathcal{T}}\frac{\varepsilon_{r}}{2|\mathcal{T}|}=\varepsilon_{r}$ .

∎
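The reference quantity $|f_{r}^{\mathcal{T}}(x)-f_{r}^{\mathcal{T}}(y)|$ used in this proof is the threshold-rounded representative distance, where $f_{r}^{\mathcal{T}}(x)$ is the largest threshold not exceeding $\mathcal{D}(r,x)$ . A minimal sketch of that rounding, assuming the true distances are available and using the hypothetical names `threshold_round` and `rounded_submetric`, is below. It illustrates the target that the learned hypothesis $h_{r}$ is compared against, not $\mathsf{LinearVote}$ or $\mathsf{ThresholdCombinerTCTC}$ themselves.

```python
def threshold_round(distance, thresholds):
    """Largest threshold t_i with t_i <= distance (thresholds must contain 0)."""
    return max(t for t in thresholds if t <= distance)


def rounded_submetric(d_rx, d_ry, thresholds):
    """|f(x) - f(y)| for distances d_rx = D(r, x) and d_ry = D(r, y)."""
    return abs(threshold_round(d_rx, thresholds) - threshold_round(d_ry, thresholds))


thresholds = [0.0, 0.25, 0.5, 0.75]
value = rounded_submetric(0.37, 0.8, thresholds)
```

With evenly spaced thresholds, rounding each distance down moves it by less than one gap, so the rounded difference stays within one gap of the true difference $|\mathcal{D}(r,x)-\mathcal{D}(r,y)|$ .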

The final piece is to state the full generalization result including non-triviality guarantees. This theorem statement and proof are nearly identical to the exact arbiter versions, with modifications only to account for the difference in the additive error parameter.

Theorem 26.

Given a distance metric $\mathcal{D}$ and a distribution $\mathcal{U}$ over the universe, if

1. there exist a set of thresholds $\mathcal{T}$ and efficient learners $\{L_{t_{i}\in\mathcal{T}}^{r}\}$ as in Assumption 2, and

2. $\mathcal{D}$ is $(\gamma,a,b)-$ dense and $(p,\frac{6\gamma+4\alpha_{\mathcal{T}}}{1-c})-$ diffuse on $\mathcal{U}$ ,

then there exists an efficient submetric learner which produces a hypothesis $h_{R}$ with probability greater than $1-\delta$ such that

1. $\Pr_{x,y\sim\mathcal{U}\times\mathcal{U}}[h_{R}(x,y)\geq\mathcal{D}(x,y)+4\alpha_{\mathcal{T}}]\leq\varepsilon$ , and

2. $h_{R}$ is $(p-(1-a)^{2}-\varepsilon,c)-$ nontrivial for $\mathcal{U}$ ,

which runs in time $O(poly(\frac{1}{b}\ln(\frac{1}{b\delta}),|\mathcal{T}|,\frac{1}{\varepsilon},\frac{1}{\delta}))$ for all $\varepsilon,\delta\in(0,1]$ .

Proof.

Claim: Algorithm 12 parametrized with a set of thresholds and learners as specified in Assumption 2 and $b$ for a $(\gamma,a,b)-$ dense metric is an efficient submetric learner as specified in the Theorem statement. We prove the claim with respect to each aspect of the theorem separately for clarity.

Running time. The running time argument is equivalent to the argument for the exact arbiter version, with the additional observation that labeling a sufficient number of samples requires an additional factor of $3$ samples.

Failure probability. As in the exact arbiter version, the failure probability $\delta$ is split evenly between the failure to produce a good set of representatives (per Lemma 6) and the failure probability of the representative submetric learners.

Overestimate error probability. The argument proceeds as in the exact arbiter version, relying on the error analysis of Lemma 25.

Nontriviality. The argument proceeds as in the exact arbiter version, with $\alpha_{\mathcal{T}}$ scaled to account for the additional error in the intermediate hypotheses $\{h_{r}|r\in R\}$ . ∎

Finally, we re-state Theorem 20 in the too close to call model to account for the improved query complexity in label generation.

Theorem 27.

Sufficient labeled training data for Algorithm 12 can be produced from $O(\frac{1}{\alpha_{L}})$ queries to $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}$ and $O(\hat{N}\frac{1}{b}\ln(\frac{1}{b\delta})\log(\frac{1}{\alpha_{L}}))$ queries to $\mathsf{O}_{\mathsf{QUAD}}^{\mathsf{TCTC}}$ , where $\hat{N}=O(poly(\frac{1}{b}\ln(\frac{1}{b\delta}),\frac{1}{\alpha_{H}},\frac{1}{\varepsilon},\frac{1}{\delta}))$ is the number of samples required to train a single threshold learner given a set of evenly spaced thresholds $\mathcal{T}$ such that $t_{i}-t_{i-1}>2\alpha_{H}=o(1)$ .

Proof.

Call the number of samples needed to train a threshold function $\hat{N}$ . Recall from Assumption 1 that the threshold function learners run in time $O(poly(\frac{1}{\varepsilon_{t}},\frac{1}{\delta_{t}}))$ . The choice of parameters in Algorithms 8, 7 and 6 results in

[TABLE]

and

[TABLE]

Thus the threshold function learners run in time $O(poly(\frac{1}{\varepsilon},\frac{1}{\delta},\frac{1}{\alpha_{H}},\frac{1}{b}\ln(\frac{1}{b\delta})))$ , and can use no more than that many samples.

Recall from Lemma 24 that to label a set of $\hat{N}$ samples will require labeling $3\hat{N}$ samples via Algorithm 10. Recall from the proof of Theorem 22 that such a set of labels can be produced from $O(\frac{1}{\alpha_{L}})$ queries to $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}$ and $O(|R|\hat{N}\log(\frac{1}{\alpha_{L}}))$ queries to $\mathsf{O}_{\mathsf{QUAD}}^{\mathsf{TCTC}}$ .

Therefore to produce labels for a universe of size $\hat{N}=O(poly(\frac{1}{\varepsilon},\frac{1}{\delta},\frac{1}{\alpha_{H}},\frac{1}{b}\ln(\frac{1}{b\delta})))$ , for a set of representatives of size $O(\frac{1}{b}\ln(\frac{1}{b\delta}))$ we will require $O(\frac{1}{\alpha_{L}})$ queries to $\mathsf{O}_{\mathsf{REAL}}^{\mathsf{TCTC}}$ and $O(\hat{N}\frac{1}{b}\ln(\frac{1}{b\delta})\log(\frac{1}{\alpha_{L}}))$ queries to $\mathsf{O}_{\mathsf{QUAD}}^{\mathsf{TCTC}}$ . ∎

As previously noted, although the query complexity guarantees are much improved in the too close to call model, the additive error of Algorithm 12 is greater than the perceivable error threshold of $\alpha_{H}$ . Of course, the metric designer may choose to post-process the resulting hypothesis by reducing every reported difference by $4\alpha_{\mathcal{T}}$ , resulting in a $0$-submetric with $\epsilon$ error probability. However, this will require a corresponding increase in the diffusion parameter to maintain the same non-triviality guarantee, as in Corollary 19.2.
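The post-processing step mentioned here amounts to shifting every reported distance down by $4\alpha_{\mathcal{T}}$ . A minimal sketch follows. The wrapper name `postprocess` is hypothetical, and clipping the result at zero so that reported distances stay non-negative is an added assumption rather than something the text specifies.

```python
def postprocess(h, alpha_t):
    """Wrap a learned hypothesis h(x, y) so every output is reduced by 4*alpha_t."""
    def shifted(x, y):
        # Clipping at 0 is an assumption of this sketch.
        return max(0.0, h(x, y) - 4 * alpha_t)
    return shifted
```

Pairs whose reported distance was already below $4\alpha_{\mathcal{T}}$ collapse to distance zero under this shift, which is one way to see why the diffusion parameter must grow to retain the same nontriviality guarantee.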

Summary

We have shown that, in cases where the arbiter is not required to answer queries with specificity below a certain level of granularity, $\alpha_{H}$ , we can still achieve a small constant additive error of $O(\alpha_{H})$ with a constant number of real-valued distance queries, $O(\frac{1}{\alpha_{L}})$ , and $O(|R|N\log(\frac{1}{\alpha_{L}}))$ relative distance queries. This additional model provides a good initial step towards handling imprecise arbiter decisions. Extending to other error modalities is an important direction for future work.

8 Discussion

8.1 Summary of main results

In summary, we have established a useful framework of nontrivial submetrics as approximations to the true metric for Individual Fairness. We have also shown that constructing submetrics based on threshold rounding on distances from representative elements has both good over and under-estimate error properties.

We have examined a limited, realistic query model of relative distance queries and real valued queries, and have shown how to construct submetrics on a fixed universe of individuals with a sublinear number of real-valued queries and $O(|R|N\log(N))$ relative distance queries. These procedures are useful both as a complete solution for offline settings, where the whole universe to be classified is known in advance, and as a way to generate training data for other fair classification schemes.

We have also shown how to learn hypotheses for a submetric which generalize well to unseen samples based on limited assumptions of efficient learnability of threshold functions. We demonstrated a technique to obtain good nontriviality guarantees in a specific setting for two dimensional Euclidean distances, and a more general framework for reasoning about the performance of a small set of random representatives with a reasonable number of queries to the human fairness arbiter to generate labeled examples for training.

With the statement of our results completed, we now pose several points of discussion and critique of the work as areas for future work and improvement.

8.2 Metric structure

Many settings where fairness is critical involve high-dimensional or unstructured data, e.g. college applications which include years of grades, test scores, free text essays and recommendations and many other features. In Section 6, we showed one special case in which the metric structure could be exploited to create more accurate submetrics with fewer representatives. How likely is it that the true metric is low-dimensional with such “nice” structure? We contend this case is more likely than it may initially appear. Consider a human fairness arbiter tasked with determining similarity for college applicants. She cannot possibly hold an applicant’s entire feature description in her working memory at once and compare it line by line with the next applicant’s. Instead, the human fairness arbiter likely has an intuitive model of what it means to be a good student, perhaps someone who is talented and has a good work ethic. As she compares students, her true comparison is based on these unobservable, complex mappings of the high dimensional application to talent and work ethic, which represent her judgment criteria for similarity. Even if the human fairness arbiter cannot articulate her mapping from the high dimensional applicant information to her low dimensional representation, her judgments which reflect the low dimensional representation can still be used for triangulation. There is also an opportunity to build on prior work concerning human decision-making and categorization in other disciplines. Further cross-disciplinary inquiry is likely to be highly beneficial to producing more realistic models of how humans encode similarity judgments.

In this work, we have relied on learning methods with particular theoretical guarantees for generalization and nontriviality with the goal of stating results independent from assumptions on the form of the metric. This focus has resulted in conservative nontriviality guarantees and numbers of representatives. In practice, exploring alternative methods based on metric structure assumptions, whether or not they have theoretical guarantees on outcomes, and instead budgeting some labeled data to measure empirical error may be more practical.

8.3 Resolving disagreements between human fairness arbiters

Thus far, we have assumed that our procedures either use a single, internally consistent human fairness arbiter or that multiple human fairness arbiters agree on all queries. Using multiple human fairness arbiters is likely preferable from the perspective of better capturing society’s view, and multiple arbiters can answer relative distance queries in parallel for Algorithm 3. However, the assumption that all human fairness arbiters agree on every query is not likely to hold up in practice.

In the case of small disagreements between human fairness arbiters, $\mathsf{minmerge}$ (defined analogously to $\mathsf{maxmerge}$ ) is a viable option. For example, if the ordering produced by two human fairness arbiters from a particular representative is the same, but there are inconsistencies in the real-valued queries (after any necessary scaling), $\mathsf{minmerge}$ will smooth out any small disagreements. Setting $\alpha_{H}$ and $\alpha_{L}$ to capture the varying levels of agreement can also have a similar effect in the too close to call model.
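One way to picture $\mathsf{minmerge}$ is as a pointwise minimum over the submetrics learned from each arbiter. The definition of $\mathsf{maxmerge}$ is not reproduced in this section, so the sketch below assumes it is a pointwise maximum and defines the hypothetical `minmerge` by analogy. It is an illustration of that assumed reading, not the author's definition.

```python
def minmerge(submetrics):
    """Combine per-arbiter submetrics by taking the smallest reported distance."""
    def merged(x, y):
        return min(d(x, y) for d in submetrics)
    return merged
```

Under this reading, when two arbiters agree on the ordering but report slightly different real-valued distances, the merged submetric adopts the smaller value for each pair, so it never reports a distance larger than any individual arbiter's submetric does.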

When human fairness arbiters strongly disagree, we consider this to be a situation where discussion between the human fairness arbiters, and perhaps additional external parties, is needed. If we assume that our human fairness arbiters are all fair-minded individuals (i.e., without explicitly unpalatable biases), then our interpretation of significant disagreements should be careful to acknowledge that disagreements may stem from either (1) differences in domain expertise, (2) genuine lack of consensus in society’s view of similarity for the task, (3) human or system error or bias in display of or acquisition of data, (4) other potentially serious failure modalities.

We view the potential for such disagreements as a feature, not a bug, and would be concerned if any system gathering judgments from human fairness arbiters never encountered disagreement. In the case of (1) we anticipate that there may be cases where a particular human fairness arbiter is selected precisely because she represents a unique viewpoint or has domain experience with different groups of individuals. Ensembles of human fairness arbiters with expertise in different groups of individuals may find augmenting the procedures outlined in this work with more nuanced merge and discussion steps for reconciliation between human fairness arbiters to be beneficial. In the case of lack of consensus (2), procedural fairness or other interventions may be more desirable than fairness derived from outcomes. We discuss human or system bias in Section 8.5.

A significant benefit of disagreement with a proposed submetric or individual query is that these disagreements represent specific, well-articulated cases rather than hypothetical or meta-disagreements. Our hope is that the discussion of specific cases will be more likely to result in agreement, either in the outcome or in the choice of an alternate procedure, than hypothetical cases or group-level statistics.

8.4 Selection of human fairness arbiters

Conspicuously missing from our discussion of human fairness arbiters is guidance on whom to select to be a human fairness arbiter. Our most basic requirement is that a human fairness arbiter be a “fair minded individual,” but practically speaking, this gives little indication of selection criteria. That being said, the selection of human fairness arbiters is likely to be critically important to the acceptance of any submetric produced. Our position is that selection of human fairness arbiters is a question which must be resolved at a philosophical, policy and social level. We can foresee many questions related to arbiter selection. For example, should historically disadvantaged groups be given the choice of some number of the human fairness arbiters? Is there some minimum qualification or “bias test” one must pass to be considered? Resolving, or even attempting to fully articulate, such questions is well outside the scope of this work, and we anticipate that it is a significant area for future cross-disciplinary inquiry.

However, we are optimistic that selecting a group of human fairness arbiters is possible, because the learning process permits changing the set of human fairness arbiters or the merge strategy over time without “throwing away” past effort. Consider learning a separate submetric for each human fairness arbiter and merging these submetrics (through $\mathsf{maxmerge}$, $\mathsf{minmerge}$, or any other more nuanced merging strategy). Adding or removing a human fairness arbiter is not wholly destructive to the existing submetric, although this strategy may preclude parallelizing relative distance queries. Loosely speaking, we may not be able to give good guidance up front about who should be a human fairness arbiter, but we can produce submetrics in a way that adding or removing an arbiter from the set is straightforward, allowing the metric to evolve as our understanding or opinion of who should be in the set of human fairness arbiters and how their judgments should be combined evolves. This replacement strategy may also help in cases where opinions shift gradually over time, and older arbiter submetrics may be swapped out for newer judgments to reflect shifting attitudes.
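The per-arbiter strategy described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes that $\mathsf{maxmerge}$ takes the pointwise maximum of the arbiters' submetrics and $\mathsf{minmerge}$ the pointwise minimum, and the names `from_table`, `pair`, `per_arbiter` and all distance values are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable

Submetric = Callable[[Hashable, Hashable], float]


def maxmerge(submetrics: Iterable[Submetric]) -> Submetric:
    """Pointwise maximum of the given submetrics (assumed definition)."""
    subs = list(submetrics)
    return lambda x, y: max(d(x, y) for d in subs)


def minmerge(submetrics: Iterable[Submetric]) -> Submetric:
    """Pointwise minimum of the given submetrics (assumed definition)."""
    subs = list(submetrics)
    return lambda x, y: min(d(x, y) for d in subs)


def pair(x: Hashable, y: Hashable) -> frozenset:
    """Unordered pair used as a lookup key, so that d(x, y) == d(y, x)."""
    return frozenset((x, y))


def from_table(table: Dict[frozenset, float]) -> Submetric:
    """Build a symmetric submetric from a table of pairwise distances."""
    def d(x: Hashable, y: Hashable) -> float:
        return 0.0 if x == y else table[pair(x, y)]
    return d


# Hypothetical submetrics learned separately for three arbiters
# over individuals "u", "v" and "w".
per_arbiter: Dict[str, Submetric] = {
    "arbiter_1": from_table({pair("u", "v"): 0.30, pair("u", "w"): 0.10, pair("v", "w"): 0.20}),
    "arbiter_2": from_table({pair("u", "v"): 0.25, pair("u", "w"): 0.15, pair("v", "w"): 0.20}),
    "arbiter_3": from_table({pair("u", "v"): 0.28, pair("u", "w"): 0.05, pair("v", "w"): 0.22}),
}

merged_max = maxmerge(per_arbiter.values())
merged_min = minmerge(per_arbiter.values())

# Removing an arbiter only requires re-merging the remaining submetrics;
# the other arbiters' learned submetrics are reused unchanged.
without_arbiter_2 = maxmerge(
    d for name, d in per_arbiter.items() if name != "arbiter_2"
)
```

Because each arbiter's submetric is stored separately, changing the arbiter set or the merge rule only touches the final merge step, which is the property the paragraph above relies on.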

8.5 Query process and interface design

The design and process implementation for the interactions with human fairness arbiters is a significant area for future work. Problems of anchoring, particularly if many individuals are compared to the same representative, along with other issues with human judgment, will be a significant consideration in system design [17]. Alternative query types could be explored, or alternative presentation of queries could be made to improve the consistency of answers or try to counteract implicit biases.

Of particular concern with the design of the interface is how information is presented and whether the presentation will allow or encourage implicit biases to creep into judgments. It is likely impossible to remove all signal for sensitive attributes like race or gender from the presentation of information to the judge. Indeed, there are many cases where the inclusion of sensitive information is critical to evaluating fairness. One possible way to detect and correct implicit bias would be to explicitly ask the human fairness arbiter if they believe a sensitive feature should impact a particular judgment. If they respond that it should not, then the system could spot check by asking other arbiters to evaluate the same query with as much sensitive information stripped out or changed to an alternative as possible. If the evaluations of the other human fairness arbiters indicate that removing or changing the sensitive information resulted in different judgments, then additional care could be taken to reconcile the sensitive-attribute-blind responses. This is by no means a complete solution to removing implicit bias, but we think that exploring how information is presented and in particular comparing judgments based on differing information will be critical to gathering consistent and consistently fair judgments from the human fairness arbiters.
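The spot-check idea above could be prototyped roughly as follows. This is only a sketch under assumptions of our own: judgments are real-valued distances, the blind answers are summarized by their median, and a disagreement is flagged when it exceeds a hypothetical `tolerance`. The function name `needs_reconciliation` is illustrative and does not come from the paper.

```python
from statistics import median
from typing import Sequence


def needs_reconciliation(
    original_answer: float,
    blind_answers: Sequence[float],
    tolerance: float,
) -> bool:
    """Flag a query whose sensitive-attribute-blind answers differ from the original.

    original_answer: distance reported by the arbiter who saw the full record
        and stated that the sensitive feature should not affect the judgment.
    blind_answers: distances reported by other arbiters for the same query,
        with sensitive information stripped out or changed.
    tolerance: largest gap treated as ordinary arbiter noise (an assumption).
    """
    if not blind_answers:
        raise ValueError("at least one blind answer is required")
    return abs(original_answer - median(blind_answers)) > tolerance


# Hypothetical example: the blind answers cluster well above the original one,
# so the query is flagged for reconciliation.
flagged = needs_reconciliation(0.20, [0.60, 0.55, 0.70], tolerance=0.10)

# Here the blind answers agree with the original up to noise, so nothing is flagged.
not_flagged = needs_reconciliation(0.20, [0.25, 0.18], tolerance=0.10)
```

The median is one of several reasonable summaries; a system that cares about any single dissenting arbiter might compare against the maximum gap instead.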

Any judgments human fairness arbiters make based on the information presented to them will be just that: based on the information presented to them. In many cases, we might want to allow the human fairness arbiter to gather or request additional information if it is important to their judgment. For example, a human fairness arbiter evaluating a college application might see that a student took a year away from school. She may determine that additional information is needed to make any meaningful comparisons, because a year away from school for medical or family reasons is very different than a year’s suspension. Building in a way for human fairness arbiters to gather more information, and document the information they find for any later evaluations to consider, is likely to be expensive but may also be necessary to produce valid judgments.

8.6 Arbiter agreement with submetrics

Our initial assumption might be that a set of human fairness arbiters will agree with a submetric learned based on their judgments, modulo error parameters. In the case of real-valued distance queries, the procedures outlined in this work will result in submetrics which underestimate real-valued distances with small error with high probability. However, with respect to relative distances, agreement is not guaranteed. For example, all of the human fairness arbiters may agree that $a$ is more similar to $b$ than to $c$, but the submetric may consider $a$ more similar to $c$ than to $b$ (while still maintaining smaller real-valued distances) depending on the choice of representative elements. (This is illustrated in Figure 1, in which choosing the representative $r1$ preserves the relative distance comparison between points $2$, $4$ and $5$, where $2$ is closer to $4$ than to $5$, but choosing $r2$ does not.) If all original distances are maintained to a sufficient degree, then relative distances will also be preserved. However, when trade-offs between distance preservation and human fairness arbiter cost must be made, there is the potential to violate relative distance judgments made by the human fairness arbiter.
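A small numerical sketch can make this concrete. It assumes that the submetric induced by a single representative $r$ is $|D(r,x) - D(r,y)|$, which underestimates $D$ by the triangle inequality, and it uses Euclidean distance in the plane as a stand-in for the arbiter's metric. The points `a`, `b`, `c`, `r1` and `r2` are hypothetical and are not the points of Figure 1.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def true_distance(x: Point, y: Point) -> float:
    """Stand-in for the arbiter's metric D (Euclidean, for illustration only)."""
    return math.dist(x, y)


def representative_submetric(r: Point, x: Point, y: Point) -> float:
    """Distance implied by representative r: |D(r, x) - D(r, y)| (assumed form)."""
    return abs(true_distance(r, x) - true_distance(r, y))


# Under the true metric, a is closer to b (distance 1) than to c (distance 3).
a, b, c = (0.0, 0.0), (1.0, 0.0), (0.0, 3.0)

# r1 lies below a on the line through a and c, so the a-versus-c distance
# is fully preserved and the comparison survives.
r1 = (0.0, -5.0)

# r2 is equidistant from a and c, so the submetric collapses their distance
# to zero and the comparison is reversed.
r2 = (5.0, 1.5)

for name, r in (("r1", r1), ("r2", r2)):
    d_ab = representative_submetric(r, a, b)
    d_ac = representative_submetric(r, a, c)
    order = "preserved" if d_ab < d_ac else "reversed"
    print(f"{name}: d(a,b)={d_ab:.3f}, d(a,c)={d_ac:.3f}, comparison {order}")
```

In both cases every reported distance is at most the true one, so neither choice of representative overestimates a distance; only the relative ordering differs.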

Although this does not technically violate the Individual Fairness definition of [4], there may be many scenarios where treating dissimilar individuals dissimilarly is just as important as treating similar individuals similarly. For example, in the case of setting taxation rates for individuals, one would likely consider treating the wealthiest and poorest individuals the same but treating the middle class differently to be unfair. Augmenting the existing Individual Fairness definition with the requirement that dissimilar individuals be treated dissimilarly is not entirely straightforward. In particular, there is no binary classifier which will maintain relative distances between three equally distant individuals. However, given the uncomfortable idea that the human fairness arbiters may not agree with the relative distances produced, it seems worthwhile to consider whether, or in which cases, it is desirable or possible to strengthen the Individual Fairness definition of [4] to capture relative distance constraints.
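One way to see the impossibility claim for three equally distant individuals is the following short argument. It is a sketch under an assumption made here for illustration: the binary classifier assigns each individual $x$ a probability $p_x \in [0,1]$ of the positive outcome, and the distance between the treatment of $x$ and $y$ is $|p_x - p_y|$. Label the three individuals so that $p_a \le p_b \le p_c$. Then

$$|p_c - p_a| = |p_c - p_b| + |p_b - p_a|.$$

If all three pairwise outcome distances are equal to some value $d$, this gives $d = 2d$, and hence $d = 0$. The only way to keep the three outcome distances equal is therefore to treat all three individuals identically, which does not treat dissimilar individuals dissimilarly.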

8.7 When arbiters agree but learning is hard

An important scenario to consider is the case in which the human fairness arbiters agree on all or most queries, but our usual learning procedures fail to produce a submetric which generalizes to unseen samples. Again, we view this failure as a feature rather than a bug, as it may indicate either (1) that there are alternative learning strategies we should try or (2) that the metric is complex enough that human oversight is always needed to make fair decisions. Our model of the arbiter evaluating distances over an unobservable set of relevant attributes is very similar to the “construct space” of Friedler et al. [7]. Friedler et al. put forth a formalization of fairness in which the goal is to achieve fairness over an unobservable construct space which captures the relevant attributes (e.g., grit, talent, work ethic, etc.), but our information is constrained to the “observed” space. In some sense, we take the view that the arbiter is acting as a translator between these unobserved, difficult to articulate attributes and the observed features. As such, there isn’t always a guarantee that the observed features available for classification will be sufficient to capture the nuance in the arbiters’ judgments. In some sense, replacing direct human judgments with automated decisions in sensitive settings should be viewed as a privilege and not a right. Sensitive settings in which human fairness arbiters agree, but our system cannot generalize in a way that they would agree with, should be subject to significant scrutiny, and the replacement of human judgment with automated decision-making should not be taken as given.

8.8 Comparison of submetrics

In this work, we have been somewhat unsophisticated in our comparisons of alternative submetrics beyond the basic worst-case additive error measure and nontriviality.

We primarily consider absolute additive error. However, practical evaluation of error may be based entirely on how much an adversary could “get away with” using a submetric to derive a classifier. Suppose we are concerned that an adversary will discriminate against a large subset of individuals $V\subseteq U$ and derives utility proportional to the difference in distances between pairs of elements $(u,v)$ where $v\in V$ and $u\in U\setminus V$. A large number of small errors would allow the adversary to pull all or most members of $V$ further away from their $U\setminus V$ counterparts. Alternatively, a smaller number of very large errors, so long as they are not concentrated on pairs containing a small group of individuals in $V$, will be harder for the adversary to take advantage of, because there are many accurate distances making it difficult to “move” elements of $V$ relative to their close counterparts in $U\setminus V$. We expect that many of the error type questions we would pose for metric learning have a close analogy to the problem of selecting comparison sets in [12].

From a more constructive perspective, we might also find it difficult to compare nontriviality parameters absent understanding of how the submetric will be used. For example, a submetric which preserves distances very well between unqualified individuals but does little to distinguish qualified individuals may not be terribly helpful in deciding between qualified individuals. Developing a more nuanced model for evaluation of submetrics, both from the perspective of abuse and constructing distinguishing classifiers, will be critical to providing good guarantees on submetric use in practice.

Bibliography


[1] Aurélien Bellet, Amaury Habrard and Marc Sebban “A Survey on Metric Learning for Feature Vectors and Structured Data” In CoRR abs/1306.6709, 2013 arXiv: http://arxiv.org/abs/1306.6709
[2] Alexandra Chouldechova “Fair prediction with disparate impact: A study of bias in recidivism prediction instruments” In arXiv preprint arXiv:1703.00056, 2017
[3] Sanjoy Dasgupta and Michael Luby “Learning from partial correction” In arXiv preprint arXiv:1705.08076, 2017
[4] Cynthia Dwork et al. “Fairness Through Awareness” In CoRR abs/1104.3913, 2011 URL: http://arxiv.org/abs/1104.3913
[5] Cynthia Dwork et al. “Learning from Outcomes: Evidence-Consistent Rankings” Manuscript submitted for publication, 2019
[6] Cynthia Dwork, Ravi Kumar, Moni Naor and Dandapani Sivakumar “Rank aggregation methods for the web” In Proceedings of the 10th international conference on World Wide Web, 2001, pp. 613–622 ACM
[7] Sorelle A Friedler, Carlos Scheidegger and Suresh Venkatasubramanian “On the (im)possibility of fairness” In arXiv preprint arXiv:1609.07236, 2016
[8] Andrea Frome, Yoram Singer, Fei Sha and Jitendra Malik “Learning globally-consistent local distance functions for shape-based image retrieval and classification” In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, 2007, pp. 1–8 IEEE

