A fast algorithm for maximal propensity score matching
Pavel S. Ruzankin

TL;DR
This paper introduces a fast algorithm for maximal propensity score matching that efficiently finds the largest set of matched pairs with caliper constraints, improving existing matching techniques in terms of speed and optimality.
Contribution
The paper presents a novel algorithm for maximal propensity score matching that handles variable calipers and 1-to-n matching efficiently, advancing current matching methods.
Findings
Matching with the new algorithm requires O(N) operations for ordered data.
The algorithm can handle variable width calipers as Lipschitz functions.
It improves upon greedy nearest neighbor matching in speed and optimality.
This work was supported by the Russian Foundation for Basic Research, under grants 18-01-00074 and 19-07-00397, and by the Program for fundamental scientific research of the SB RAS, No. I.1.3., under grant 0314-2019-0008.
Author email: [email protected]
Abstract
We present a new algorithm which detects the maximal possible number of matched disjoint pairs satisfying a given caliper when a bipartite matching is done with respect to a scalar index (e.g., propensity score), and constructs a corresponding matching. Variable width calipers are compatible with the technique, provided that the width of the caliper is a Lipschitz function of the index. If the observations are ordered with respect to the index then the matching needs $O(N)$ operations, where $N$ is the total number of subjects to be matched. The case of 1-to-$n$ matching is also considered.
We also offer a new fast algorithm for optimal complete one-to-one matching on a scalar index when the treatment and control groups are of the same size. This allows us to improve greedy nearest neighbor matching on a scalar index.
Keywords: propensity score matching, nearest neighbor matching, matching with caliper, variable width caliper.
Introduction
Propensity score matching (PSM) is a statistical method widely used in medicine, biology, and sociology. The method is used to reduce bias in inference due to confounding variables when random allocation of subjects to the comparison groups is not possible. It can be used instead of the multivariable regression approach or in conjunction with it. The method was introduced by Rosenbaum and Rubin (1983) and is based on the Neyman–Rubin model of causal inference (see Rubin, 1974). In the model, we have the observable treatment indicators $Z_{i}$ and the observable outcomes $R_{i}$, $i=1,\dots,N$, where $N$ is the number of subjects under study. $Z_{i}=1$ if the $i$-th subject belongs to the treatment group and $Z_{i}=0$ if the $i$-th subject is in the control group. Following the common terminology, we call the two groups which we want to compare the treatment group and the control group. The Neyman–Rubin model is often called counterfactual because it contains the unobservable variables $R_{i}^{0}$ and $R_{i}^{1}$, which denote the outcomes for the $i$-th subject had the subject been allocated to the control group or to the treatment group, respectively. We have
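$$R_{i}=Z_{i}R_{i}^{1}+(1-Z_{i})R_{i}^{0}.$$
The effect of treatment is defined as
$$\mathbf{E}(R_{i}^{1}-R_{i}^{0}).$$
We also observe the vectors $W_{i}$ of background variables. In this model the random vectors $(Z_{i},R_{i},R_{i}^{0},R_{i}^{1},W_{i})$ are independent and identically distributed, but the components in each vector are mutually dependent. Rosenbaum and Rubin (1983) defined the propensity score as
$$e(w)=\mathbf{P}(Z=1\mid W=w)$$
and proved that
$$W\perp Z\mid e(W) \qquad (1)$$
and
$$(R^{0},R^{1})\perp Z\mid e(W)$$
if
$$(R^{0},R^{1})\perp Z\mid W, \qquad (2)$$
where the symbol $\perp$ denotes independence, and $(Z,R,R^{0},R^{1},W)$ is a vector with the same distribution as that of all $(Z_{i},R_{i},R_{i}^{0},R_{i}^{1},W_{i})$. (Condition (2) is often called the condition of no unmeasured confounders.) That means that, for a fixed $e(W)$, the random vectors $W$ and $(R^{0},R^{1})$ are equally distributed for $Z=0$ and for $Z=1$. That implies that
$$\mathbf{E}(R^{1}-R^{0})=\int\bigl(\mathbf{E}(R\mid e(W)=s,\,Z=1)-\mathbf{E}(R\mid e(W)=s,\,Z=0)\bigr)\,\mathbf{P}(e(W)\in ds). \qquad (3)$$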
To estimate the last integral, we need the observable variables only. The propensity score is usually estimated with logistic regression. Relation (3) is what PSM is based on. PSM consists in matching pairs of treated and control subjects such that the treated and the control subject in each pair have close values of the propensity score. Based on relation (1), we suppose that the matched subjects have close distributions of the background variables in the treatment and control groups. Therefore, it is possible to apply, e.g., statistical tests to the matched observations.
The majority of studies that employ PSM use greedy nearest neighbor matching (GNNM) algorithms without or with caliper restriction (see Austin, 2011), the caliper width being constant. The caliper width is the maximal allowed within-pair score distance for matching. GNNM means that we try to match each treated subject to the nearest (in terms of the score distance) yet unmatched control subject. There also exist optimal matching algorithms which minimize average or maximal within-pair score distance, but their main drawback is their time complexity and complexity in the sense of implementation. There seem to be no widely available packages implementing optimal matching under a caliper which can be easily used.
In the present paper we mainly consider matching with caliper, i.e., matching with limiting maximal within-pair score distance.
In Sec. 1 we introduce the main algorithm which matches the maximal possible number of subjects in one-to-one matching under a caliper. Besides, we present a new algorithm for optimal complete one-to-one matching, which allows us to improve GNNM. In Sec. 2 we generalize the main algorithm to one-to-many matching and describe how GNNM can be improved for one-to-many matching. The simulation comparison of the new algorithms with GNNM is presented in Sec. 3. Sec. 4 contains the proofs of optimality for the new algorithms. Sec. 5 considers matching with discontinuous caliper width. Sec. 6 contains some proofs for the complexity of GNNM.
1 One-to-one matching
We consider matching disjoint pairs of subjects from two groups, which we will call, using common terminology, treated and control subjects. In other words, a control subject can be matched to no more than one treated subject and vice versa. We will consider only one-dimensional distance, such as in propensity score matching, when the distance between subjects is the distance between points on the real line corresponding to these subjects, assuming each subject is somehow projected to a unique point on the real line. We will call these points the scores of the subjects. However no assumptions are made on how these points are related to the subjects.
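Let $X_{i}$, $i=1,\dots,N_{X}$, and $Y_{j}$, $j=1,\dots,N_{Y}$, be the scores of treated and control subjects, $N_{X}$ and $N_{Y}$ being the total numbers of treated and control subjects, respectively. $X_{i}$ and $Y_{j}$ may take any values on the real line, not necessarily on the interval $[0,1]$. Thus, the algorithms below can be used, e.g., for matching by the logits of propensity scores. Let $N:=N_{X}+N_{Y}$. Let $c(x,y)\geq 0$ be the caliper for our matching, i.e., we match only pairs $(X_{i},Y_{j})$ such that $|X_{i}-Y_{j}|\leq c(X_{i},Y_{j})$. We assume that the caliper is Lipschitz in both arguments with constants 1, i.e., for all $x$, $x_{1}$, $x_{2}$, $y$, $y_{1}$, $y_{2}$,
$$|c(x_{1},y)-c(x_{2},y)|\leq|x_{1}-x_{2}|, \qquad (4)$$
$$|c(x,y_{1})-c(x,y_{2})|\leq|y_{1}-y_{2}|. \qquad (5)$$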
We will consider discontinuous calipers in Sec. 5.
For a discussion of situations where caliper constraints are important for balancing the matched groups see Rosenbaum (2017) and Austin (2011). A variable caliper width can be useful when, in some domains of values of the propensity score, there are significantly more controls per treated subject than in other domains (e.g., see examples in Pimentel et al., 2015b). In such cases we can vary the caliper width depending on the density of the number of controls per treated subject.
1.1 The main algorithm
A natural problem is to find the maximal number of pairs that can be matched under a given caliper. Though this problem can be solved employing network flow optimization algorithms (e.g., see Hansen and Klopfer, 2006), the known algorithms have complexity not less than if no assumptions on sparsity are made. This approach to matching problems was used, e.g., by Rosenbaum (2012, 2017) and Pimentel et al (2015a, 2015b).
Our main goal is to introduce a fast algorithm detecting the maximal possible number of matched pairs and constructing a corresponding matching. Our algorithm has complexity $O(N)$ when both the treated and control subjects are sorted with respect to the score:
$$X_{1}\leq X_{2}\leq\dots\leq X_{N_{X}},\qquad Y_{1}\leq Y_{2}\leq\dots\leq Y_{N_{Y}}. \qquad (6)$$
Thus, once we have sorted the observations (which takes $O(N\log N)$ operations), we can reasonably quickly solve the inverse problem of finding the minimal constant caliper suitable for matching $p$ percent of the data for a given $p$ (e.g., we may want to match at least a prescribed share of the data and wish to find out which minimal caliper would be sufficient). For instance, if the score lies on the segment $[0,1]$ then $k$ runs of the algorithm ($O(kN)$ operations) yield the accuracy of $2^{-k}$ for the minimal caliper. Indeed, first we can run the algorithm for the caliper $1/2$, and if it matches not less than $p$ percent of the data then the needed caliper lies on the segment $[0,1/2]$ and, at the next step, we run the algorithm for $1/4$; otherwise the needed caliper lies on the interval $(1/2,1]$ and we take $3/4$ for the next step. Repeating the steps, sequentially halving the interval, at step $k$ we obtain the interval of length $2^{-k}$ containing the minimal constant caliper suitable for matching $p$ percent of the data.
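The halving procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the author's code: the names `count_matched` and `minimal_caliper`, the use of a target number of pairs in place of a percentage, and the default of 30 halving steps are all hypothetical choices. The scores are assumed to be sorted and to lie in $[0,1]$, so that the caliper 1 is certainly sufficient whenever the target is attainable.

```python
def count_matched(x, y, c):
    """Number of pairs found by the sequential walk of Algorithm A for a
    constant caliper c; x and y must be sorted in ascending order."""
    i = j = m = 0
    while i < len(x) and j < len(y):
        if abs(x[i] - y[j]) <= c:
            m += 1
            i += 1
            j += 1
        elif x[i] < y[j]:
            i += 1
        else:
            j += 1
    return m


def minimal_caliper(x, y, target_pairs, lo=0.0, hi=1.0, steps=30):
    """Bisection for the smallest constant caliper that matches at least
    target_pairs pairs. Assumes (hypothetically) that hi is sufficient."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if count_matched(x, y, mid) >= target_pairs:
            hi = mid  # mid is sufficient: the answer lies in [lo, mid]
        else:
            lo = mid  # mid is too narrow: the answer lies in (mid, hi]
    return hi


treated = [0.0, 0.5]
controls = [0.25, 1.0]
print(minimal_caliper(treated, controls, target_pairs=2))
```

Each halving step costs one linear pass, so `steps` passes narrow the interval to length `(hi - lo) / 2**steps`.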
From now on we assume that relation (6) holds, unless nearest neighbor matching is considered.
Let us now introduce the main algorithm. The variable $M$ will contain the current number of matched pairs. After the algorithm finishes, $M$ contains the maximal possible number of matched pairs. $A_{m}$ and $B_{m}$ store the index numbers of the treated and control subject, respectively, in the $m$-th matched pair.
We present the algorithm as the following pseudocode:
Algorithm A.
$M:=0$, $i:=1$, $j:=1$
while ($i\leq N_{X}$ and $j\leq N_{Y}$)
    if ($|X_{i}-Y_{j}|\leq c(X_{i},Y_{j})$)
        $M:=M+1$
        $A_{M}:=i$
        $B_{M}:=j$
        $i:=i+1$
        $j:=j+1$
    else
        if ($X_{i}<Y_{j}$)
            $i:=i+1$
        else
            $j:=j+1$
        end if
    end if
end while
As we see, the algorithm just walks through all the observations and successively collects all feasible pairs.
The algorithm requires $O(N)$ operations since in each iteration of the while-loop the variable $i$ or $j$ or both are increased. Certainly, to apply the algorithm, we must first sort the observations with respect to the score, which requires $O(N\log N)$ operations.
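For readers who prefer executable code, the walk of Algorithm A can be sketched in Python as below. The function name `maximal_matching`, the 0-based indices, and passing the caliper as a callable are assumptions of this sketch. The inputs must already be sorted, and the caliper is assumed to satisfy the Lipschitz conditions of the text; the code does not check either.

```python
def maximal_matching(x, y, caliper):
    """Sequential walk of Algorithm A.

    x, y    -- treated and control scores, each sorted in ascending order
    caliper -- function (x_i, y_j) -> allowed distance for that pair
    Returns a list of (i, j) index pairs (0-based).
    """
    pairs = []
    i = j = 0
    while i < len(x) and j < len(y):
        if abs(x[i] - y[j]) <= caliper(x[i], y[j]):
            pairs.append((i, j))  # feasible pair: take it and move both
            i += 1
            j += 1
        elif x[i] < y[j]:
            i += 1  # treated score too small for every remaining control
        else:
            j += 1  # control score too small for every remaining treated
    return pairs


treated = [0.1, 0.2, 0.5]
controls = [0.15, 0.45, 0.9]
print(maximal_matching(treated, controls, lambda a, b: 0.1))
```

In this small example the second treated subject (score 0.2) has no control within 0.1 and is skipped, so two pairs are returned.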
Theorem 1
Algorithm A produces the maximal possible number of matched pairs under a caliper satisfying (4)–(5).
This theorem is proved in Subsection 4.2. Some simulations for the algorithm are presented in Subsection 3.1.
1.2 Optimal complete matching and improving nearest neighbor matching
We will call a one-to-one matching (without replacement) complete if the sizes of the treatment and control groups coincide and all the subjects are matched. A caliper restriction may prevent some subjects from being matched, therefore we will consider only complete matchings without caliper.
Colannino et al. (2007) used sorting of the observations for complete one-to-one matching on a scalar index (without applying a caliper). The complexity of their algorithm is $O(N)$ after the observations are ordered with respect to the score. That algorithm minimizes the cost of matching
$$\sum|X_{i}-Y_{j}|,$$
where the sum is taken over all matched pairs $(X_{i},Y_{j})$, which is equivalent to minimizing the average within-pair score distance.
We offer the following new algorithm of the same complexity, which minimizes that cost along with some other costs, including maximal within-pair score distance.
Algorithm B. Let $N_{X}=N_{Y}$ and the observations be sorted as in (6). Match $X_{i}$ to $Y_{i}$ for all $i$.
Theorem 2
Let $N_{X}=N_{Y}$ and the observations be sorted as in (6). Then, among all complete matchings, Algorithm B minimizes average within-pair score distance as well as the following cost functions: maximal within-pair score distance
[TABLE]
[TABLE]
and
[TABLE]
where is a nondecreasing nonnegative continuous function, is a real number. The maximum and the sums are taken over all matched pairs .
The theorem is proved in Subsection 4.1.
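Algorithm B and the claim of Theorem 2 for the total and the maximal within-pair distance can be checked on a tiny example with the following Python sketch. The function names and the brute-force comparison over all permutations are illustrative assumptions, and the brute force is only feasible for very small groups.

```python
from itertools import permutations


def complete_matching(x, y):
    """Algorithm B: pair the k-th smallest treated score with the k-th
    smallest control score. Returns (treated_index, control_index) pairs."""
    if len(x) != len(y):
        raise ValueError("complete matching needs groups of equal size")
    ix = sorted(range(len(x)), key=x.__getitem__)
    iy = sorted(range(len(y)), key=y.__getitem__)
    return list(zip(ix, iy))


def brute_force_best(x, y, cost):
    """Smallest value of cost(list of within-pair distances) over all
    complete matchings; exponential, for checking small examples only."""
    return min(
        cost([abs(a - b) for a, b in zip(x, perm)])
        for perm in permutations(y)
    )


x = [9, 1, 4]
y = [3, 8, 2]
pairs = complete_matching(x, y)
distances = [abs(x[i] - y[j]) for i, j in pairs]
print(pairs, sum(distances), max(distances))
print(brute_force_best(x, y, sum), brute_force_best(x, y, max))
```

Integer scores are used here only so that the comparison with the brute-force optimum is exact.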
A curious result is that the “opposite” matching yields the “counteroptimal” cost.
Theorem 3
Let $N_{X}=N_{Y}$ and the observations be sorted as in (6). Match $X_{i}$ to $Y_{N_{X}+1-i}$ for all $i$. Then, among all complete matchings, this matching maximizes the costs (7) and (8); in particular, the average within-pair score distance is maximized.
The theorem is proved in Subsection 4.1.
That shows that if one considers matching on a scalar index then the problem of optimal (but not complete) matching minimizing or maximizing (8) is essentially the problem of choosing the optimal subsets of the observations. After the subsets of treated and control subjects are chosen, it is sufficient just to order the observations.
Improving nearest neighbor matching. Let us apply Theorem 2 to a non-complete one-to-one matching on a scalar index, e.g., GNNM, under the caliper $c$ satisfying (4) and (5). Let $X'_{k}$ and $Y'_{k}$, $k=1,\dots,M$, be the ordered scores of the matched treated and control observations:
$$X'_{1}\leq X'_{2}\leq\dots\leq X'_{M},\qquad Y'_{1}\leq Y'_{2}\leq\dots\leq Y'_{M}. \qquad (10)$$
Then, by Theorem 1, rematching these (matched) observations with Algorithm A will produce, under the caliper $c$, the maximal possible number of pairs, which is $M$. Since Algorithm A goes sequentially through the ordered observations and has to use all of them, it will match the observations corresponding to $X'_{k}$ and $Y'_{k}$ for each $k$. This proves that matching the observations corresponding to $X'_{k}$ and $Y'_{k}$ for each $k$ obeys the caliper $c$.
Such rematching is optimal with respect to average and maximal within-pair distances by Theorem 2, and can improve the average and maximal within-pair distances, as is shown by simulations in Subsection 3.1.
In other words, to improve some matching, we can rearrange the pairs of matched observations by ordering the matched observations as in (10) and then matching the observations corresponding to $X'_{k}$ and $Y'_{k}$ for each $k$. Such rematching does not break the caliper restriction because of the optimality of Algorithm A.
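The rematching step amounts to two sorts and a zip. A minimal Python sketch follows; the function name `rematch` and the representation of a matching as a list of index pairs are assumptions of the example. The input matching is taken to be any one-to-one matching that obeys a caliper satisfying (4) and (5).

```python
def rematch(x, y, pairs):
    """Reorder an existing one-to-one matching: sort the matched treated
    and the matched control subjects by score, then pair them in order.

    x, y  -- scores of all treated and control subjects (any order)
    pairs -- list of (treated_index, control_index) from some matching
    """
    treated = sorted((i for i, _ in pairs), key=lambda i: x[i])
    controls = sorted((j for _, j in pairs), key=lambda j: y[j])
    return list(zip(treated, controls))


# A "crossing" matching such as a greedy pass could produce with caliper 8:
x = [30, 34]
y = [33, 26]
greedy_pairs = [(0, 0), (1, 1)]  # distances 3 and 8
better_pairs = rematch(x, y, greedy_pairs)
print(better_pairs, [abs(x[i] - y[j]) for i, j in better_pairs])
```

The same subjects stay matched; only the pairing changes, and in this example the largest within-pair distance drops from 8 to 4.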
Note also that GNNM with caliper has complexity similar to that of Algorithm A (see Sec. 6). If the observations are ordered as in (6) then sequential GNNM has complexity $O(N)$, while for unordered observations GNNM has complexity $O(N\log N)$.
2 1-to-$n$ matching
2.1 The main algorithm
Algorithm A can be modified for 1-to-$n$ matching. We assume that a treated subject is to be matched to no more than $n$ control subjects, and a control subject must not be matched to more than one treated subject. Some authors call this setting matching with a varying number of controls (e.g., see Pimentel et al., 2015b).
The following pseudocode uses the same variables as above. $D_{i}$ is the number of controls matched to the $i$-th treated subject. The variable $k$ corresponds to the current number of controls matched to the $i$-th treated subject.
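Algorithm C.
$M:=0$, $i:=1$, $j:=1$, $k:=0$
$D_{i}:=0$ for all $i=1,\dots,N_{X}$
while ($i\leq N_{X}$ and $j\leq N_{Y}$)
    if ($|X_{i}-Y_{j}|\leq c(X_{i},Y_{j})$)
        $k:=k+1$
        $M:=M+1$
        $A_{M}:=i$
        $B_{M}:=j$
        $D_{i}:=k$
        if ($k=n$)
            $k:=0$
            $i:=i+1$
        end if
        $j:=j+1$
    else
        if ($X_{i}<Y_{j}$)
            $k:=0$
            $i:=i+1$
        else
            $j:=j+1$
        end if
    end if
end while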
The complexity is still $O(N)$ and does not depend on $n$ since, as above, in each iteration of the while-loop the variable $i$ or $j$ or both are increased.
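A Python sketch of Algorithm C is given below. As before, the function name, the 0-based indices, and the callable caliper are assumptions made for the illustration; the inputs are assumed sorted and the caliper is assumed to satisfy (4) and (5).

```python
def maximal_one_to_n_matching(x, y, n, caliper):
    """Sequential walk of Algorithm C: each treated subject receives at
    most n controls, each control is used at most once.

    Returns (pairs, counts) where pairs is a list of (i, j) index pairs
    and counts[i] is the number of controls matched to treated i.
    """
    pairs = []
    counts = [0] * len(x)
    i = j = k = 0  # k: controls matched so far to the current treated
    while i < len(x) and j < len(y):
        if abs(x[i] - y[j]) <= caliper(x[i], y[j]):
            k += 1
            pairs.append((i, j))
            counts[i] = k
            if k == n:  # current treated subject is full
                k = 0
                i += 1
            j += 1
        elif x[i] < y[j]:
            k = 0
            i += 1
        else:
            j += 1
    return pairs, counts


treated = [10, 20]
controls = [9, 10, 11, 12, 19]
print(maximal_one_to_n_matching(treated, controls, 3, lambda a, b: 2))
```

In the example the first treated subject takes three controls and the fourth control (score 12) remains unmatched, because the first treated subject is already full.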
Theorem 4
Algorithm C maximizes the number of matched control subjects or, in other words, the number of matched pairs for 1-to-$n$ matching under a caliper satisfying (4)–(5).
The theorem is proved in Subsection 4.2. Note that the algorithm does not maximize the number of matched treated subjects. Some simulations for the algorithm are presented in Subsection 3.2.
2.2 Improving nearest neighbor matching
We can rematch matched observations to improve GNNM for 1-to-$n$ matching. We will consider the following GNNM scheme. The matching is done in $n$ passes. In each pass, we try to match each treated subject to only one nearest yet unmatched control. Such a scheme is aimed at increasing the number of matched treated subjects.
We can improve such matching by applying the argument of Subsection 1.2. For this, we consider each pass as a one-to-one matching and, in each pass, we rematch, as in (10), the observations matched by GNNM in this pass.
Such rematching does not alter the number of matched controls and the number of matched treated subjects. The simulation comparison of this algorithm with GNNM and Algorithm C is in Subsection 3.2.
3 Simulation comparison with nearest neighbor matching
3.1 Simulation for one-to-one matching
In this subsection we compare Algorithm A with one-to-one greedy nearest neighbor matching (GNNM) and GNNM followed by the optimal rematching described in Subsection 1.2. GNNM means that we first try to match the first treated subject to the nearest control subject, then the second treated subject, and so on. The matching is done without replacement. All three algorithms have similar complexities (see Sec. 6).
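For reference, the GNNM baseline can be written naively in a few lines of Python. This sketch is deliberately simple and quadratic in the number of subjects; it is not the faster implementation discussed in Sec. 6. The function name, the constant caliper, and the tie-breaking rule (the lowest control index wins) are assumptions of the example.

```python
def greedy_nn_matching(x, y, caliper):
    """Naive one-to-one greedy nearest neighbor matching without
    replacement under a constant caliper. Treated subjects are processed
    in the order given; ties go to the control with the lowest index."""
    free = set(range(len(y)))  # indices of still unmatched controls
    pairs = []
    for i, xi in enumerate(x):
        if not free:
            break
        j = min(free, key=lambda c: (abs(xi - y[c]), c))
        if abs(xi - y[j]) <= caliper:
            pairs.append((i, j))
            free.remove(j)
    return pairs


x = [30, 34]
y = [33, 26]
print(greedy_nn_matching(x, y, caliper=8))
print(greedy_nn_matching(x, y, caliper=7))
```

With caliper 8 the first treated subject takes the closer control, which leaves the second treated subject with a pair at distance 8; with caliper 7 that second pair is no longer allowed.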
We take to be i.i.d. random variables on the interval with the density and to be i.i.d. random variables on the interval with the density . We use the calipers for Algorithm A and for GNNM. Each of the following graphs is constructed by 10,000 simulation runs. In each simulation, the treatment group and control group are of the same size of 100 or 1000.
First we try to compare the numbers of matched pairs for the algorithms in the case when . Fig. 1 depicts the empirical cumulative distribution functions for the numbers of matched pairs. The graphs are plotted for Algorithm A (solid lines) and GNNM (dashed lines). We see that under these settings Algorithm A matches more pairs than GNNM. It makes little sense to compare algorithms that match significantly different numbers of pairs. If one algorithm is allowed to match a smaller number of pairs compared to another algorithm then the former algorithm can easily produce lesser maximal and average distances between the scores of paired observations. On the other hand, lesser numbers of pairs lead to less significant p-values and powers for statistical tests applied to matched observations.
That is why, for the next graphs, we choose some to make the numbers of pairs matched by Algorithm A and GNNM be similar. Fig. 2–4 depict the empirical cumulative distribution functions for the number of matched pairs, the maximal distance between the scores of paired observations, and the average distance between the scores of paired observations, respectively. The graphs are plotted for Algorithm A (solid lines), GNNM (dashed lines) and GNNM with rematching (10) (dotted lines).
The simulation results for the case are summarized in the following table, where the means for the values plotted on Fig. 2–4 are presented:
| Mean of: | number of matched pairs | maximal within-pair score distance | average within-pair score distance |
|---|---|---|---|
| Algorithm A | 45.5 | 0.0153 | 0.0086 |
| GNNM with rematching (10) | 45.5 | 0.0181 | 0.0060 |
| GNNM | 45.5 | 0.0188 | 0.0063 |
The next table presents the means for the case :
| Mean of: | number of matched pairs | maximal within-pair score distance | average within-pair score distance |
|---|---|---|---|
| Algorithm A | 496.8 | 0.0065 | 0.0047 |
| GNNM with rematching (10) | 496.6 | 0.0151 | 0.0020 |
| GNNM | 496.6 | 0.0196 | 0.0024 |
We see that if we want to minimize the average distance between the scores of paired observations then we may choose GNNM with rematching (10). But if we want to minimize the maximal distance between the scores in pairs then we may prefer Algorithm A.
The other argument for choosing Algorithm A may be its complexity. If we have to match “big data”, the complexity may be of more importance than the accuracy of matching.
For explicit practical recommendations an extensive simulation comparison may be needed, like that in Austin (2014).
3.2 Simulation for 1-to-3 matching
In this subsection we compare Algorithm C with GNNM and GNNM with rematching (10) for 1-to-3 matching. GNNM and GNNM with rematching (10) are accomplished according to the scheme described in Subsection 2.2.
As in the above simulations, are i.i.d. random variables on the interval with the density and are i.i.d. random variables on the interval with the density . We use the caliper for Algorithm C and for GNNM. The estimates below are computed by 10,000 simulation runs. In each simulation, the treatment group is of the size or , and the control group is of the size .
The unweighted average within-pair score distance is computed as the average distance among all matched pairs. For computing the weighted average within-pair distance, the weight of each pair is the inverse of the number of controls the current treated subject is matched to. So the sum of the weights of all pairs for a matching is the number of matched treated subjects.
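The two averages can be computed as in the following Python sketch. The function name and the pair representation are hypothetical, and normalizing the weighted sum by the sum of the weights (that is, by the number of matched treated subjects) is an assumption of this example rather than something stated explicitly in the text.

```python
from collections import Counter


def average_distances(x, y, pairs):
    """Return (weighted, unweighted) average within-pair score distance.

    pairs -- list of (treated_index, control_index); a treated index may
             appear several times in one-to-many matching.
    The weight of a pair is 1 / (number of controls of its treated subject).
    """
    distances = [abs(x[i] - y[j]) for i, j in pairs]
    controls_per_treated = Counter(i for i, _ in pairs)
    weights = [1 / controls_per_treated[i] for i, _ in pairs]
    unweighted = sum(distances) / len(pairs)
    # sum(weights) equals the number of matched treated subjects
    weighted = sum(w * d for w, d in zip(weights, distances)) / sum(weights)
    return weighted, unweighted


x = [10, 20]
y = [9, 10, 11, 12, 19]
pairs = [(0, 0), (0, 1), (0, 2), (1, 4)]
print(average_distances(x, y, pairs))
```

The weighted version averages first within each treated subject and then across treated subjects, so a treated subject with three controls counts as much as one with a single control.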
The following table summarizes the simulation results for the case , , :
| Mean of: | number of matched pairs | maximal within-pair score distance | weighted average within-pair score distance | unweighted average within-pair score distance | number of matched treated subjects |
|---|---|---|---|---|---|
| Algorithm C | 141.6 | 0.01462 | 0.00298 | 0.00827 | 58.8 |
| GNNM with rematching (10) | 141.6 | 0.01888 | 0.00267 | 0.00491 | 71.6 |
| GNNM | 141.6 | 0.01941 | 0.00282 | 0.00507 | 71.6 |
The following table summarizes the simulation results for the case , , :
| Mean of: | number of matched pairs | maximal within-pair score distance | weighted average within-pair score distance | unweighted average within-pair score distance | number of matched treated subjects |
|---|---|---|---|---|---|
| Algorithm C | 1494.2 | 0.00550 | 0.00146 | 0.00410 | 604.5 |
| GNNM with rematching (10) | 1494.1 | 0.01797 | 0.00087 | 0.00152 | 748.4 |
| GNNM | 1494.1 | 0.01983 | 0.00101 | 0.00170 | 748.4 |
For one-to-many matching, one may be interested in maximizing the number of matched treated subjects. In that case GNNM with rematching (10) is to be preferred. If, instead, one is interested in maximizing the number of matched pairs and minimizing the maximal within-pair score distance, then Algorithm C is the better choice.
4 Proofs of optimality
4.1 Proofs for complete matching
We offer two proofs for Theorem 2. The first proof is substantially based on the results on the Monge–Kantorovich mass transfer problem. The second proof is straightforward.
The first proof of Theorem 2. Let and be two probability measures on with the cumulative distribution functions and , respectively. Let be a convex nonnegative function. Then (relation (2.14) in Rachev, 1985)
[TABLE]
where the infimum is taken over all random variables and on a common probability space with the distributions and , respectively,
[TABLE]
is a random variable uniformly distributed on ,
[TABLE]
is the quantile transformation of .
We will apply relation (11) to prove the optimality of Algorithm B. Take and such that for all . Then and for . Therefore,
[TABLE]
To prove the optimality (8) it remains to notice that
[TABLE]
where the minimum is taken over all permutations of the set . Hence,
[TABLE]
The optimality (8) is proved. The optimality (9) can be proved analogously by Example after Theorem 2 in Rachev (1985).
Let us now prove the optimality (7). If there are two complete one-to-one matchings and of subjects, such that
[TABLE]
then there exists a such that
[TABLE]
Indeed, it is sufficient to take a such that
[TABLE]
Hence, since the function is convex for any and thus (8) is minimized for any such , the functional is minimized as well.
The theorem is proved.
The second proof of Theorem 2. Let us prove the optimality (8). Let a complete matching match to and to , where but . Thus, but . Hence, we have
[TABLE]
and
[TABLE]
Therefore, we have
[TABLE]
and, since the function is convex and ,
[TABLE]
Hence, replacing in the pairs and by the pairs and does not increase the costs (7) and (8).
Sequentially replacing pairs and such that and by the pairs and , we will finally come to the matching of Algorithm B. (We may use, e.g., a bubble sort scheme for it.) That proves that the costs (7) and (8) for Algorithm B are not greater than those of any other complete matching.
The optimalities (7) and (8) are proved.
Here we omit the proof of optimality (9) for the sake of brevity. One may see the first proof of the theorem for the proof of that optimality.
The theorem is proved.
Proof of Theorem 3 can be done analogously to the first proof of Theorem 2 by virtue of the relation ((2.14) in Rachev, 1985)
[TABLE]
where , , , , , are the same as in the proof of Theorem 2.
There is also a straightforward proof of Theorem 3 analogous to the second proof of Theorem 2.
4.2 Proofs for Algorithms A and C
For a constant caliper, Theorems 1 and 4 can be proved analogously to the first proof of Theorem 2 using the corresponding results on the Monge–Kantorovich problem in Ruzankin (2001). Here we offer the proofs that are valid for variable width calipers as well.
Proof of Theorem 1. There exists a matching $\mathcal{M}$ satisfying the caliper (i.e., $|X_{i}-Y_{j}|\leq c(X_{i},Y_{j})$ for all pairs $(i,j)$ in $\mathcal{M}$) and containing the maximal number of matched pairs.
Consider the first step. If $X_{1}<Y_{1}-c(X_{1},Y_{1})$ then
$$X_{1}<Y_{j}-c(X_{1},Y_{j})$$
for all $j$, since $Y_{1}-c(X_{1},Y_{1})\leq Y_{j}-c(X_{1},Y_{j})$ by (5), and, hence, the first treated subject cannot be used for matching. Analogously, if $Y_{1}<X_{1}-c(X_{1},Y_{1})$ then the first control subject is not suitable for matching by (4). Thus the first steps of the algorithm skip the observations that cannot be used for matching.
After the above operation we can assume, for the sake of convenience, that $|X_{1}-Y_{1}|\leq c(X_{1},Y_{1})$. Let us show that matching now the first treated subject with the first control subject, as the algorithm does, does not reduce the maximal number of matched pairs, provided that we then match the maximal number of pairs among the remaining treated subjects $2,\dots,N_{X}$ and control subjects $2,\dots,N_{Y}$.
If the first treated or the first control subject is not matched in $\mathcal{M}$ then removing from $\mathcal{M}$ a possible pair containing the first treated or the first control subject and then adding the pair $(1,1)$ to $\mathcal{M}$ does not change the number of pairs in $\mathcal{M}$. Thus, in this case, matching the pair $(1,1)$ and then matching the maximal number of pairs for the remaining treated and control subjects yields the total maximal number of matched pairs.
The case when $\mathcal{M}$ contains the pair $(1,1)$ is clear.
It remains to consider the case when $\mathcal{M}$ contains some pairs $(1,j)$ and $(i,1)$, where $i>1$ and $j>1$. In this case we have $|X_{1}-Y_{j}|\leq c(X_{1},Y_{j})$ and $|X_{i}-Y_{1}|\leq c(X_{i},Y_{1})$. Therefore
$$X_{i}\leq Y_{1}+c(X_{i},Y_{1})\leq Y_{j}+c(X_{i},Y_{j})$$
by (5) and analogously $Y_{j}\leq X_{1}+c(X_{1},Y_{j})\leq X_{i}+c(X_{i},Y_{j})$ by (4). Hence,
$$|X_{i}-Y_{j}|\leq c(X_{i},Y_{j}).$$
Thus, removing from $\mathcal{M}$ the pairs $(1,j)$ and $(i,1)$ and adding the pairs $(1,1)$ and $(i,j)$ obeys the caliper restriction and does not change the number of pairs in $\mathcal{M}$. Again, matching the pair $(1,1)$ and then matching the maximal number of pairs for the remaining treated and control subjects yields the total maximal number of matched pairs.
Applying the above argument to the remaining treated and control observations proves the optimality of Algorithm A by induction.
The theorem is proved.
Proof of Theorem 4. For the case of 1-to-$n$ matching it suffices to consider Algorithm C as Algorithm A applied to observations where we take $n$ identical treated subjects instead of each corresponding treated subject from the original observations, i.e., we “repeat” each treated subject $n$ times.
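The reduction used in this proof can be illustrated with a short Python sketch: each treated score is repeated $n$ times, the one-to-one walk of Algorithm A is run on the enlarged treated group, and the copies are mapped back to the original subjects. The function names and the constant caliper are assumptions made for the example.

```python
def algorithm_a(x, y, c):
    """One-to-one sequential walk for sorted x, y and constant caliper c."""
    pairs = []
    i = j = 0
    while i < len(x) and j < len(y):
        if abs(x[i] - y[j]) <= c:
            pairs.append((i, j))
            i += 1
            j += 1
        elif x[i] < y[j]:
            i += 1
        else:
            j += 1
    return pairs


def one_to_n_via_replication(x, y, n, c):
    """1-to-n matching obtained by repeating every treated score n times
    and running the one-to-one walk on the replicated list."""
    replicated = [xi for xi in x for _ in range(n)]  # stays sorted
    # copy number r of the replicated list belongs to treated r // n
    return [(r // n, j) for r, j in algorithm_a(replicated, y, c)]


treated = [10, 20]
controls = [9, 10, 11, 12, 19]
print(one_to_n_via_replication(treated, controls, n=3, c=2))
```

Algorithm C avoids building the replicated list: the counter $k$ plays the role of the position among the copies of the current treated subject.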
5 The case of piecewise Lipschitz caliper
In this section we will describe an algorithm which yields a maximal number of pairs under somewhat weaker conditions on the caliper. We will consider one-to-one matching, though it is easy to modify the algorithm below for the case of 1-to-$n$ matching just as was done for Algorithm A.
We will assume that, first, the caliper is “Lipschitz-nondecreasing”:
[TABLE]
and, second, the caliper is piecewise Lipschitz in both arguments: there exist disjoint intervals covering the domain of and disjoint intervals covering the domain of such that, for each ,
[TABLE]
and, for each ,
[TABLE]
For example, if or , where and are nondecreasing nonnegative step functions, or if , where denotes the greatest integer not greater than , then conditions (12)–(15) are satisfied.
Let us now introduce an algorithm for a caliper satisfying (12)–(15). As above, $M$ is the current number of matched pairs. After the algorithm finishes, $M$ is the maximal number of matched pairs. $A_{m}$ and $B_{m}$ store the index numbers of the treated and control subject, respectively, in the $m$-th matched pair.
are increasing numbers such that whenever ; and increasing numbers are such that whenever . Computing and given , , , and requires operations. If some of the intervals contain no observations then we are to take the number of intervals lesser than the number of intervals , but, to simplify notations, we use the same to enumerate , . The same is done for the intervals .
For each , we will have if and only if , the observations are already matched or discarded, and either equals or is currently neither matched nor discarded. Symmetrically, for each , we will have if and only if , the observations are already matched or discarded, and either equals or is currently neither matched nor discarded.
We will assume that “and” in the if-statement means that the second condition is checked only if the first one is true.
Algorithm D.
for all
for all
function ()
if ()
$u1:=u0+1$
while ($u1<U$ and $S_{u1}=I_{u1+1}$) $u1:=u1+1$
$u0:=u1$
$i:=S_{u0}$
end if
end function
function ()
if ()
$v1:=v0+1$
while ($v1<V$ and $T_{v1}=J_{v1+1}$) $v1:=v1+1$
$v0:=v1$
$j:=T_{v0}$
end if
end function
while ( and )
if ()
for ($v=v0,...,V-1$)
if ($T_{v}<J_{v+1}$ and $|X_{i}-Y_{T_{v}}|\leq c(X_{i},Y_{T_{v}})$)
$M:=M+1$
$A_{M}:=i$
$B_{M}:=T_{v}$
$increment\_i$()
if ($v0=v$)
$increment\_j$()
else
$T_{v}:=T_{v}+1$
end if
next while
end if
end for
$increment\_i$()
else
for ($u=u0,...,U-1$)
if ($S_{u}<I_{u+1}$ and $|X_{S_{u}}-Y_{j}|\leq c(X_{S_{u}},Y_{j})$)
$M:=M+1$
$A_{M}:=S_{u}$
$B_{M}:=j$
$increment\_j$()
if ($u0=u$)
$increment\_i$()
else
$S_{u}:=S_{u}+1$
end if
next while
end if
end for
$increment\_j$()
end if
end while
The complexity of the last algorithm is since each iteration of the while-loop requires operations and in each iteration the variable or or both are increased.
The proof for the maximality of the number of matched pairs almost repeats the proof of Theorem 4 for Algorithm C. The main difference that if, say, at some step then we have to check sequentially whether can be matched to each group , , of the control observations. As above, by (15) it is sufficient for each group to check whether can be matched to the first unmatched element of the group. Relations (12) and (13) ensure that matching to the first unmatched element of the first suitable group does not diminish the number of matched pairs below its maximal value.
6 Complexity of nearest neighbor matching
In this section we discuss the complexity of one-to-one greedy nearest neighbor matching (GNNM) under a caliper. We match sequentially the first treated subject, the second one, and so on. The matching is done without replacement. In this section no assumptions on the caliper are made.
Nearest neighbor matching for sorted observations. Let us consider observations sorted as in (6). We want to match the observations by GNNM with the caliper .
This can be done in time if we use a list data structure for the control observations. The list can be organized as the vector containing the controls’ scores, and two integer vectors for left and right pointers of the list cells. (In fact, in this case the vector for right pointers is not needed, since we use the right pointers only to move to the right through the list until we meet the first control with the score not less than that of the current treated subject.)
Nearest neighbor matching for unordered observations. Now we make no assumptions on the order of the observations. For instance, the treated observations may be randomly permuted, the permutations being uniformly distributed. Such a permutation can be done in time.
The GNNM can be done in $O(N\log N)$ time by the following algorithm.
First we build a balanced binary tree for the control observations, which requires $O(N\log N)$ operations (e.g., see Ruzankin, 2019). Each node of the tree contains the number of the corresponding control observation. The left subtree of each node contains control observations with the scores less than or equal to that of the node, and the right subtree contains controls with the scores greater than or equal to that of the node.
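One common way to obtain such a tree is to sort the controls by score and recursively take the median as the root. The sketch below follows this approach; it is an assumption made here for illustration and is not necessarily the construction of the cited reference. The names `Node`, `build_balanced`, `height`, and `inorder` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    obs: int                         # number of the control observation
    score: float                     # its score
    left: Optional["Node"] = None    # subtree with scores <= score
    right: Optional["Node"] = None   # subtree with scores >= score


def build_balanced(scores):
    """Build a balanced binary search tree over control scores."""
    order = sorted(range(len(scores)), key=scores.__getitem__)  # sorting dominates the cost

    def build(lo, hi):
        if lo >= hi:
            return None
        mid = (lo + hi) // 2         # the median of the slice becomes the subtree root
        return Node(order[mid], scores[order[mid]], build(lo, mid), build(mid + 1, hi))

    return build(0, len(order))


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


def inorder(node):
    return [] if node is None else inorder(node.left) + [node.obs] + inorder(node.right)
```

With this construction the height of the tree grows logarithmically in the number of controls, which is what the subsequent descent relies on.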
The main problem in using such trees for matching is in dealing with already matched observations. We offer the following solution to the problem.
In the process of matching, a node becomes void after the corresponding control observation is matched. The algorithm does not allow a void node to have one or no down edges. So if a leaf node’s observation is matched then the node is removed from the tree and, after that, if the parent of this node is a void node then it is also deleted, its up edge and (the only) down edge being “glued” together. Analogously, if we match an observation from a node that has only one down edge then the node is removed, its up and down edges being “glued”. But if we match a control from a node that has two down edges then this node just becomes void, but keeps containing the number of the corresponding observation.
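The removal rules above can be sketched as follows. This is a minimal illustration under assumptions: each node stores a parent pointer, the root is held in a small `Tree` wrapper, and the names `Node`, `Tree`, `splice_out`, and `mark_matched` are hypothetical.

```python
class Node:
    def __init__(self, obs, left=None, right=None):
        self.obs = obs               # number of the control observation
        self.void = False            # True once the observation is matched but the node is kept
        self.parent = None
        self.left, self.right = left, right
        for child in (left, right):
            if child is not None:
                child.parent = self


class Tree:
    def __init__(self, root=None):
        self.root = root


def children(node):
    return [c for c in (node.left, node.right) if c is not None]


def splice_out(tree, node):
    """Remove a node with at most one child, gluing its up and down edges."""
    kids = children(node)
    assert len(kids) <= 1
    child = kids[0] if kids else None
    parent = node.parent
    if child is not None:
        child.parent = parent
    if parent is None:
        tree.root = child
    elif parent.left is node:
        parent.left = child
    else:
        parent.right = child
    return parent


def mark_matched(tree, node):
    """Update the tree after the control observation stored in `node` is matched."""
    if len(children(node)) == 2:
        node.void = True             # keep the node for navigation only
        return
    was_leaf = not children(node)
    parent = splice_out(tree, node)
    # A void parent that has just lost a leaf has one down edge left, so it is removed too.
    if was_leaf and parent is not None and parent.void:
        splice_out(tree, parent)
```

Each call removes at most two nodes and touches a bounded number of pointers, and a void node always keeps exactly two down edges.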
For each treated subject, the algorithm goes down the tree. At each step of this process there are two to four guesses. Each guess has its score and is the number of a node or a dummy guess. Each guess is flagged as static or branching. Each step transforms the guesses or, when the guesses cannot be transformed, tries to match an observation from the guesses. For the first step we take the root node as a branching guess, and two static dummy guesses, one with a score below and one with a score above every admissible score (e.g., $-\infty$ and $+\infty$), respectively.
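Suppose that, at some step, we have several guesses, ordered by their scores, for matching a treated observation. First we select from these guesses the left and right guesses: if the score of the treated observation lies between the scores of two guesses adjacent in this ordering, then the one with the smaller score is assigned to be the left guess and the other one the right guess. Note that the score of the treated observation always lies between the smallest and the largest scores of the guesses.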
If both the left and right guesses are static then we match the one that is closest to the score of the treated subject among those that satisfy the caliper and are not dummy, if any, and then proceed to matching the next treated subject.
Next, if the left or right guess is static then it reproduces itself as a static guess for the next step.
If the left or right guess is a void node then it puts both of its children as branching guesses for the next step. Note that a void guess cannot be static.
If the left guess is not void and branching then it reproduces itself as a static guess for the next step and puts its right child (if any) as a branching guess for the next step.
If the right guess is not void and branching then it reproduces itself as a static guess for the next step and puts its left child (if any) as a branching guess for the next step.
Thus we have two to four guesses prepared for the next step and can proceed to it.
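The descent with guesses might be sketched as follows for a single treated subject. This is an illustration under stated assumptions rather than the author's code: guesses are represented as tuples `(score, node, static)` with `node = None` for a dummy guess, the dummy scores are taken to be minus and plus infinity, the caliper is constant, every void node is assumed to have two children, and the names `Node` and `nearest_control` are hypothetical. Removing the matched node from the tree is not shown.

```python
import math


class Node:
    def __init__(self, obs, score, left=None, right=None, void=False):
        self.obs, self.score = obs, score
        self.left, self.right, self.void = left, right, void


def nearest_control(root, t, caliper):
    """Return the node of the nearest unmatched control within the caliper, or None."""
    guesses = [(-math.inf, None, True), (math.inf, None, True)]   # static dummy guesses
    if root is not None:
        guesses.append((root.score, root, False))                 # branching guess
    while True:
        guesses.sort(key=lambda g: g[0])
        # Left and right guesses: adjacent guesses whose scores bracket t.
        m = max(k for k in range(len(guesses) - 1) if guesses[k][0] <= t)
        left, right = guesses[m], guesses[m + 1]
        if left[2] and right[2]:
            # Both static: pick the closest non-dummy guess that satisfies the caliper.
            cands = [g for g in (left, right)
                     if g[1] is not None and abs(g[0] - t) <= caliper]
            return min(cands, key=lambda g: abs(g[0] - t))[1] if cands else None
        nxt = []
        for guess, inner in ((left, "right"), (right, "left")):
            score, node, static = guess
            if static:
                nxt.append(guess)                     # a static guess reproduces itself
            elif node.void:
                # A void node passes both of its children on as branching guesses.
                nxt += [(c.score, c, False) for c in (node.left, node.right)]
            else:
                nxt.append((score, node, True))       # becomes static
                child = getattr(node, inner)          # child on the side facing t
                if child is not None:
                    nxt.append((child.score, child, False))
        guesses = nxt
```

In every iteration either the loop stops or a branching guess moves one level down, so the number of iterations is bounded by a constant multiple of the height of the tree.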
As we see, for each treated subject, we need $O(\log N)$ operations to travel down the tree and select the nearest control neighbor, and then we need $O(1)$ operations to remove the corresponding void nodes. Thus the total complexity is $O(N\log N)$.
Acknowledgments
The author is grateful to Ben B. Hansen and Mark M. Fredrickson for the discussion which has substantially improved the paper. The author thanks the reviewers for useful comments.
References
- Austin, P. C. (2011), “An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies”, Multivariate Behavioral Research, 46, No. 3: Propensity Score Analysis, 399–424.
- Austin, P. C. (2014), “A comparison of 12 algorithms for matching on the propensity score”, Statist. Med., 33, 1057–1069.
- Colannino, J., Damian, M., Hurtado, F. et al. (2007), “Efficient Many-To-Many Point Matching in One Dimension”, Graphs and Combinatorics, 23(Suppl 1), 169–178.
- Hansen, B. B. and Klopfer, S. O. (2006), “Optimal full matching and related designs via network flows”, Journal of Computational and Graphical Statistics, 15, No. 3, 609–627.
- Pimentel, S. D., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2015a), “Large, Sparse Optimal Matching With Refined Covariate Balance in an Observational Study of the Health Outcomes Produced by New Surgeons”, Journal of the American Statistical Association, 110, No. 510, 517–527.
- Pimentel, S. D., Yoon, F., and Keele, L. (2015b), “Variable-ratio matching with fine balance in a study of the Peer Health Exchange”, Statistics in Medicine, 34, No. 30, 4070–4082.
- Rachev, S. T. (1985), “The Monge-Kantorovich Mass Transference Problem and Its Stochastic Applications”, Theory Probab. Appl., 29, No. 4, 647–676.
- Rosenbaum, P. R. (2012), “Optimal Matching of an Optimally Chosen Subset in Observational Studies”, Journal of Computational and Graphical Statistics, 21, No. 1, 57–71.
