Generating Random Samples from Non-Identical Truncated Order Statistics
Tyler Morrison, Sean Pinkney

TL;DR
This paper introduces an efficient algorithm for sampling from the bounded kth order statistic of independent, non-identically distributed variables, especially effective in sparse density regions with boundary constraints.
Contribution
The paper presents a novel algorithm for generating samples from non-identical truncated order statistics, improving efficiency over rejection sampling in sparse density scenarios.
Findings
Algorithm is faster in sparse density regions.
Effective for tight boundary conditions.
Outperforms rejection sampling in specific cases.
Abstract
We provide an efficient algorithm to generate random samples from the bounded kth order statistic in a sample of independent, but not necessarily identically distributed, random variables. The bounds can be upper or lower bounds and need only hold on the kth order statistic. Furthermore, we require access to the inverse CDF for each statistic in the ordered sample. The algorithm is slightly slower than rejection sampling when the density of the bounded statistic is large, however, it is significantly faster when the bounded density becomes sparse. We provide a practical example and a simulation that shows the superiority of this method for sparse regions arising from tight boundary conditions and/or over regions of low probability density.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStochastic processes and statistical mechanics · Bayesian Methods and Mixture Models · Markov Chains and Monte Carlo Methods
Generating Random Samples from Non-Identical Truncated Order Statistics††thanks: We are grateful for comments by Hattie Young. This research was generously supported by Comscore.
Tyler Morrison [email protected]
Sean Pinkney [email protected].
( March 2024)
Abstract
We provide an efficient algorithm to generate random samples from the bounded -th order statistic in a sample of independent but not necessarily identically distributed random variables. The bounds can be upper or lower bounds and need only hold on the -th order statistic. Furthermore, we require access to the inverse CDF for each statistic in the ordered sample. The algorithm is slightly slower than rejection sampling when the density of the bounded statistic is large, however, it is significantly faster when the bounded density becomes sparse. We provide a practical example and a simulation that shows the superiority of this method for sparse regions arising from tight boundary conditions and/or over regions of low probability density.
1 Introduction
We consider the problem of how to efficiently draw pseudo-random variates from a truncated distribution where the distribution is equivalent to the -th order statistic of a set of independent random variables. The CDF and inverse CDF for each of the distributions is known. The main contribution of this paper is an algorithm that bounds the -dimensional hypercube so that the inverse CDF sampling method may be used (see Devroye [2006]).
Consider independent random variables whose CDFs, , and inverse CDFs, , are known, and we want to draw a random sample from their -th order statistic, satisfying bounds of . As the bounds narrow, rejection sampling struggles to find acceptable samples. Instead of waiting for acceptable samples, we map the samples from the -dimensional unit hypercube onto a restricted subset of another -dimensional unit hypercube that is normalized to represent the restricted probability space. We then apply the inverse CDF to obtain our draws from .
The procedure is useful in a range of applications that support the factorization of a distribution function into known distribution functions not necessarily identical. A frequent example is drawing bounded variates from the minimum or maximum of a sample of order statistics, which arise when modeling censored or truncated data. The 2-dimensional minimum or maximum case is presented as a practical motivating example prior to the general bounded -th order statistic.
2 Two-Dimensional Case
Define where each are independent, and is bounded below by and above by . Further, assume we have access to their CDFs, , and inverse CDFs, . Without the bound restriction, the sample is easily generated by drawing two independent uniform draws from , applying the inverse CDF to each draw and returning the minimum of the two. To satisfy the boundary condition, we restrict the region as
[TABLE]
where and . Since is defined as a minimum, the lower bound does not provide an issue; we simply require that each of the be greater than . The upper bound requires constraining the space such that it is possible for one of the ’s to exceed as long as the other does not. As mentioned above, rejection sampling may be used, but bounds resulting in low-density regions suffer from long running times. Our method draws samples from the unrestricted unit square and maps the results onto the restricted subspace.
The method is composed of the following steps. First generate a sample from the uniform distribution over the unit square. Then, based on the value of , determine which region this point falls into and transform the coordinates using . These new coordinates fall within with the same probability distribution that rejection sampling would give but satisfies the constraints with only a single draw needed. After the draw, apply the inverse CDFs to the to transform to the original two random variables with . Lastly, take the minimum to obtain a sample from our desired distribution: . Figure 1 shows a diagram of the two spaces and their divisions for some example values, graying out the regions where samples in cannot be located.
To partition the restricted space into regions, first note that there are three ways which the could satisfy the bounds: only is less than , both and are less than , or only is less than . is the only bound mentioned because in all scenarios both random variables must be greater than . Create an ordered list of the index combinations for which are less than in each possibility: . Assign coordinates to the axes of our unit square, and divide it into three regions for each where
[TABLE]
This defines the region to be the one where the coordinates with indices in lie within their boundaries, and the other coordinates exceed them. We can see that this division of the square forms a nonoverlapping cover of , i.e. and . In order to map into these sets without disrupting the probability distribution, we need to know the relative volumes of these regions. We see that for region , its volume is . Furthermore, the volume of the region we cut out of the unit square to create the region is , so we know that the total volume of region is . Thus, the fraction of the total volume which is occupied by is . Because we are working with a uniform distribution, we need this fraction to be the probability that our final uniform sample came from region .
We want to start with a uniform draw over an unrestricted unit square which we will call . Call the coordinates of this sample , where . We need to find a mapping from to which preserves the relative probabilities of falling into the regions . To do this, we can cut along coordinate to create three regions with volumes in our new space so that we can map region into and preserve this probability. Thus, in our unrestricted space, we can define:
[TABLE]
so that the total volume of is . Once we have this division, it is easy to create the mapping from to : where
[TABLE]
where the function indicates that this is only to be used to map between points in and , where if . Define the final mapping to be when .
3 General Procedure
The following is the general procedure for random variables of which we want the truncated -th order statistic. Denote the unrestricted -dimensional unit hypercube from which we will draw our original uniform sample as , and let points in this space have coordinates given by . Let the bound restricted hypercube be with points having coordinates . Our original bounds, and , apply to the final and not the uniform samples. By applying the CDFs, and , we obtain the appropriate limits on . Using these definitions, we can see that the restricted region from which we need to generate uniform samples can be written as:
[TABLE]
Our procedure divides and into the corresponding regions, and , and determines the mapping of points from to . After the mapping we can use ordinary inverse transform sampling to generate our draws from .
3.1 Space Segmentation
Each draw must satisfy the constraint that at least of the must be less in order for to be less than , and at least of them must be greater than in order for to be greater than . To accomplish the task, we enumerate all the possible combinations of allowed orderings. To accomplish this, let us ease the notation of each to be represented by its index , and let as the set of all possible indices. Then, we can define two sets composed of subsets of
[TABLE]
where is the power set of . The first set, , is the list of possible combinations of the which could fulfill the bound and likewise for . However, not every pair of and is a valid combination. For instance, if and we have two sets , then we see that and . This means that and but since , then we have a contradiction. In order to avoid such problems, we must impose two restrictions on the pair: so that no index is omitted and so that there is a value satisfying the bounds to be our -th order statistic. We define the set of allowed index combinations to be:
[TABLE]
Assume that this set is given an order and can therefore be indexed by integers between and .
Using the above index list, we can divide into smaller regions indexed by the elements , where
[TABLE]
Thus, is the region of the unit hypercube in which all for are greater than and all for are less than . It is easy to see that and are non-overlapping for since changing any index’s presence in either set changes the allowed range of values. Additionally, the union of these regions is the total restricted region since there are no further combinations which give us a -th order statistic in the allowed range of values.
In defining the regions corresponding to the , we ensure that the probability distribution after the mapping is uniform over to conduct inverse transform sampling. The probability densities over both and follow a multivariate uniform distribution. Because the density function is inversely proportional to the volume, using our mapping we can maintain the relative volumes of the regions being mapped to one another. Then the volume of the regions in , , are equal to the normalized volumes in the restricted regions in , . The normalizing of the volumes begins with the un-normalized volume of , , which is given by
[TABLE]
Then, the normalized volume is the fraction of the allowed volume occupied by the region and is given by .
Then, we can define the regions in so that they have volume
[TABLE]
Clearly, our region has volume , and, similarly to , and are non-overlapping and the union of all of the comprises the whole space .
3.2 Mapping Between Spaces
Now we have defined a set of allowed orderings of the random variables as whose elements identify regions and . Once we have our uniformly sampled point from , we use the above definition of the regions to determine which our point falls into. Once we know this , we can use the following mapping to translate the point into coordinates . For coordinate index and region index , we have the mapping
[TABLE]
After performing this mapping on each of the coordinates in our original sampled point, we obtain . Using the inverse CDFs, we can use these coordinates to obtain samples from the : . Then our final answer is the -th largest of these values, which due to our procedure is guaranteed to be between and without biasing the distribution.
4 Simulation
We can demonstrate the validity and computational speed of the above method by simulating draws in R.111R code for performing these procedures is available upon request. For the following simulations, we choose and , and we use the following probability distributions for the individual :
- •
- •
- •
- •
- •
We perform draws using both our new method and ordinary rejection sampling. First, we can compare the empirical CDFs generated from draws using both methods on a variety of ranges.
In Figure 2, we see that there is great agreement between the CDFs as determined by the two methods, to the level expected for draws. Once certain that our method is accurately drawing from the truncated order statistic distribution, we can assess how long it takes relative to rejection sampling. In determining timing, we used a Dell Latitude E7450 running Windows 10 Pro with a 2.6 GHz Intel Core i7 processor and 16GB 1600 MHz DDR3 memory. For the timing comparison, we used six different intervals as bounds and sampled from each interval , , , , and times, comparing the time it takes to draw all of the samples.
In Figure 3(a), we show that both rejection sampling and our new method behave similarly as the number of samples drawn is increased. This behavior is approximately linear, which makes sense due to the fact that any overhead time would be negligible in comparison with the amount of time per draw, though this overhead time for our new method can be seen in the relatively higher time taken for only one sample. In addition to the number of samples drawn, we need to consider the probability density between the bounds being considered.
Figure 3(b) shows how the execution time varies as the bounds are changed. The bounds themselves are not of interest here since an interval of fixed length can be very easy to sample from if near the peak of the distribution or hard to do if in the tail. Instead, the area of the PDF in that interval is what controls how long rejection sampling may take. The new method is less efficient than rejection sampling when the PDF area is above 5-10%, but below that it can take considerably less time. The extra time taken by the new method is largely due to the extra computations involved in the mapping and determining the mapping groups. The running time of the new method should not vary with the PDF area since every draw leads to a valid sample, which is supported by the figure, but rejection sampling varies by multiple orders of magnitude as the area gets smaller due to the number of draws needed to find a valid one. If a fraction of the total PDF area is available within the bounds, then it will take draws to find a valid sample on average from rejection sampling. The smallest PDF areas used here were 0.1% of the total area, but as this gets lower (which could occur for narrow ranges or near the tails of a distribution) the time taken should continue to increase dramatically.
5 Conclusion
By mapping unrestricted uniform draws from the -dimensional unit hypercube onto a restricted subspace based on given lower and upper bounds, we can utilize inverse transform sampling to simulate draws from the -th order statistic of random variables with known CDFs and inverse CDFs. This procedure results in an unbiased draw from this distribution, and in the case where the PDF density in the region of interest is low, the speed of execution is much faster than rejection sampling. This method could be used in place of rejection sampling to ensure that the execution will take a known amount of time if the density of the PDF is unknown in the bounded volume. A practical example would be in drawing from the tail of the minimum or maximum of a set of random variables which are not identically distributed; in such a case, this method would dramatically outperform rejection sampling.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Bapat and Beg [1989] R. B. Bapat and M. I. Beg. Order Statistics for Nonidentically Distributed Variables and Permanents. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002) , 51(1):79–93, Feb 1989. ISSN 0581-572X. doi: 10.2307/25050725 .
- 2Devroye [2006] Luc Devroye. Nonuniform random variate generation. Handbooks in operations research and management science , 13:83–121, 2006.
- 3Reiher [1966] W. Reiher. Hammersley, J. M., D. C. Handscomb: Monte Carlo Methods. Methuen & Co., London, and John Wiley & Sons, New York, 1964. VII + 178 S., Preis: 25 s. Biom. J. , 8(3):209, Jan 1966. ISSN 0006-3452. doi: 10.1002/bimj.19660080314 .
- 4Rubinstein [1982] R. Y. Rubinstein. Generating random vectors uniformly distributed inside and on the surface of different regions. Eur. J. Oper. Res. , 10(2):205–209, Jun 1982. ISSN 0377-2217. doi: 10.1016/0377-2217(82)90161-8 .
- 5Schmeiser [1980] Bruce W Schmeiser. Random variate generation: A survey. Technical report, Purdue Univeristy Lafayette in School of Industrial Engineering, 1980.
