Asymptotic Distribution of Centralized $r$ When Sampling from Cauchy
Veson Lee, Jan Vrbik

Veson Lee
Department of Mathematics and Statistics
Brock University, Canada
Jan Vrbik
Department of Mathematics and Statistics
Brock University, Canada
(September 15, 2018)
Abstract
Assume that $X$ and $Y$ are independent random variables, each having a Cauchy distribution with a known median. Taking a random independent sample of size $n$ of each of $X$ and $Y$, one can then compute their centralized empirical correlation coefficient $r_c$. Analytically investigating the sampling distribution of this $r_c$ appears possible only in the large-$n$ limit; this is what we have done in this article, deriving several new and interesting results.
1 Introduction
It can be easily shown, based on the Central Limit Theorem, that the sampling distribution of the usual sample correlation coefficient $r$, further multiplied by $\sqrt{n}$, i.e.
$$\sqrt{n}\, r = \sqrt{n}\,\frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
tends, when $n \to \infty$ and the $X_i$ and $Y_i$ are independent (both within and between the two samples), to the standard Normal distribution whenever both $X$ and $Y$ have finite means and variances. The situation changes dramatically when sampling from a Cauchy distribution. Investigating what happens in that case is rather difficult; to simplify the task, we assume that the medians of $X$ and $Y$ are known, and can therefore be subtracted from the $X_i$ and $Y_i$ values. This amounts to assuming that the two Cauchy distributions have zero medians, and it is then sufficient to define what we call the centralized sample correlation coefficient as
$$r_c = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^2 \, \sum_{i=1}^{n} Y_i^2}}$$
It is obvious that the second parameter (the quartile deviation) of each of the two Cauchy distributions cancels out of the last expression; we can thus assume (without loss of generality) that both the $X_i$ and the $Y_i$ are drawn independently from a Cauchy distribution with median equal to 0 and quartile deviation equal to 1. The objective of this article is to find the asymptotic (i.e. large-$n$) behaviour of $r_c$. This centralized $r$ is also known in the computer science, big data and data science fields as the cosine similarity measure. It has applications in text mining, data mining and information retrieval [1, 2]; the statistical distribution related to this centralized $r$ has been investigated in [3].
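The cancellation of the quartile deviation can be illustrated numerically. The following is a minimal Python sketch (standard library only); the helper names `centralized_r` and `standard_cauchy`, the seed and the sample size are choices made for this illustration and do not come from the paper.

```python
import math
import random


def centralized_r(xs, ys):
    """Centralized sample correlation (cosine similarity) of two equal-length samples."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return sxy / math.sqrt(sxx * syy)


def standard_cauchy(rng):
    """Inverse-transform draw from the Cauchy law with median 0 and quartile deviation 1."""
    return math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(2018)  # arbitrary seed, used only for reproducibility
n = 50                     # illustrative sample size
xs = [standard_cauchy(rng) for _ in range(n)]
ys = [standard_cauchy(rng) for _ in range(n)]

# Rescaling either sample changes its quartile deviation but not the statistic.
r_unit = centralized_r(xs, ys)
r_scaled = centralized_r([3.0 * x for x in xs], [0.5 * y for y in ys])
```

Here `r_unit` and `r_scaled` agree up to rounding, which is the sense in which the scale parameters drop out.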
2 Related Sampling Distributions
To obtain its asymptotic distribution, we must first explore the distribution of the individual terms of $r_c$. We start by quoting the probability density function (PDF) and the characteristic function (CF) of each $X_i$ and $Y_i$:
$$f(x) = \frac{1}{\pi\,(1 + x^2)}$$
and
$$\exp(-|t|)$$
respectively (these are well-known results [4]).
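This implies that the PDF of each $X_i^2$ and $Y_i^2$ is given by
$$\frac{1}{\pi \sqrt{u}\,(1 + u)} \quad \text{for } u > 0$$
and the distribution function of $X_i^2$ is given by
$$\frac{2}{\pi} \arctan \sqrt{u} \quad \text{for } u > 0$$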
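For a standard Cauchy $X$ one has $P(X^2 \le u) = P(|X| \le \sqrt{u}) = \frac{2}{\pi}\arctan\sqrt{u}$. The Python sketch below compares this distribution function with an empirical one; the function names, seed and sample size are assumptions of the example, not part of the paper.

```python
import math
import random


def cdf_x_squared(u):
    """Distribution function of X**2 for a standard Cauchy X."""
    return 0.0 if u <= 0 else (2.0 / math.pi) * math.atan(math.sqrt(u))


rng = random.Random(1)  # arbitrary seed
size = 200_000          # illustrative number of draws
squares = [math.tan(math.pi * (rng.random() - 0.5)) ** 2 for _ in range(size)]


def empirical_cdf(u):
    """Fraction of simulated squares not exceeding u."""
    return sum(s <= u for s in squares) / size


# (u, empirical value, theoretical value) at a few arbitrary points
checks = [(u, empirical_cdf(u), cdf_x_squared(u)) for u in (0.1, 1.0, 10.0)]
```

Note that `cdf_x_squared(1.0)` equals 1/2, i.e. the median of $X_i^2$ is 1, even though its mean is infinite.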
The corresponding CF is then
[TABLE]
and can be found by taking the appropriate Fourier transform of the PDF [5]; here we rely on computer software (such as Maple or Mathematica) to provide these.
To obtain the asymptotic CF of each of $\sum X_i^2 / n^2$ and $\sum Y_i^2 / n^2$ (denoted $W$), we have to raise the CF of $X_i^2$ to the power of $n$, replace $t$ by $t/n^2$ and then take the $n \to \infty$ limit of the resulting expression (note that only by dividing the sum by $n^2$ can one reach a finite limit; this is how such 'normalizing factors' are found in general).
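This yields
$$\exp\left(-2\sqrt{-i t/\pi}\right)$$
since the CF of $X_i^2$ behaves as $1 - 2\sqrt{-i t/\pi}$ for small $t$. The appropriate inverse Fourier transform converts this CF to the corresponding PDF, namely
$$\frac{1}{\pi\, w^{3/2}} \exp\left(-\frac{1}{\pi w}\right) \quad \text{for } w > 0$$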
One can show via Monte Carlo simulation that this constitutes a fairly accurate approximation for the PDFs of $\sum X_i^2 / n^2$ and $\sum Y_i^2 / n^2$ even for relatively small values of $n$. This is due to the fact that the corresponding CF can be expanded in increasing powers of $1/n$. The error of the approximation is thus of the $1/n$ type, which decreases faster than the $1/\sqrt{n}$ error of the Central Limit Theorem.
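A simulation of this kind can be sketched in a few lines of Python. The sketch assumes that the limiting law of $\sum X_i^2 / n^2$ is the one-sided stable (Lévy) law with scale $2/\pi$, whose distribution function is $\mathrm{erfc}\!\left(1/\sqrt{\pi w}\right)$; this closed form is our reading of the limit rather than a quotation from the paper, and the seed, $n = 20$ and the evaluation points are arbitrary.

```python
import math
import random


def levy_limit_cdf(w):
    """Assumed limiting CDF of sum(X_i**2) / n**2: Levy law with scale 2/pi."""
    return 0.0 if w <= 0 else math.erfc(1.0 / math.sqrt(math.pi * w))


def sample_w(n, rng):
    """One realization of sum(X_i**2) / n**2 for n standard Cauchy draws."""
    total = 0.0
    for _ in range(n):
        x = math.tan(math.pi * (rng.random() - 0.5))
        total += x * x
    return total / n**2


rng = random.Random(12345)  # arbitrary seed
n, reps = 20, 20_000        # deliberately small n
ws = [sample_w(n, rng) for _ in range(reps)]

# (w, empirical CDF, assumed limiting CDF) at a few arbitrary points
comparison = [
    (w, sum(v <= w for v in ws) / reps, levy_limit_cdf(w))
    for w in (0.2, 1.0, 5.0)
]
```

Increasing `n` in this sketch should shrink the gap between the two columns of `comparison` roughly in proportion to $1/n$, in line with the error estimate quoted above.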
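The PDF of $W$ can be readily converted into the (asymptotic) PDF of each of $n/\sqrt{\sum X_i^2}$ and $n/\sqrt{\sum Y_i^2}$ (each equal to the corresponding $1/\sqrt{W}$) using a simple univariate transformation, yielding
$$\frac{2}{\pi} \exp\left(-\frac{u^2}{\pi}\right) \quad \text{for } u > 0$$
which is a half-normal distribution.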
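The transformation step can be written out explicitly. As a sketch, assume that the limiting PDF of $W$ is the one-sided stable density $f_W(w) = \frac{1}{\pi} w^{-3/2} e^{-1/(\pi w)}$ (our reading of the preceding result) and write $U = 1/\sqrt{W}$, where the symbol $U$ is introduced here only for illustration:

```latex
\begin{aligned}
f_U(u) &= f_W\!\left(u^{-2}\right)\left|\frac{d}{du}\,u^{-2}\right|
        = \frac{u^{3}}{\pi}\,e^{-u^{2}/\pi}\cdot\frac{2}{u^{3}}
        = \frac{2}{\pi}\,e^{-u^{2}/\pi}, \qquad u > 0 .
\end{aligned}
```

Under these assumptions this is the half-normal density with $\sigma^2 = \pi/2$, whose mean equals $\sigma\sqrt{2/\pi} = 1$.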
Finally, each of the products $X_i Y_i$ has a CF found by
[TABLE]
resulting in
[TABLE]
which corresponds to the relatively simple PDF
$$\frac{\ln z^2}{\pi^2\,(z^2 - 1)}$$
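A quick numerical sanity check is possible here. The product of two independent standard Cauchy variables has the known density $\ln z^2 / (\pi^2 (z^2 - 1))$, and by symmetry it should place probability $1/4$ on the interval $(0, 1)$. The Python sketch below checks this with a midpoint rule; the function names and the step count are assumptions of the example.

```python
import math


def product_pdf(z):
    """Density of the product of two independent standard Cauchy variables."""
    z2 = z * z
    if z2 == 1.0:
        return 1.0 / math.pi**2  # limiting value at the removable singularity |z| = 1
    return math.log(z2) / (math.pi**2 * (z2 - 1.0))


def midpoint_integral(f, lo, hi, steps):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))


# The midpoint rule never evaluates the density at z = 0, where it diverges logarithmically.
mass_0_to_1 = midpoint_integral(product_pdf, 0.0, 1.0, 200_000)
```

The value of `mass_0_to_1` should be close to 0.25, since the product is symmetric about zero and its absolute value is as likely to fall below 1 as above it.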
Assuming that each $X_i Y_i$ (and also their sum) is asymptotically independent of $\sum X_i^2$ and $\sum Y_i^2$ (something we have been able to verify only empirically), we now find the CF of
[TABLE]
where the two factors multiplying $X_i Y_i$ are independent random variables, each having the half-normal asymptotic PDF derived above. The resulting approximate CF is computed by
[TABLE]
where the usual complex exponential has been replaced by the corresponding cosine, since the resulting distribution is symmetric, implying that its CF has no imaginary part.
To simplify the integration, we perform it in the usual polar coordinates, getting
[TABLE]
3 Asymptotic PDF of Centralized $r$
To get an approximate CF of
[TABLE]
(note the unusual normalizing factor, the only way to achieve a finite limit), we need to first raise the CF of the previous section to the power of $n$ while simultaneously replacing $t$ by $t/(n \ln n)$, then take the $n \to \infty$ limit and finally evaluate the remaining integral. These operations can be carried out in any order in this particular case, something that is not true in general.
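The origin of the normalizing factor can be seen from a purely algebraic rearrangement of the normalized statistic (the factor $n/\ln n$ is the one used in the Mathematica program of Section 4):

```latex
\frac{n}{\ln n}\, r_c
  = \frac{n}{\ln n}\cdot
    \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{j=1}^{n} X_j^2 \,\sum_{j=1}^{n} Y_j^2}}
  = \frac{1}{n \ln n}\sum_{i=1}^{n} X_i Y_i
    \cdot \frac{n}{\sqrt{\sum_{j=1}^{n} X_j^2}}
    \cdot \frac{n}{\sqrt{\sum_{j=1}^{n} Y_j^2}} .
```

The last two factors are the quantities with half-normal limits from Section 2, so the sum of the $n$ products $X_i Y_i$ ends up divided by $n \ln n$, which corresponds to replacing $t$ by $t/(n \ln n)$ in the CF.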
This yields before the integration
[TABLE]
Since the error of this approximation is proportional to $1/\ln n$, the actual convergence of the subsequent result is expected to be rather slow (reaching good accuracy only when $n$ is at least a few hundred). One can recover a part of this error (trying to recover the full term proportional to $1/\ln n$ would make the resulting expression too cumbersome) by replacing the expression above with the more accurate
[TABLE]
where
[TABLE]
To complete the computation, we evaluate
[TABLE]
where the two special functions appearing in the result are the Bessel and modified Struve functions, respectively. Converting to the PDF results in:
[TABLE]
which is our final answer for the approximate distribution of the sum of the $n$ terms, which is the same as the distribution of
$$\frac{n}{\ln n}\, r_c$$
Note that, similarly to the Cauchy distribution itself, this distribution has an indefinite mean and an infinite variance.
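One way to see why the mean is indefinite and the variance infinite is to note that the approximate PDF coded in the Mathematica program of Section 4, $\left(2\ln\!\left(a + \sqrt{a^2 + x^2}\right) - \ln x^2\right) / \left(\pi^2 \sqrt{a^2 + x^2}\right)$ with $a = 1 - \ln \pi / \ln n$, can be rewritten as an average of Cauchy densities with scale $a \sin\theta$, where $\theta$ is uniform on $(0, \pi/2)$. This mixture form is an observation added here, not a statement from the paper; the Python sketch below checks it numerically, with function names and step counts chosen for illustration.

```python
import math


def approx_pdf(x, a):
    """Approximate PDF as coded in the Mathematica program; x must be nonzero."""
    root = math.sqrt(a * a + x * x)
    return (2.0 * math.log(a + root) - math.log(x * x)) / (math.pi**2 * root)


def cauchy_mixture_pdf(x, a, steps=20_000):
    """Midpoint-rule average of Cauchy densities with scale a*sin(theta), theta uniform on (0, pi/2)."""
    h = (math.pi / 2) / steps
    total = 0.0
    for k in range(steps):
        s = a * math.sin((k + 0.5) * h)
        total += s / (math.pi * (s * s + x * x))
    return total * h * (2.0 / math.pi)


n = 400                                   # sample size used in Section 4
a = 1.0 - math.log(math.pi) / math.log(n)
pairs = [(approx_pdf(x, a), cauchy_mixture_pdf(x, a)) for x in (0.25, 1.0, 4.0)]
```

Each pair in `pairs` should agree to many decimal places; since every component of the mixture is Cauchy, the mixture has no mean and an infinite variance.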
4 Monte Carlo Verification
We will now verify the accuracy of our approximation by randomly generating 100 000 values of $\frac{n}{\ln n}\, r_c$, using $n = 400$. The following Mathematica program does this and plots the corresponding histogram together with our approximate PDF.
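(* centralized sample correlation coefficient (cosine similarity) *)
G[x_, y_] := x.y/Sqrt[x.x y.y]
n = 400; superN = 100000; data = {};
SeedRandom[];
(* generate superN values of (n/ln n) r_c from standard Cauchy samples of size n *)
Do[
  dataX = RandomReal[CauchyDistribution[], n];
  dataY = RandomReal[CauchyDistribution[], n];
  AppendTo[data, (n/Log[n]) * G[dataX, dataY]],
  {superN}]
(* approximate PDF with the correction a = 1 - ln(Pi)/ln(n) *)
a = 1 - Log[Pi]/Log[n];
pdf = (2 Log[a + Sqrt[a^2 + x^2]] - Log[x^2])/(Pi^2 Sqrt[a^2 + x^2]);
(* histogram of counts with bin width 0.25, overlaid with the PDF scaled to counts *)
Show[Histogram[data, {-4, 4, 0.25}],
  Plot[0.25*superN*pdf, {x, -4, 4}, PlotRange -> {0, 0.2*superN}]]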
The two results are in good agreement, as can be seen in Figure 1.
5 Conclusion
We have derived an asymptotic distribution of the centralized sample correlation coefficient when sampling from a Cauchy distribution, discovering some of its unusual properties. As a byproduct, several other interesting distributions have been introduced in the process.
References
[1] Tan, P.-N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Pearson Addison Wesley, Boston (2005)
[2] Singhal, A.: Modern Information Retrieval: A Brief Overview. Bul. IEEE on Data Engineering, Vol. 24, No. 4, 35-43 (2001)
[3] Giller, L. G.: The Statistical Properties of Random Bitstreams and the Sampling Distribution of Cosine Similarity. (2012) doi:10.2139/ssrn.2167044
[4] Johnson, N. L., Kotz, S., Balakrishnan, N.: Continuous Univariate Distributions, Volume 1, Second Edition. Wiley, New York (1994)
[5] Billingsley, P.: Probability and Measure, 3rd Edition. John Wiley & Sons, New York (1995)
