Asymptotic Distribution of Centralized $r$ When Sampling from Cauchy

Veson Lee; Jan Vrbik

arXiv:1812.10596·math.ST·December 31, 2018


TL;DR

This paper investigates the asymptotic distribution of the centralized empirical correlation coefficient when sampling from Cauchy distributions, providing new theoretical insights into its behavior as sample size grows large.

Contribution

It derives novel results on the large-sample distribution of the centralized correlation coefficient for Cauchy-distributed variables, a problem previously lacking detailed analysis.

Findings

1. Derived the asymptotic distribution of the correlation coefficient
2. Provided analytical results for large sample sizes
3. Enhanced understanding of correlation behavior with Cauchy data

Abstract

Assume that $X$ and $Y$ are independent random variables, each having a Cauchy distribution with a known median. Taking a random independent sample of size $n$ of each $X$ and $Y$, one can then compute their centralized empirical correlation coefficient $r$. Analytically investigating the sampling distribution of this $r$ appears possible only in the large $n$ limit; this is what we have done in this article, deriving several new and interesting results.

Equations

$$\frac{\sqrt{n}\cdot\left(\sum_{i=1}^{n}X_{i}Y_{i}-\frac{\sum_{i=1}^{n}X_{i}\cdot\sum_{i=1}^{n}Y_{i}}{n}\right)}{\sqrt{\sum_{i=1}^{n}X_{i}^{2}-\frac{\left(\sum_{i=1}^{n}X_{i}\right)^{2}}{n}}\,\sqrt{\sum_{i=1}^{n}Y_{i}^{2}-\frac{\left(\sum_{i=1}^{n}Y_{i}\right)^{2}}{n}}}$$

$$r_{c}=\frac{\sum_{i=1}^{n}X_{i}Y_{i}}{\sqrt{\sum_{i=1}^{n}X_{i}^{2}\,\sum_{i=1}^{n}Y_{i}^{2}}}$$

$$f(x)=\frac{1}{\pi\left(x^{2}+1\right)}$$

$$\varphi(t)=\exp\left(-|t|\right)$$

$$f(z)=\begin{cases}\dfrac{1}{\pi\sqrt{z}\,(z+1)} & z>0\\[1ex] 0 & \text{elsewhere}\end{cases}$$

$$F_{X^{2}}(z)=\Pr\left(X^{2}\leq z\right)=\frac{1}{\pi}\int_{-\sqrt{z}}^{\sqrt{z}}\frac{dx}{1+x^{2}}=\frac{2}{\pi}\arctan\sqrt{z}$$

$$\varphi_{X^{2}}(t)=\exp(-it)\left[1-\operatorname{erf}\left(\sqrt{-it}\right)\right]$$

$$\varphi_{W}(t)=\exp\left(-2\sqrt{\frac{-it}{\pi}}\right)$$

$$f(w)=\begin{cases}\dfrac{\exp\left(-\frac{1}{\pi w}\right)}{\pi w^{3/2}} & w>0\\[1ex] 0 & \text{elsewhere}\end{cases}$$

$$f(u)=\begin{cases}\dfrac{2\exp\left(-\frac{u^{2}}{\pi}\right)}{\pi} & u>0\\[1ex] 0 & \text{elsewhere}\end{cases}$$

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp(ixyt)\cdot\frac{1}{\pi\left(x^{2}+1\right)}\cdot\frac{1}{\pi\left(y^{2}+1\right)}\,dx\,dy$$

$$\varphi_{S}(t)=\begin{cases}\dfrac{2\left(\sin|t|\operatorname{Ci}|t|-\cos|t|\operatorname{Si}|t|\right)}{\pi}+\cos|t| & t\neq 0\\[1ex] 1 & t=0\end{cases}$$

$$f(s)=\begin{cases}\dfrac{\ln\left(s^{2}\right)}{\pi^{2}\left(s^{2}-1\right)} & s\in\mathbb{R}\setminus\{-1,0,1\}\\[1ex] \dfrac{1}{\pi^{2}} & s\in\{-1,1\}\\[1ex] 0 & \text{elsewhere}\end{cases}$$

$$X_{i}Y_{i}\cdot\frac{n}{\sqrt{\sum_{i=1}^{n}X_{i}^{2}}}\cdot\frac{n}{\sqrt{\sum_{i=1}^{n}Y_{i}^{2}}}\simeq S\cdot U_{1}\cdot U_{2}$$

$$\frac{4}{\pi^{4}}\int_{0}^{\infty}\int_{0}^{\infty}\int_{-\infty}^{\infty}\cos(u_{1}u_{2}st)\cdot\exp\left(-\frac{u_{1}^{2}+u_{2}^{2}}{\pi}\right)\cdot\frac{\ln s^{2}}{s^{2}-1}\,ds\,du_{1}\,du_{2}$$

$$\begin{aligned}&\frac{4}{\pi^{4}}\int_{0}^{\infty}\int_{0}^{\frac{\pi}{2}}\int_{-\infty}^{\infty}R\cos\left(stR^{2}\frac{\sin 2\Theta}{2}\right)\cdot\exp\left(-\frac{R^{2}}{\pi}\right)\cdot\frac{\ln s^{2}}{s^{2}-1}\,ds\,d\Theta\,dR\\ &=\frac{2}{\pi^{3}}\int_{0}^{\frac{\pi}{2}}\int_{-\infty}^{\infty}\frac{1}{1+\left(\pi st\frac{\sin 2\Theta}{2}\right)^{2}}\cdot\frac{\ln s^{2}}{s^{2}-1}\,ds\,d\Theta\\ &=\frac{2}{\pi}\int_{0}^{\frac{\pi}{2}}\frac{1+|t|\sin 2\Theta\,\ln\left(\pi|t|\frac{\sin 2\Theta}{2}\right)}{1+\left(\pi t\frac{\sin 2\Theta}{2}\right)^{2}}\,d\Theta\end{aligned}$$

$$\frac{\sum_{i=1}^{n}X_{i}Y_{i}}{n\ln n}\cdot\frac{n}{\sqrt{\sum_{i=1}^{n}X_{i}^{2}}}\cdot\frac{n}{\sqrt{\sum_{i=1}^{n}Y_{i}^{2}}}$$

$$\exp\left(-|t|\sin 2\Theta\right)+O\left(\frac{1}{\ln n}\right)$$

$$\exp\left(-a|t|\sin 2\Theta\right)$$

$$a=1-\frac{\ln\pi}{\ln n}$$

$$\varphi_{r_{c}}(t)=\frac{2}{\pi}\int_{0}^{\frac{\pi}{2}}\exp\left(-a|t|\sin 2\Theta\right)d\Theta=I_{0}(a|t|)-L_{0}(a|t|)$$

$$f(r_{c})=\frac{2\ln\left(a+\sqrt{a^{2}+r_{c}^{2}}\right)-\ln r_{c}^{2}}{\pi^{2}\sqrt{a^{2}+r_{c}^{2}}}$$

$$\frac{n\cdot r_{c}}{\ln n}$$


Full text

Asymptotic Distribution of Centralized $r$ When Sampling from Cauchy

Veson Lee

Department of Mathematics and Statistics

Brock University, Canada

Jan Vrbik

Department of Mathematics and Statistics

Brock University, Canada

(September 15, 2018)


1 Introduction

It can be easily shown, based on the Central Limit Theorem, that the sampling distribution of the usual sample correlation coefficient further multiplied by $\sqrt{n}$, i.e.

$$\frac{\sqrt{n}\cdot\left(\sum_{i=1}^{n}X_{i}Y_{i}-\frac{\sum_{i=1}^{n}X_{i}\cdot\sum_{i=1}^{n}Y_{i}}{n}\right)}{\sqrt{\sum_{i=1}^{n}X_{i}^{2}-\frac{\left(\sum_{i=1}^{n}X_{i}\right)^{2}}{n}}\,\sqrt{\sum_{i=1}^{n}Y_{i}^{2}-\frac{\left(\sum_{i=1}^{n}Y_{i}\right)^{2}}{n}}}$$

tends, as $n\to\infty$, to the standard Normal distribution whenever the $X_{i}$ and $Y_{i}$ are independent (both within and between the two samples) and both $X$ and $Y$ have finite means and variances. The situation changes dramatically when sampling from a Cauchy distribution. Investigating what happens in that case is rather difficult; to simplify the task, we assume that the medians of $X$ and $Y$ are known, and can therefore be subtracted from the $X_{i}$ and $Y_{i}$ values. This amounts to assuming that the two Cauchy distributions have zero medians, and it is then sufficient to define what we call the centralized sample correlation coefficient as

$$r_{c}=\frac{\sum_{i=1}^{n}X_{i}Y_{i}}{\sqrt{\sum_{i=1}^{n}X_{i}^{2}\,\sum_{i=1}^{n}Y_{i}^{2}}}$$

It is obvious that the second parameter (the quartile deviation) of each of the two Cauchy distributions cancels out of the last expression; we can thus assume (without loss of generality) that both the $X_{i}$ and $Y_{i}$ are drawn independently from a Cauchy distribution with median equal to 0 and quartile deviation equal to 1. The objective of this article is to find the asymptotic (i.e. large-$n$) behaviour of $r_{c}$. In computer science, big data and data science, this centralized $r_{c}$ is also known as the cosine similarity measure. It has applications in text mining, data mining and information retrieval [1, 2]; an investigation of a statistical distribution related to this centralized $r_{c}$ can be found in [3].
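The cancellation of the scale parameters can be illustrated numerically. The following is a minimal Python sketch (standard library only), not taken from the paper: the function names, the seed, the sample size of 50 and the scale factors 3 and 0.5 are all illustrative assumptions. It computes $r_{c}=\sum X_{i}Y_{i}/\sqrt{\sum X_{i}^{2}\sum Y_{i}^{2}}$ for a pair of zero-median Cauchy samples and again after rescaling each sample.

```python
import math
import random


def standard_cauchy(rng):
    """Draw from the Cauchy distribution with median 0 and quartile deviation 1
    by inverting its distribution function."""
    return math.tan(math.pi * (rng.random() - 0.5))


def centralized_r(xs, ys):
    """Centralized sample correlation (cosine similarity) of two equal-length samples."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
xs = [standard_cauchy(rng) for _ in range(50)]
ys = [standard_cauchy(rng) for _ in range(50)]

r_plain = centralized_r(xs, ys)
# Rescaling X by 3 and Y by 0.5 mimics different quartile deviations.
r_scaled = centralized_r([3.0 * x for x in xs], [0.5 * y for y in ys])

print(r_plain, r_scaled)
```

Up to floating-point rounding the two printed values coincide, which is the sense in which the quartile deviations drop out of $r_{c}$.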

2 Related Sampling Distributions

To obtain its asymptotic distribution, we must first explore the distribution of the individual terms of $r_{c}$. We start by quoting the probability density function (PDF) and the characteristic function (CF) of each $X_{i}$ and $Y_{i}$:

$$f(x)=\frac{1}{\pi\left(x^{2}+1\right)}$$

and

$$\varphi(t)=\exp\left(-|t|\right)$$

respectively (these are well-known results [4]).

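This implies that the PDF of each $X_{i}^{2}$ and $Y_{i}^{2}$ (denoted $Z$) is given by

$$f(z)=\begin{cases}\dfrac{1}{\pi\sqrt{z}\,(z+1)} & z>0\\[1ex] 0 & \text{elsewhere}\end{cases}$$

and the distribution function of $Z$ is given, for $z>0$, by

$$F_{X^{2}}(z)=\Pr\left(X^{2}\leq z\right)=\frac{1}{\pi}\int_{-\sqrt{z}}^{\sqrt{z}}\frac{dx}{1+x^{2}}=\frac{2}{\pi}\arctan\sqrt{z}$$

The corresponding CF is then

$$\varphi_{X^{2}}(t)=\exp(-it)\left[1-\operatorname{erf}\left(\sqrt{-it}\right)\right]$$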

and can be found by taking the appropriate Fourier transform of the PDF [5]; here we rely on computer software (such as Maple or Mathematica) to provide these.

To obtain the CF of the limiting distribution of each $\frac{\sum_{i=1}^{n}X_{i}^{2}}{n^{2}}$ and $\frac{\sum_{i=1}^{n}Y_{i}^{2}}{n^{2}}$ as $n\to\infty$ (the limit is denoted $W$), we have to raise the CF of $X_{i}^{2}$ given above to the power of $n$, replace $t$ by $\frac{t}{n^{2}}$ and then take the $n\to\infty$ limit of the resulting expression (note that only by dividing by $n^{2}$ can one reach a finite limit; that is how these 'normalizing factors' are found in general).

This yields

$$\varphi_{W}(t)=\exp\left(-2\sqrt{\frac{-it}{\pi}}\right)$$

since $\textrm{erf}{\left(x\right)}\simeq\frac{2x}{\sqrt{\pi}}$ for small $x$ . The appropriate inverse Fourier transform converts this CF to the corresponding PDF, namely

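$$f(w)=\begin{cases}\dfrac{\exp\left(-\frac{1}{\pi w}\right)}{\pi w^{3/2}} & w>0\\[1ex] 0 & \text{elsewhere}\end{cases}$$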

One can show via Monte Carlo simulation that this constitutes a fairly accurate approximation for the PDFs of $\frac{\sum_{i=1}^{n}X_{i}^{2}}{n^{2}}$ and $\frac{\sum_{i=1}^{n}Y_{i}^{2}}{n^{2}}$ even for relatively small $n$ values ($n\geq 30$). This is due to the fact that the corresponding CF can be expanded in increasing powers of $\frac{1}{n}$. The error of the approximation is thus of the $O\left(\frac{1}{n}\right)$ type, which decays faster than the $O\left(\frac{1}{\sqrt{n}}\right)$ error of the Central Limit Theorem.
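A quick way to try this claim is sketched below in Python (standard library only). This is not the authors' simulation: the function names, the seed, $n=100$ and the number of trials are illustrative assumptions. The sketch compares the empirical distribution function of $\frac{\sum X_{i}^{2}}{n^{2}}$ with the distribution function implied by the limiting PDF, $F(w)=\operatorname{erfc}\left(1/\sqrt{\pi w}\right)$, which is obtained here by integrating that PDF from 0 to $w$.

```python
import math
import random


def standard_cauchy(rng):
    """Inverse-CDF draw from the Cauchy distribution with median 0, quartile deviation 1."""
    return math.tan(math.pi * (rng.random() - 0.5))


def scaled_sum_of_squares(n, rng):
    """One realization of (X_1^2 + ... + X_n^2) / n^2."""
    return sum(standard_cauchy(rng) ** 2 for _ in range(n)) / n ** 2


def limit_cdf(w):
    """Distribution function of the limit W: the integral of
    exp(-1/(pi w)) / (pi w^(3/2)) from 0 to w."""
    return math.erfc(1.0 / math.sqrt(math.pi * w))


rng = random.Random(1)      # arbitrary fixed seed
n, trials = 100, 4000       # illustrative sizes, not the paper's
samples = [scaled_sum_of_squares(n, rng) for _ in range(trials)]

gaps = []
for w in (0.25, 1.0, 4.0):
    empirical = sum(s <= w for s in samples) / trials
    gaps.append(abs(empirical - limit_cdf(w)))
    print(f"w={w}: empirical={empirical:.3f}, limit={limit_cdf(w):.3f}")

max_gap = max(gaps)
```

With these settings the remaining discrepancy is dominated by Monte Carlo noise (roughly $\sqrt{0.25/4000}\approx 0.008$ per point) rather than by the finite-$n$ error.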

The PDF of $W$ can be readily converted into the (asymptotic) PDF of each $\frac{n}{\sqrt{\sum_{i=1}^{n}X_{i}^{2}}}$ and $\frac{n}{\sqrt{\sum_{i=1}^{n}Y_{i}^{2}}}$ in the $n\to\infty$ limit (denoted $U$ and equal to $\frac{1}{\sqrt{W}}$) using a simple univariate transformation, yielding

$$f(u)=\begin{cases}\dfrac{2\exp\left(-\frac{u^{2}}{\pi}\right)}{\pi} & u>0\\[1ex] 0 & \text{elsewhere}\end{cases}$$

that is, a half-normal distribution (with scale parameter $\sigma=\sqrt{\pi/2}$).
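The univariate transformation can be written out as follows. This is a sketch that uses only the PDF of $W$ quoted above, $\exp\left(-\frac{1}{\pi w}\right)/\left(\pi w^{3/2}\right)$ for $w>0$; the labels $f_{W}$ and $f_{U}$ are introduced here for clarity. With $w=u^{-2}$ and $\left|\frac{dw}{du}\right|=2u^{-3}$,

$$f_{U}(u)=f_{W}\left(u^{-2}\right)\left|\frac{dw}{du}\right|=\frac{u^{3}\exp\left(-\frac{u^{2}}{\pi}\right)}{\pi}\cdot\frac{2}{u^{3}}=\frac{2}{\pi}\exp\left(-\frac{u^{2}}{\pi}\right),\qquad u>0.$$

As a check on the constant, $\int_{0}^{\infty}\exp\left(-\frac{u^{2}}{\pi}\right)du=\frac{\pi}{2}$, so the density integrates to 1; the factor $\frac{2}{\pi}$ per variable is also what produces the $\frac{4}{\pi^{4}}$ prefactor in the triple integral for the CF of $S\cdot U_{1}\cdot U_{2}$.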

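Finally, each of the $X_{i}Y_{i}$ (denoted $S$) has a CF found by

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp(ixyt)\cdot\frac{1}{\pi\left(x^{2}+1\right)}\cdot\frac{1}{\pi\left(y^{2}+1\right)}\,dx\,dy$$

resulting in

$$\varphi_{S}(t)=\begin{cases}\dfrac{2\left(\sin|t|\operatorname{Ci}|t|-\cos|t|\operatorname{Si}|t|\right)}{\pi}+\cos|t| & t\neq 0\\[1ex] 1 & t=0\end{cases}$$

which corresponds to the relatively simple PDF

$$f(s)=\begin{cases}\dfrac{\ln\left(s^{2}\right)}{\pi^{2}\left(s^{2}-1\right)} & s\in\mathbb{R}\setminus\{-1,0,1\}\\[1ex] \dfrac{1}{\pi^{2}} & s\in\{-1,1\}\\[1ex] 0 & \text{elsewhere}\end{cases}$$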
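The PDF of $S$, namely $\ln\left(s^{2}\right)/\left(\pi^{2}\left(s^{2}-1\right)\right)$, can be sanity-checked with a simple symmetry argument: since $1/X$ is again a standard Cauchy variable, $\ln|X|+\ln|Y|$ is symmetric about 0 and therefore $\Pr(|S|\leq 1)=\frac{1}{2}$ exactly. The Python sketch below (standard library only; the names, seed, step count and trial count are illustrative assumptions, not from the paper) integrates the PDF over $(-1,1)$ with a midpoint rule and also estimates the same probability by simulation.

```python
import math
import random


def product_pdf(s):
    """PDF of S = X*Y for independent standard Cauchy X and Y (s not in {-1, 0, 1})."""
    return math.log(s * s) / (math.pi ** 2 * (s * s - 1.0))


def standard_cauchy(rng):
    """Inverse-CDF draw from the Cauchy distribution with median 0, quartile deviation 1."""
    return math.tan(math.pi * (rng.random() - 0.5))


# Midpoint rule on (0, 1); the midpoints never hit the special points 0 and 1.
steps = 20000
h = 1.0 / steps
pdf_mass = 2.0 * h * sum(product_pdf((k + 0.5) * h) for k in range(steps))

# Monte Carlo estimate of Pr(|XY| <= 1).
rng = random.Random(7)  # arbitrary fixed seed
trials = 20000
hits = sum(abs(standard_cauchy(rng) * standard_cauchy(rng)) <= 1.0 for _ in range(trials))
mc_mass = hits / trials

print(pdf_mass, mc_mass)
```

Both numbers should be close to one half; the logarithmic singularity of the PDF at $s=0$ is integrable, so the midpoint rule handles it without special treatment.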

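Assuming that each $X_{i}Y_{i}$ (and also their sum) is asymptotically independent of $\frac{n}{\sqrt{\sum_{i=1}^{n}X_{i}^{2}}}$ and $\frac{n}{\sqrt{\sum_{i=1}^{n}Y_{i}^{2}}}$ (something we have been able to verify only empirically), we now find the CF of

$$X_{i}Y_{i}\cdot\frac{n}{\sqrt{\sum_{i=1}^{n}X_{i}^{2}}}\cdot\frac{n}{\sqrt{\sum_{i=1}^{n}Y_{i}^{2}}}\simeq S\cdot U_{1}\cdot U_{2}$$

where $U_{1}$ and $U_{2}$ are independent random variables, each having the half-normal asymptotic PDF of $U$ given above. The resulting approximate CF is computed by

$$\frac{4}{\pi^{4}}\int_{0}^{\infty}\int_{0}^{\infty}\int_{-\infty}^{\infty}\cos(u_{1}u_{2}st)\cdot\exp\left(-\frac{u_{1}^{2}+u_{2}^{2}}{\pi}\right)\cdot\frac{\ln s^{2}}{s^{2}-1}\,ds\,du_{1}\,du_{2}$$

where the usual $\exp(iu_{1}u_{2}st)$ has been replaced by $\cos(u_{1}u_{2}st)$, since the resulting distribution is symmetric, implying that its CF has no imaginary part.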

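To simplify the $du_{1}\,du_{2}$ integration, we perform it in the usual polar coordinates (denoted $R$ and $\Theta$), getting

$$\begin{aligned}&\frac{4}{\pi^{4}}\int_{0}^{\infty}\int_{0}^{\frac{\pi}{2}}\int_{-\infty}^{\infty}R\cos\left(stR^{2}\frac{\sin 2\Theta}{2}\right)\cdot\exp\left(-\frac{R^{2}}{\pi}\right)\cdot\frac{\ln s^{2}}{s^{2}-1}\,ds\,d\Theta\,dR\\ &=\frac{2}{\pi^{3}}\int_{0}^{\frac{\pi}{2}}\int_{-\infty}^{\infty}\frac{1}{1+\left(\pi st\frac{\sin 2\Theta}{2}\right)^{2}}\cdot\frac{\ln s^{2}}{s^{2}-1}\,ds\,d\Theta\\ &=\frac{2}{\pi}\int_{0}^{\frac{\pi}{2}}\frac{1+|t|\sin 2\Theta\,\ln\left(\pi|t|\frac{\sin 2\Theta}{2}\right)}{1+\left(\pi t\frac{\sin 2\Theta}{2}\right)^{2}}\,d\Theta\end{aligned}$$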
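The two integrations that reduce the polar-coordinate triple integral to a single $d\Theta$ integral can be sketched as follows. The abbreviations $b=st\frac{\sin 2\Theta}{2}$ and $c=\pi|t|\frac{\sin 2\Theta}{2}$ are introduced here for convenience and are not part of the original notation; $t\neq 0$ and $0<\Theta<\frac{\pi}{2}$ are assumed so that $c>0$. Substituting $v=R^{2}$ in the $dR$ integration gives

$$\int_{0}^{\infty}R\cos\left(bR^{2}\right)\exp\left(-\frac{R^{2}}{\pi}\right)dR=\frac{1}{2}\int_{0}^{\infty}\cos(bv)\exp\left(-\frac{v}{\pi}\right)dv=\frac{\pi}{2\left(1+\pi^{2}b^{2}\right)},$$

which turns the prefactor $\frac{4}{\pi^{4}}$ into $\frac{2}{\pi^{3}}$. For the $ds$ integration, partial fractions give

$$\frac{1}{\left(s^{2}-1\right)\left(1+c^{2}s^{2}\right)}=\frac{1}{1+c^{2}}\left(\frac{1}{s^{2}-1}-\frac{c^{2}}{1+c^{2}s^{2}}\right),$$

and the two resulting pieces are

$$\int_{-\infty}^{\infty}\frac{\ln s^{2}}{s^{2}-1}\,ds=\pi^{2},\qquad\int_{-\infty}^{\infty}\frac{c^{2}\ln s^{2}}{1+c^{2}s^{2}}\,ds=-2\pi c\ln c.$$

The first holds because the PDF of $S$ integrates to 1; the second follows from the substitution $v=cs$ together with $\int_{-\infty}^{\infty}\frac{\ln v^{2}}{1+v^{2}}\,dv=0$. Combining them,

$$\frac{2}{\pi^{3}}\cdot\frac{\pi^{2}+2\pi c\ln c}{1+c^{2}}=\frac{2}{\pi}\cdot\frac{1+|t|\sin 2\Theta\,\ln\left(\pi|t|\frac{\sin 2\Theta}{2}\right)}{1+\left(\pi t\frac{\sin 2\Theta}{2}\right)^{2}},$$

since $\frac{2c}{\pi}=|t|\sin 2\Theta$.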

3 Asymptotic PDF of $r_{c}$

To get an approximate CF of

$$\frac{\sum_{i=1}^{n}X_{i}Y_{i}}{n\ln n}\cdot\frac{n}{\sqrt{\sum_{i=1}^{n}X_{i}^{2}}}\cdot\frac{n}{\sqrt{\sum_{i=1}^{n}Y_{i}^{2}}}$$

(note the unusual normalizing factor $n\ln n$, which is the only way to achieve a finite limit), we need to first raise the single-term CF obtained at the end of the previous section to the power of $n$ while simultaneously replacing $t$ by $\frac{t}{n\ln{n}}$, then take the $n\to\infty$ limit and finally evaluate the remaining integral. In this particular case these operations can be carried out in any order, something that is not true in general.

Before the $d\Theta$ integration, this yields

$$\exp\left(-|t|\sin 2\Theta\right)+O\left(\frac{1}{\ln n}\right)$$

Since the error of this approximation is proportional to $\frac{1}{\ln{n}}$, the actual convergence of the subsequent result is expected to be rather slow (reaching a good accuracy only when $n$ is at least a few hundred). One can recover a part of this error (trying to recover the full $\frac{1}{\ln{n}}$ proportional term would make the resulting expression too cumbersome) by replacing the leading term $\exp\left(-|t|\sin 2\Theta\right)$ with the more accurate

$$\exp\left(-a|t|\sin 2\Theta\right)$$

where

$$a=1-\frac{\ln\pi}{\ln n}$$

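To complete the computation, we evaluate

$$\varphi_{r_{c}}(t)=\frac{2}{\pi}\int_{0}^{\frac{\pi}{2}}\exp\left(-a|t|\sin 2\Theta\right)d\Theta=I_{0}(a|t|)-L_{0}(a|t|)$$

where $I_{0}$ and $L_{0}$ are the modified Bessel function of the first kind and the modified Struve function, respectively (both of order zero). Converting to PDF results in:

$$f(r_{c})=\frac{2\ln\left(a+\sqrt{a^{2}+r_{c}^{2}}\right)-\ln r_{c}^{2}}{\pi^{2}\sqrt{a^{2}+r_{c}^{2}}}$$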

which is our final answer for the approximate distribution of the normalized sum introduced at the beginning of this section, which is the same as

$$\frac{n\cdot r_{c}}{\ln n}$$

Note that, similarly to the Cauchy distribution itself, this approximate distribution of $\frac{n\cdot r_{c}}{\ln n}$ has an undefined mean and an infinite variance.
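The statement about the moments can be seen from the tail of the approximate PDF $\frac{2\ln\left(a+\sqrt{a^{2}+r_{c}^{2}}\right)-\ln r_{c}^{2}}{\pi^{2}\sqrt{a^{2}+r_{c}^{2}}}$. The following is a short sketch of the large-$|r_{c}|$ expansion, written out here as a supporting step rather than quoted from the paper. For $|r_{c}|\gg a$,

$$2\ln\left(a+\sqrt{a^{2}+r_{c}^{2}}\right)-\ln r_{c}^{2}=2\ln\left(\frac{a}{|r_{c}|}+\sqrt{1+\frac{a^{2}}{r_{c}^{2}}}\right)=\frac{2a}{|r_{c}|}+O\left(|r_{c}|^{-3}\right),$$

so that

$$f(r_{c})\approx\frac{2a}{\pi^{2}r_{c}^{2}}\qquad\text{as }|r_{c}|\to\infty.$$

This is the same $r_{c}^{-2}$ decay as the Cauchy density $\frac{1}{\pi\left(x^{2}+1\right)}$: the integral of $|r_{c}|\,f(r_{c})$ diverges logarithmically and that of $r_{c}^{2}\,f(r_{c})$ diverges linearly.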

4 Monte Carlo Verification

We will now verify the accuracy of our approximation by randomly generating $100,000$ values of $\frac{n\cdot r_{c}}{\ln n}$, using $n=400$. The following Mathematica program does this and plots the corresponding histogram together with our approximate PDF from Section 3.

G[x_, y_] := x.y/Sqrt[x.x y.y]  (* centralized correlation r_c of two samples *)
n = 400; superN = 100000;
(* superN simulated values of n r_c/Log[n], each from fresh Cauchy samples of size n *)
data = Table[(n/Log[n]) G[RandomVariate[CauchyDistribution[0, 1], n], RandomVariate[CauchyDistribution[0, 1], n]], {superN}];

a = 1 - Log[Pi]/Log[n];
pdf = (2 Log[a + Sqrt[a^2 + x^2]] - Log[x^2])/(Pi^2 Sqrt[a^2 + x^2]);
(* the bin width is 0.25, so the expected bin counts are 0.25 superN pdf *)
Show[Histogram[data, {-4, 4, 0.25}], Plot[0.25 superN pdf, {x, -4, 4}, PlotRange -> {0, 0.2 superN}]]

The two results are in good agreement, as can be seen in Figure 1.
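For readers without Mathematica, a comparable check can be sketched in Python using only the standard library. This is not the authors' program: the function names, the seed, the reduced number of trials (2000 instead of 100,000) and the choice of comparing a single probability, $\Pr\left(\left|\frac{n\cdot r_{c}}{\ln n}\right|\leq 1\right)$, instead of a full histogram are all assumptions made to keep the run short.

```python
import math
import random


def standard_cauchy(rng):
    """Inverse-CDF draw from the Cauchy distribution with median 0, quartile deviation 1."""
    return math.tan(math.pi * (rng.random() - 0.5))


def normalized_rc(n, rng):
    """One simulated value of n * r_c / ln(n) from two fresh samples of size n."""
    xs = [standard_cauchy(rng) for _ in range(n)]
    ys = [standard_cauchy(rng) for _ in range(n)]
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n / math.log(n)) * sxy / math.sqrt(sxx * syy)


def approx_pdf(x, n):
    """Approximate PDF of n * r_c / ln(n), with a = 1 - ln(pi)/ln(n)."""
    a = 1.0 - math.log(math.pi) / math.log(n)
    root = math.sqrt(a * a + x * x)
    return (2.0 * math.log(a + root) - math.log(x * x)) / (math.pi ** 2 * root)


def approx_prob_within(limit, n, steps=20000):
    """Midpoint-rule integral of the approximate PDF over (-limit, limit)."""
    h = limit / steps
    return 2.0 * h * sum(approx_pdf((k + 0.5) * h, n) for k in range(steps))


rng = random.Random(2018)   # arbitrary fixed seed
n, trials = 400, 2000       # n as in the text; far fewer trials than the text uses
values = [normalized_rc(n, rng) for _ in range(trials)]

empirical = sum(abs(v) <= 1.0 for v in values) / trials
predicted = approx_prob_within(1.0, n)
print(f"empirical={empirical:.3f}, predicted={predicted:.3f}")
```

With only 2000 trials the empirical fraction carries a Monte Carlo standard error of about 0.01, so this comparison is much coarser than the histogram described above.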

5 Conclusion

We have derived an asymptotic distribution of the centralized sample correlation coefficient when sampling from the Cauchy distribution, discovering some of its unusual properties. As a byproduct, several other interesting distributions have been introduced in the process.

Bibliography

[1] Tan, P.-N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Pearson Addison Wesley, Boston (2005)
[2] Singhal, A.: Modern Information Retrieval: A Brief Overview. Bul. IEEE on Data Engineering, Vol. 24, No. 4, 35-43 (2001)
[3] Giller, L. G.: The Statistical Properties of Random Bitstreams and the Sampling Distribution of Cosine Similarity. (2012) doi:10.2139/ssrn.2167044
[4] Johnson, N. L., Kotz, S., Balakrishnan, N.: Continuous Univariate Distributions, Volume 1, Second Edition. Wiley, New York (1994)
[5] Billingsley, P.: Probability and Measure, 3rd Edition. John Wiley & Sons, New York (1995)
