Detection of low dimensionality and data denoising via set estimation techniques

Catherine Aaron; Alejandro Cholaquidis; Antonio Cuevas

arXiv:1702.05193·math.ST·November 6, 2017


TL;DR

This paper investigates set and manifold estimation from random samples, focusing on identifying lower-dimensional structures and denoising data, with theoretical guarantees and practical illustrations.

Contribution

It introduces methods for determining the dimensionality of sets, estimating lower-dimensional manifolds, and denoising data based on set estimation theories.

Findings

01. Proposes procedures to identify whether a set is full-dimensional or lower-dimensional.

02. Develops algorithms to estimate lower-dimensional manifolds from noisy data.

03. Provides theoretical guarantees and simulation results demonstrating effectiveness.

Abstract

This work is closely related to the theories of set estimation and manifold estimation. Our object of interest is a, possibly lower-dimensional, compact set $S \subset \mathbb{R}^{d}$. The general aim is to identify (via stochastic procedures) some qualitative or quantitative features of $S$, of geometric or topological character. The available information is just a random sample of points drawn on $S$. The term "to identify" means here to achieve a correct answer almost surely (a.s.) when the sample size tends to infinity. More specifically the paper aims at giving some partial answers to the following questions: is $S$ full dimensional? Is $S$ "close to a lower dimensional set" $\mathcal{M}$? If so, can we estimate $\mathcal{M}$ or some functionals of $\mathcal{M}$ (in particular, the Minkowski content of $\mathcal{M}$)? As an important auxiliary tool in the answers of these…


Tables

Table 1: Minimum sample sizes required to detect lower dimensionality for different values of the dimension $d$ and the width parameter $A$.

$A$	$d = 2$	$d = 3$	$d = 4$
0	$\leq 50$	$\leq 50$	$\leq 50$
0.01	$[51, 100]$	$[1001, 2000]$	$> 10000$
0.05	$\leq 50$	$[201, 300]$	$[1001, 2000]$
0.1	$\leq 50$	$[51, 100]$	$[101, 200]$
0.2	$\leq 50$	$\leq 50$	$[51, 100]$
0.3	$\leq 50$	$\leq 50$	$[51, 100]$
0.4	$\leq 50$	$\leq 50$	$\leq 50$
0.5	$\leq 50$	$\leq 50$	$\leq 50$

Table 2: Radius $r_{0}(n,d)$ for Minkowski contents estimation when $R_{1}=0.2$

	$n = 10^{3}$	$n = 10^{4}$	$n = 10^{5}$	$n = 10^{6}$
$d = 2$	$0.11$	$0.1$	$0.08$	$0.07$
$d = 3$	$0.14$	$0.14$	$0.13$	$0.12$
$d = 4$	$0.16$	$0.16$	$0.16$	$0.15$

Table 3: Relative errors (in percentage) for Minkowski contents estimation

$d$	$R_{1}$	$n = 10^{3}$	$n = 10^{4}$	$n = 10^{5}$	$n = 10^{6}$
$2$	$0$	$0.38$	$0.34$	$0.32$	$0.3$
$2$	$0.2$	$27.29$	$11.79$	$5.33$	$2.37$
$3$	$0$	$4$	$0.88$	$0.37$	$0.32$
$3$	$0.2$	$37.03$	$28.76$	$19.95$	$11.35$
$4$	$0$	$16.1$	$4.34$	$1.23$	$0.45$
$4$	$0.2$	$91.37$	$54.85$	$26.69$	$25.88$

Equations

$$d_{H}(A,C)=\inf\{\varepsilon>0:\ \text{such that }A\subset B(C,\varepsilon)\text{ and }C\subset B(A,\varepsilon)\}.$$

$$C_{r}(S)=\bigcap_{\{\mathring{\mathcal{B}}(x,r):\ \mathring{\mathcal{B}}(x,r)\cap S=\emptyset\}}\big(\mathring{\mathcal{B}}(x,r)\big)^{c},$$

$$\nu(\mathcal{B}(x,\varepsilon)\cap S)\geq\delta\mu_{d}(\mathcal{B}(x,\varepsilon)),\quad 0<\varepsilon\leq\lambda.$$

$$\nu(\mathcal{B}(x,\varepsilon)\cap S)\geq\nu(\mathcal{B}(x,\varepsilon)\cap\mathcal{B}(z,r))\geq f_{0}\mu_{d}(\mathcal{B}(x,\varepsilon)\cap\mathcal{B}(z,r))\geq\frac{f_{0}}{3}\mu_{d}(\mathcal{B}(x,\varepsilon)).$$

$$\mathcal{H}^{r}_{\delta}(E)=\inf\Big\{\sum_{j=1}^{\infty}({\rm diam}(B_{j}))^{r}:\ E\subset\bigcup_{j=1}^{\infty}B_{j},\ {\rm diam}(B_{j})\leq\delta\Big\},$$

$$\dim_{H}(E)=\inf\{r\geq 0:\ \mathcal{H}^{r}(E)=0\}=\sup\big(\{r\geq 0:\ \mathcal{H}^{r}(E)=\infty\}\cup\{0\}\big).$$

$$L_{0}^{d^{\prime}}(\mathcal{M})=\lim_{\varepsilon\to 0}\frac{\mu_{d}(B(\mathcal{M},\varepsilon))}{\omega_{d-d^{\prime}}\varepsilon^{d-d^{\prime}}},$$

$$\hat{S}_{n}(r)=\bigcup_{i=1}^{n}\mathcal{B}(X_{i},r).$$

$$\text{if }\Big(\frac{C\log(n)}{\delta\omega_{d}n}\Big)^{1/d}\leq r_{n}\leq\rho_{0}/2\text{ for a given }C>1\text{ then eventually a.s. for all }y\in\mathcal{B}(x_{0},2r_{n})\text{ we have }\mathring{\mathcal{B}}(y,r_{n})\cap\mathcal{X}_{n}\neq\emptyset.$$

$$p_{n}=P_{X}\Big(\exists y\in\mathcal{B}(x_{0},2r_{n}),\ \mathring{\mathcal{B}}(y,r_{n})\cap\mathcal{X}_{n}=\emptyset\Big),$$

$$p_{n}\leq\sum_{i=1}^{\nu_{n}}P_{X}\Big(\mathcal{B}\big(t_{i},r_{n}(1-\varepsilon_{n})\big)\cap\mathcal{X}_{n}=\emptyset\Big).$$

$$P_{X}\Big(\mathcal{B}\big(t_{i},r_{n}(1-\varepsilon_{n})\big)\cap\mathcal{X}_{n}=\emptyset\Big)=\Big(1-P_{X}\big(\mathcal{B}(t_{i},r_{n}(1-\varepsilon_{n}))\big)\Big)^{n}.$$

$$P_{X}\Big(\mathcal{B}\big(t_{i},r_{n}(1-\varepsilon_{n})\big)\cap\mathcal{X}_{n}=\emptyset\Big)\leq\cdots\leq\cdots$$

$$p_{n}\leq\tau_{d}\varepsilon_{n}^{-d}\Big(1-C\frac{\log(n)}{n}\big(1-\varepsilon_{n}\big)^{d}\Big)^{n}\leq\tau_{d}\varepsilon_{n}^{-d}n^{-C(1-\varepsilon_{n})^{d}},$$

$$\|n(s)-n(t)\|\leq\frac{1}{r_{0}}\|s-t\|\quad\text{for all }s,t\in\partial S$$

$$\frac{nr_{n}^{d}\omega_{d}}{\log(n)\beta^{d}}\rightarrow\max\Big\{\frac{1}{f_{0}},\frac{2(d-1)}{df_{1}}\Big\}\geq\frac{1}{f_{0}}.$$

$$r_{n}\geq\Big(\frac{\log(n)}{n}\frac{\beta^{d}}{\omega_{d}2f_{0}}\Big)^{1/d},$$

$$\max_{i}\min_{j\neq i}\gamma(X_{i},X_{j})=\mathcal{O}\Big(\Big(\frac{\log n}{n}\Big)^{1/d^{\prime}}\Big),\quad\text{a.s.},$$

$$\hat{R}_{n}-R_{1}\leq 2\varepsilon_{n}\quad\text{for }n\text{ large enough},$$

$$\hat{R}_{n}-R_{1}>C\quad\text{for }n\text{ large enough}.$$

$$\hat{S}_{n}(\varepsilon_{n})\subset B(\mathring{S},\varepsilon_{n}).$$

$$S\subset\hat{S}_{n}(\varepsilon_{n}),$$

$$\left|\tilde{R}_{n}-R_{1}\right|=\mathcal{O}\big(\log(n)/n\big)^{\min(1/(d-d^{\prime}),2/(d+1))},$$

$$\tilde{R}_{n}-R_{1}>C\quad\text{for }n\text{ large enough}.$$

$$d_{H}\big(\partial C_{r}(\mathcal{Y}_{n}),\partial S\big)=\mathcal{O}\big((\log(n)/n)^{2/(d+1)}\big),\text{ a.s.}$$

$$B\big(\mathcal{M},R_{1}-d_{H}(\partial C_{r}(\mathcal{Y}_{n}),\partial S)\big)\subset C_{r}(\mathcal{Y}_{n}).$$

$$B\big(\partial S,d_{H}(\partial C_{r}(\mathcal{Y}_{n}),\partial S)\big)=B\big(\mathcal{M},R_{1}+d_{H}(\partial C_{r}(\mathcal{Y}_{n}),\partial S)\big)\setminus\mathring{B}\big(\mathcal{M},R_{1}-d_{H}(\partial C_{r}(\mathcal{Y}_{n}),\partial S)\big),$$

$$\tilde{R}_{n}\leq R_{1}.$$

$$\tilde{R}_{n}\geq R_{1}-d_{H}(\partial C_{r}(\mathcal{Y}_{n}),\partial S)-\min_{i}d(Y_{i},\mathcal{M}).$$

$$\mu_{d}\left(\mathcal{B}\big(\mathcal{M},(A\log(n)/n)^{1/(d-d^{\prime})}\big)\right)\geq c_{\mathcal{M}}A\log(n)/n.$$


Full text

Detection of low dimensionality and data denoising via set estimation techniques

Catherine Aaron (a), Alejandro Cholaquidis (b) and Antonio Cuevas (c)

(a) Université Blaise-Pascal Clermont II, France

(b) Centro de Matemática, Universidad de la República, Uruguay

(c) Departamento de Matemáticas, Universidad Autónoma de Madrid

Abstract

This work is closely related to the theories of set estimation and manifold estimation. Our object of interest is a, possibly lower-dimensional, compact set $S\subset{\mathbb{R}}^{d}$ . The general aim is to identify (via stochastic procedures) some qualitative or quantitative features of $S$ , of geometric or topological character. The available information is just a random sample of points drawn on $S$ . The term “to identify” means here to achieve a correct answer almost surely (a.s.) when the sample size tends to infinity. More specifically the paper aims at giving some partial answers to the following questions: is $S$ full dimensional? Is $S$ “close to a lower dimensional set” $\mathcal{M}$ ? If so, can we estimate $\mathcal{M}$ or some functionals of $\mathcal{M}$ (in particular, the Minkowski content of $\mathcal{M}$ )? As an important auxiliary tool in the answers of these questions, a denoising procedure is proposed in order to partially remove the noise in the original data. The theoretical results are complemented with some simulations and graphical illustrations.

1 Introduction

The general setup and some related literature. The emerging statistical field currently known as *manifold estimation* (or, sometimes, statistics on manifolds, or manifold learning) is the result of the confluence of, at least, three classical theories: (a) the analysis of directional (or circular) data, Mardia and Jupp (2000), Bhattacharya and Patrangenaru (2008), where the aims are similar to those of classical statistics but the data are supposed to be drawn on the sphere or, more generally, on a lower-dimensional manifold; (b) the study of non-linear methods of dimension reduction, Delicado (2001), Hastie and Stuetzle (1989), aiming at recovering a lower-dimensional structure from random points taken around it, and (c) some techniques of stochastic geometry, Chazal and Lieutier (2005), and set estimation, Cuevas and Fraiman (2010), Cholaquidis et al. (2014), Cuevas et al. (2007), whose purpose is to estimate some relevant quantities of a set (or the set itself) from the information provided by a random sample whose distribution is closely related to the set.

There are also strong connections with the theories of persistent homology and computational topology, Carlsson (2009), Niyogi, Smale and Weinberger (2011), Fasy et al. (2014), Cavanna et al (2015).

In all these studies, from different points of view, the general aim is similar: one wants to get information (very often of geometric or topological type) on a set from a sample of points. To be more specific, let us mention some recent references on these topics, roughly grouped according to subject (the list is far from exhaustive):

Manifold recovery from a sample of points, Genovese et al. (2012b); Genovese et al (2012c).

Inference on dimension, Fefferman et al. (2016), Brito et al. (2013).

Estimation of measures (perimeter, surface area, curvatures), Cuevas et al. (2007), Jiménez and Yukich (2011), Berrendero et al. (2014).

Estimation of some other relevant quantities in a manifold, Niyogi, Smale and Weinberger (2008), Chen and Müller (2012).

Dimensionality reduction, Genovese et al. (2012a), Tenenbaum et al. (2000).

The problems under study. The contents of the paper. We are interested in getting some information (in particular, regarding dimensionality and Minkowski content) about a compact set ${\mathcal{M}}\subset{\mathbb{R}}^{d}$ . While the set ${\mathcal{M}}$ is typically unknown, we are supposed to have a random sample of points $X_{1},\ldots,X_{n}$ whose distribution $P_{X}$ has a support “close to ${\mathcal{M}}$ ”. To be more specific, we consider two different models:

The noiseless model: the support of $P_{X}$ is ${\mathcal{M}}$ itself; Aamari and Levrard (2015), Amenta et al. (2002), Cholaquidis et al. (2014), Cuevas and Fraiman (1997).
The parallel (noisy) model: the support of $P_{X}$ is the parallel set $S$ of points within a distance to ${\mathcal{M}}$ smaller than $R_{1}$ , for some $R_{1}>0$ , where $\mathcal{M}$ is a $d^{\prime}$ -dimensional set and $d^{\prime}\leq d$ ; Berrendero et al. (2014). Note that other different models “with noise” are considered in Genovese et al. (2012a), Genovese et al. (2012b) and Genovese et al (2012c).
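As a concrete illustration of the two models, the following Python sketch draws points when $\mathcal{M}$ is the unit circle in $\mathbb{R}^{2}$. The choice of $\mathcal{M}$, the uniform sampling distributions and the function names `sample_noiseless` and `sample_parallel` are assumptions made here for illustration only; the models themselves constrain just the support of $P_{X}$.

```python
import math
import random

def sample_noiseless(n, rng):
    """Draw n points uniformly on the unit circle (support equal to M)."""
    pts = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        pts.append((math.cos(theta), math.sin(theta)))
    return pts

def sample_parallel(n, r1, rng):
    """Draw n points uniformly on the r1-parallel set of the unit circle,
    i.e. the annulus {x : | ||x|| - 1 | <= r1}, assuming 0 < r1 < 1."""
    inner, outer = 1.0 - r1, 1.0 + r1
    pts = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        # Sampling the squared radius uniformly gives a uniform density in area.
        rho = math.sqrt(rng.uniform(inner ** 2, outer ** 2))
        pts.append((rho * math.cos(theta), rho * math.sin(theta)))
    return pts

rng = random.Random(0)                  # fixed seed, for reproducibility only
clean = sample_noiseless(200, rng)      # noiseless model
noisy = sample_parallel(200, 0.2, rng)  # parallel model with R_1 = 0.2
```

Here every point of `clean` lies on $\mathcal{M}$, while every point of `noisy` lies within distance $R_{1}=0.2$ of $\mathcal{M}$.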

In Section 3 we first develop, under the noiseless model, an algorithmic procedure to identify, eventually, almost surely (a.s.), whether or not $\mathcal{M}$ has an empty interior; this is achieved in Theorems 1 and 2 below. A positive answer would essentially entail (under some conditions, see the beginning of Section 3) that $\mathcal{M}$ has a dimension smaller than that of the ambient space.

Then, assuming the noisy model and $\mathring{{\mathcal{M}}}=\emptyset$ (where $\mathring{\mathcal{M}}$ denotes the interior of $\mathcal{M}$), Theorems 3 (i) and 4 (i) provide two methods for the estimation of the maximum level of noise $R_{1}$, giving also the corresponding convergence rates. If $R_{1}$ is known in advance, the remaining results in Theorems 3 and 4 also allow us to decide whether or not the “inside set” ${\mathcal{M}}$ has an empty interior.

The identification methods are “algorithmic” in the sense that they are based on automatic procedures to perform them with arbitrary precision. This will require imposing some regularity conditions on ${\mathcal{M}}$ or $S$. Section 2 includes all the relevant definitions, notations and basic geometric concepts we will need.

In Section 4 we consider again the noisy model where the data are drawn on the $R_{1}$-parallel set around a lower dimensional set ${\mathcal{M}}$. We propose a method to “denoise” the sample, which essentially amounts to estimating $\mathcal{M}$ from sample data drawn on the parallel set $S$ around $\mathcal{M}$.

In Section 5 we consider the problem of estimating the $d^{\prime}$ -dimensional Minkowski measure of ${\mathcal{M}}$ under both the noiseless and the noisy model. We assume throughout the section that the dimension $d^{\prime}$ (in Hausdorff sense, see below) of the set $\mathcal{M}$ is known.

Finally, in Section 6 we present some simulations and numerical illustrations.

2 Some geometric background

This section makes explicit the notation, basic concepts and definitions (mostly of geometric character) that we will need in the rest of the paper.

Some notation. Given a set $S\subset\mathbb{R}^{d}$ , we will denote by $\mathring{S}$ , $\overline{S}$ , $\partial S$ and $S^{c}$ , the interior, closure, boundary and complement of $S$ respectively, with respect to the usual topology of $\mathbb{R}^{d}$ . Let us denote $d(y,S)=\inf_{x\in S}\|y-x\|$ for $y\in{\mathbb{R}}^{d}$ , where $\|\cdot\|$ stands for the Euclidean norm. We will also denote $\rho(S)=\sup_{x\in S}d(x,\partial S)$ . Notice that $\rho(S)>0$ is equivalent to $\mathring{S}\neq\emptyset$ .

The parallel set of $S$ of radius $\varepsilon$ will be denoted as $B(S,\varepsilon)$, that is $B(S,\varepsilon)=\{y\in{\mathbb{R}}^{d}:\ \inf_{x\in S}\|y-x\|\leq\varepsilon\}$. If $A\subset\mathbb{R}^{d}$ is a Borel set, then $\mu_{d}(A)$ (sometimes just $\mu(A)$) will denote its Lebesgue measure. We will denote by $\mathcal{B}(x,\varepsilon)$ (or $\mathcal{B}_{d}(x,\varepsilon)$, when necessary) the closed ball in $\mathbb{R}^{d}$, of radius $\varepsilon$, centred at $x$, and $\omega_{d}=\mu_{d}(\mathcal{B}_{d}(x,1))$. Given two compact non-empty sets $A,C\subset{\mathbb{R}}^{d}$, the *Hausdorff distance* or *Hausdorff-Pompeiu distance* between $A$ and $C$ is defined by

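$$d_{H}(A,C)=\inf\{\varepsilon>0:\ \text{such that }A\subset B(C,\varepsilon)\text{ and }C\subset B(A,\varepsilon)\}. \tag{1}$$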
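For two finite point sets the infimum in the definition of the Hausdorff distance is attained, and it equals the larger of the two directed "farthest nearest-neighbour" distances. A minimal Python sketch of that computation follows; the function names are illustrative and not taken from the paper.

```python
import math

def directed_hausdorff(a, c):
    """Largest distance from a point of a to its nearest point in c."""
    return max(min(math.dist(x, y) for y in c) for x in a)

def hausdorff(a, c):
    """Hausdorff distance between two finite, non-empty point sets."""
    return max(directed_hausdorff(a, c), directed_hausdorff(c, a))

A = [(0.0, 0.0), (1.0, 0.0)]
C = [(0.0, 0.0)]
d_AC = hausdorff(A, C)  # C is within 0 of A, but (1, 0) is at distance 1 from C
```

The brute-force double loop costs $O(|A|\,|C|)$ distance evaluations, which is adequate for small samples.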

Some geometric regularity conditions for sets. The following conditions have been used many times in set estimation; see, e.g., Niyogi, Smale and Weinberger (2008), Genovese et al. (2012b), Cuevas and Fraiman (2010) and references therein.

Definition 1.

Let $S\subset\mathbb{R}^{d}$ be a closed set. The set $S$ is said to satisfy the outside $r$ -rolling condition if for each boundary point $s\in\partial S$ there exists some $x\in S^{c}$ such that $\mathcal{B}(x,r)\cap\partial S=\{s\}$ . A compact set $S$ is said to satisfy the inside $r$ -rolling condition if $\overline{S^{c}}$ satisfies the outside $r$ -rolling condition at all boundary points.

Definition 2.

A set $S\subset\mathbb{R}^{d}$ is said to be $r$ -convex, for $r>0$ , if $S=C_{r}(S),$ where

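$$C_{r}(S)=\bigcap_{\{\mathring{\mathcal{B}}(x,r):\ \mathring{\mathcal{B}}(x,r)\cap S=\emptyset\}}\big(\mathring{\mathcal{B}}(x,r)\big)^{c}, \tag{2}$$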

is the $r$ -convex hull of $S$ . When $S$ is $r$ -convex, a natural estimator of $S$ from a random sample $\mathcal{X}_{n}$ of points (drawn on a distribution with support $S$ ), is $C_{r}(\mathcal{X}_{n})$ .

Following the notation in Federer (1959), let $\text{Unp}(S)$ be the set of points $x\in\mathbb{R}^{d}$ with a unique projection on $S$.

Definition 3.

For $x\in S$, let $\text{reach}(S,x)=\sup\{r>0:\ \mathring{\mathcal{B}}(x,r)\subset\text{Unp}(S)\}$. The reach of $S$ is defined by $\text{reach}(S)=\inf\{\text{reach}(S,x):\ x\in S\}$, and $S$ is said to be of positive reach if $\text{reach}(S)>0$.

The study of sets with positive reach was started by Federer (1959); see Thäle (2008) for a survey. This is now a major topic in different problems of manifold learning or topological data analysis. See, e.g., Adler et al. (2016) for a recent reference.

The conditions established in Definitions 1, 2 and 3 have an obvious mutual affinity. In fact, they are collectively referred to as “rolling properties” in Cuevas, Fraiman and Pateiro-López (2012). However, they are not equivalent: if the reach of $S$ is $r$ then $S$ is $r$ -convex, which in turn implies the (outer) $r$ -rolling condition. The converse implications are not true in general; see Cuevas, Fraiman and Pateiro-López (2012) for details.

Definition 4.

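A set $S\subset\mathbb{R}^{d}$ is said to be standard with respect to a Borel measure $\nu$ at a point $x$ if there exist $\lambda>0$ and $\delta>0$ such that

$$\nu(\mathcal{B}(x,\varepsilon)\cap S)\geq\delta\mu_{d}(\mathcal{B}(x,\varepsilon)),\quad 0<\varepsilon\leq\lambda. \tag{3}$$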

A set $S\subset\mathbb{R}^{d}$ is said to be standard if (3) holds for all $x\in S$ .

The following results will be useful below. The first one establishes a simple connection between standardness and the inside $r$ -rolling condition. The second one (whose proof can be found in Pateiro-López and Rodríguez-Casal (2009)) relates the rolling condition with the reach property.

Proposition 1.

Let $S\subset\mathbb{R}^{d}$ be the support of a Borel measure $\nu$ whose density $f$ with respect to the Lebesgue measure is bounded from below by $f_{0}$. If $S$ satisfies $\text{reach}(\overline{S^{c}})\geq r$, then it is standard with respect to $\nu$, for any $\delta\leq f_{0}/3$ and $\lambda=r$.

Proof.

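Let $0<\varepsilon\leq r$ and $x\in S$. If $d(x,\partial S)\geq r$, then $\mathcal{B}(x,\varepsilon)\subset S$ and the result is obvious. Otherwise $d(x,\partial S)<r$ and, since $\text{reach}(\overline{S^{c}})\geq r$, there exists $z\in\mathbb{R}^{d}$ such that $x\in\mathcal{B}(z,r)\subset S$. Then, for all $\varepsilon\leq r$,

$$\nu(\mathcal{B}(x,\varepsilon)\cap S)\geq\nu(\mathcal{B}(x,\varepsilon)\cap\mathcal{B}(z,r))\geq f_{0}\mu_{d}(\mathcal{B}(x,\varepsilon)\cap\mathcal{B}(z,r))\geq\frac{f_{0}}{3}\mu_{d}(\mathcal{B}(x,\varepsilon)).$$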

∎
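The last inequality in the proof of Proposition 1 rests on the geometric fact that a ball $\mathcal{B}(x,\varepsilon)$ with $x\in\mathcal{B}(z,r)$ and $\varepsilon\leq r$ keeps at least a third of its volume inside $\mathcal{B}(z,r)$. The sketch below checks this numerically in the plane only ($d=2$), placing $x$ on the boundary of $\mathcal{B}(z,r)$, which is assumed here to be the least favourable position. It uses the classical closed-form area of the intersection of two discs; the function names are illustrative.

```python
import math

def disc_intersection_area(c, s, big_r):
    """Area of the intersection of two planar discs of radii s and big_r
    whose centres are at distance c (assumes the two circles cross)."""
    a1 = s * s * math.acos((c * c + s * s - big_r * big_r) / (2.0 * c * s))
    a2 = big_r * big_r * math.acos(
        (c * c + big_r * big_r - s * s) / (2.0 * c * big_r))
    a3 = 0.5 * math.sqrt(
        (-c + s + big_r) * (c + s - big_r) * (c - s + big_r) * (c + s + big_r))
    return a1 + a2 - a3

def boundary_ratio(eps, r=1.0):
    """mu_2(B(x,eps) ∩ B(z,r)) / mu_2(B(x,eps)) for x on the boundary of B(z,r)."""
    return disc_intersection_area(r, eps, r) / (math.pi * eps * eps)

# Scan eps over (0, r] with r = 1.
ratios = [boundary_ratio(k / 100.0) for k in range(1, 101)]
smallest = min(ratios)
```

On this grid the ratio stays above $1/3$; at $\varepsilon=r$ it equals $2/3-\sqrt{3}/(2\pi)$, and it approaches $1/2$ as $\varepsilon\to 0$.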

Proposition 2 (Lemma 2.3 in Pateiro-López and Rodríguez-Casal (2009)).

Let $S\subset\mathbb{R}^{d}$ be a non-empty closed set. If $S$ satisfies the inside and outside $r$ -rolling condition, then $\text{reach}(\partial S)\geq r$ .

Some basic definitions on manifolds. The following basic concepts are stated here for the sake of completeness and notational clarity. More complete information on these topics can be found, for example, in the classical textbooks Boothby (1975) and Do Carmo (1992). See also the book Galbis and Maestre (2010) and the summary (Zhang, 2011, chapter 3). Let us start with the classical concept of sub-manifold in ${\mathbb{R}}^{d}$ (often referred to simply as “manifold”). Denote by ${\mathbb{R}}^{k}_{+}$ the half-space ${\mathbb{R}}^{k}_{+}=\{x\in{\mathbb{R}}^{k}:\ x_{k}\geq 0\}$ .

Definition 5.

A topological sub-manifold ${\mathcal{M}}$ of dimension $k$ in ${\mathbb{R}}^{d}$ is a subset of ${\mathbb{R}}^{d}$ with $k\leq d$ such that every point in ${\mathcal{M}}$ has a neighborhood homeomorphic either to ${\mathbb{R}}^{k}$ or to ${\mathbb{R}}^{k}_{+}$ .

Those points of $\mathcal{M}$ having no neighborhood homeomorphic to ${\mathbb{R}}^{k}$ are called boundary points. If the boundary of $\mathcal{M}$ (i.e. the set of boundary points of $\mathcal{M}$ ) is empty we will say that ${\mathcal{M}}$ is a (sub-)manifold without boundary.

We will say that a manifold without boundary ${\mathcal{M}}$ is a regular $k$-surface, or a differentiable $k$-manifold of class $p\geq 1$, if there is a family (often called atlas) ${\mathcal{V}}=\{(V_{\alpha},x_{\alpha})\}$ of pairs $(V_{\alpha},x_{\alpha})$ (often called parametrizations, coordinate systems or charts) such that the $V_{\alpha}$ are open sets in ${\mathbb{R}}^{k}$ and the $x_{\alpha}:V_{\alpha}\rightarrow{\mathcal{M}}$ are functions of class $p$ satisfying: (i) ${\mathcal{M}}=\cup_{\alpha}x_{\alpha}(V_{\alpha})$, (ii) every $x_{\alpha}$ is a homeomorphism between $V_{\alpha}$ and $x_{\alpha}(V_{\alpha})$ and (iii) for every $u\in V_{\alpha}$ the differential $dx_{\alpha}(u):{\mathbb{R}}^{k}\rightarrow{\mathbb{R}}^{d}$ is injective.

A manifold with boundary $\mathcal{M}$ is said to be a regular $k$ -surface if the set of interior points in $\mathcal{M}$ is a regular $k$ -surface.

A manifold ${\mathcal{M}}$ is said to be compact when it is compact as a topological space. As a direct consequence of the definition of compactness, any compact differentiable manifold has a finite atlas. Typically, in most relevant cases the required atlas for a differentiable manifold has, at most, a denumerable set of charts.

An equivalent definition of the notion of manifold (see Do Carmo (1992, Def 2.1, p. 2)) can be stated in terms of parametrizations or coordinate systems of type $(U_{\alpha},\varphi_{\alpha})$ with $\varphi_{\alpha}:U_{\alpha}\subset{\mathcal{M}}\rightarrow{\mathbb{R}}^{k}$. The conditions would be completely similar to the previous ones, except that the $\varphi_{\alpha}$ are defined in a reverse way to that of Definition 5.

In the simplest case, just one chart $x:V\rightarrow\mathcal{M}$ is needed. The structures defined in this way are sometimes called planar manifolds.

Some background on geometric measure theory. The important problem of defining lower-dimensional measures (surface measure, perimeter, etc.) has been tackled in different ways. The book by Mattila (1995) is a classical reference. We first recall the so-called Hausdorff measure. It is defined for any separable metric space $({\mathcal{M}},\rho)$ . Given $\delta,r>0$ and $E\subset{\mathcal{M}}$ , let

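$$\mathcal{H}^{r}_{\delta}(E)=\inf\Big\{\sum_{j=1}^{\infty}({\rm diam}(B_{j}))^{r}:\ E\subset\bigcup_{j=1}^{\infty}B_{j},\ {\rm diam}(B_{j})\leq\delta\Big\},$$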

where ${\rm diam}(B)=\sup\{\rho(x,y):x,y\in B\}$ , $\inf\emptyset=\infty$ . Now, define ${\mathcal{H}}^{r}(E)=\lim_{\delta\to 0}{\mathcal{H}}^{r}_{\delta}(E)$ .

The set function ${\mathcal{H}}^{r}$ is an outer measure. If we restrict ${\mathcal{H}}^{r}$ to the measurable sets (according to the standard Carathéodory definition) we get the $r$-dimensional Hausdorff measure on ${\mathcal{M}}$.

The Hausdorff dimension of a set $E$ is defined by

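$$\dim_{H}(E)=\inf\{r\geq 0:\ \mathcal{H}^{r}(E)=0\}=\sup\big(\{r\geq 0:\ \mathcal{H}^{r}(E)=\infty\}\cup\{0\}\big).$$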

It can be proved that, when ${\mathcal{M}}$ is a $k$ -dimensional smooth manifold, $\dim_{H}({\mathcal{M}})=k$ .

Another popular notion to define lower-dimensional measures for the case ${\mathcal{M}}\subset{\mathbb{R}}^{d}$ is the Minkowski content. For an integer $d^{\prime}<d$ recall that $\omega_{d-d^{\prime}}=\mu_{d-d^{\prime}}({\mathcal{B}}(0,1))$ and define the $d^{\prime}$ -dimensional Minkowski content of a set ${\mathcal{M}}$ by

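$$L_{0}^{d^{\prime}}(\mathcal{M})=\lim_{\varepsilon\to 0}\frac{\mu_{d}(B(\mathcal{M},\varepsilon))}{\omega_{d-d^{\prime}}\varepsilon^{d-d^{\prime}}}, \tag{5}$$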

provided that this limit does exist.

In what follows we will often denote $L_{0}^{d^{\prime}}({\mathcal{M}})=L_{0}({\mathcal{M}})$, when the value of $d^{\prime}$ is understood. The term “content” is used here as a surrogate for “measure”, as the expression (5) does not generally lead to a true (sigma-additive) measure.
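To see the limit (5) at work in the simplest case, take $\mathcal{M}$ to be a line segment of length $\ell$ in $\mathbb{R}^{2}$, so that $d=2$, $d^{\prime}=1$ and $\omega_{1}=2$. The parallel set $B(\mathcal{M},\varepsilon)$ is a "stadium" with area $2\varepsilon\ell+\pi\varepsilon^{2}$, and the quotient in (5) equals $\ell+\pi\varepsilon/2$, which tends to $\ell$. The Python sketch below evaluates this quotient; the example and the function names are illustrative and do not come from the paper.

```python
import math

def segment_parallel_area(length, eps):
    """Exact area of B(M, eps) for a segment M of the given length in R^2:
    a rectangle of size length x (2*eps) plus two half-discs of radius eps."""
    return 2.0 * eps * length + math.pi * eps * eps

def minkowski_quotient(length, eps):
    """Quotient inside the limit (5) with d = 2 and d' = 1."""
    omega_1 = 2.0  # Lebesgue measure of the unit ball [-1, 1] in R^1
    return segment_parallel_area(length, eps) / (omega_1 * eps)

# The quotient decreases towards the length of the segment as eps shrinks.
quotients = [minkowski_quotient(3.0, eps) for eps in (0.1, 0.01, 0.001)]
```

The excess over $\ell$ is exactly $\pi\varepsilon/2$, the contribution of the two end caps.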

A compact set ${\mathcal{M}}\subset{\mathbb{R}}^{d}$ is said to be $d^{\prime}$ -rectifiable if there exists a compact set $K\subset{\mathbb{R}}^{d^{\prime}}$ and a Lipschitz function $f:{\mathbb{R}}^{d^{\prime}}\rightarrow{\mathbb{R}}^{d}$ such that ${\mathcal{M}}=f(K)$ . Theorem 3.2.39 in Federer (1969) proves that for a compact $d^{\prime}$ -rectifiable set ${\mathcal{M}}$ , $L_{0}^{d^{\prime}}({\mathcal{M}})={\mathcal{H}}^{d^{\prime}}(\mathcal{M})$ . More details on the relations between the rectifiability property and the structure of manifold can be found in Federer (1969) Theorem 3.2.29.

3 Checking closeness to lower dimensionality

We consider here the problem of identifying whether or not the set ${\mathcal{M}}\subset{\mathbb{R}}^{d}$ (not necessarily a manifold) has an empty interior.

Note that, if ${\mathcal{M}}\subset\mathbb{R}^{d}$ is “regular enough”, $\dim_{H}(\mathcal{M})<d$ is in fact equivalent to $\mathring{\mathcal{M}}=\emptyset$ . Indeed, in general $\dim_{H}(\mathcal{M})<d$ implies $\mathring{\mathcal{M}}=\emptyset$ . The converse implication is not always true, even for sets fulfilling the property $\mathcal{H}^{d}(\partial\mathcal{M})=0$ (see Avila and Lyubich (2007)). However it holds if $\mathcal{M}$ has positive reach, since in this case $\mathcal{H}^{d-1}(\partial\mathcal{M})<\infty$ (see the comments after Th. 7 and inequality (27) in Ambrosio, Colesanti and Villa (2008)).

Also, clearly, in the case where ${\mathcal{M}}$ is a manifold, the fact that ${\mathcal{M}}$ has an empty interior amounts to saying that its dimension is smaller than that of the ambient space.

3.1 The noiseless model

We first consider the case where the sample information follows the noiseless model explained in the Introduction, that is, the data $\mathcal{X}_{n}=\{X_{1},\ldots,X_{n}\}$ are assumed to be an i.i.d. sample of points drawn from an unknown distribution $P_{X}$ with support $\mathcal{M}\subset\mathbb{R}^{d}$. When ${\mathcal{M}}$ is a lower-dimensional set, this model can be considered as an extension of the classical theory of directional (or spherical) data, in which the sample data are assumed to follow a distribution whose support is the unit sphere in ${\mathbb{R}}^{d}$. See, e.g., Mardia and Jupp (2000).

Our main tool here will be the simple *offset* or Devroye-Wise estimator (see Devroye and Wise (1980)) given by

$$\hat{S}_{n}(r)=\bigcup_{i=1}^{n}\mathcal{B}(X_{i},r). \tag{6}$$

More specifically, we are especially interested in the “boundary balls” of $\hat{S}_{n}(r)$ .
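Since the offset estimator is a union of closed balls of radius $r$ centred at the sample points, deciding whether a point $y$ belongs to $\hat{S}_{n}(r)$ reduces to checking whether its nearest sample point is within distance $r$. A minimal Python sketch, with an illustrative function name:

```python
import math

def in_offset(y, sample, r):
    """True when y lies in the union of the closed balls B(X_i, r)."""
    return min(math.dist(y, x) for x in sample) <= r

sample = [(0.0, 0.0), (1.0, 1.0)]
near = in_offset((0.05, 0.0), sample, 0.1)  # within 0.1 of the first point
far = in_offset((0.5, 0.5), sample, 0.1)    # about 0.707 from both points
```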

Definition 6.

Given $r>0$ let $\hat{S}_{n}(r)$ be the set estimator (6) based on $\{X_{1},\ldots,X_{n}\}$. We will say that $\mathcal{B}(X_{i},r)$ is a boundary ball of $\hat{S}_{n}(r)$ if there exists a point $y\in\partial\mathcal{B}(X_{i},r)$ such that $y\in\partial\hat{S}_{n}(r)$. The “peeling” of $\hat{S}_{n}(r)$, denoted by ${\rm peel}(\hat{S}_{n}(r))$, is the union of all non-boundary balls of $\hat{S}_{n}(r)$. In other words, ${\rm peel}(\hat{S}_{n}(r))$ is the result of removing from $\hat{S}_{n}(r)$ all the boundary balls.
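The peeling operation of Definition 6 can be approximated numerically in the plane. The Python sketch below is a discretised stand-in, not the authors' implementation: it samples `n_angles` points on each circle $\partial\mathcal{B}(X_{i},r)$ and flags the ball as a boundary ball as soon as one sampled point lies in no open ball $\mathring{\mathcal{B}}(X_{j},r)$ with $j\neq i$. The function names and the angular resolution are assumptions, and a boundary arc narrower than the angular step can be missed.

```python
import math

def is_boundary_ball(i, points, r, n_angles=180):
    """Approximate test of Definition 6 for planar points: B(X_i, r) is
    flagged when some sampled point of its circle is covered by no open
    ball centred at another sample point."""
    cx, cy = points[i]
    for k in range(n_angles):
        theta = 2.0 * math.pi * k / n_angles
        yx, yy = cx + r * math.cos(theta), cy + r * math.sin(theta)
        covered = any(
            math.hypot(yx - px, yy - py) < r
            for j, (px, py) in enumerate(points)
            if j != i
        )
        if not covered:
            return True
    return False

def peel(points, r, n_angles=180):
    """Centres of the balls that survive the removal of all boundary balls."""
    return [p for i, p in enumerate(points)
            if not is_boundary_ball(i, points, r, n_angles)]

# Points on a segment (empty interior in R^2) versus a grid filling a square.
segment = [(k / 20.0, 0.0) for k in range(21)]
grid = [(a / 10.0, b / 10.0) for a in range(11) for b in range(11)]
kept_segment = peel(segment, 0.2)
kept_grid = peel(grid, 0.15)
```

For the segment every ball is a boundary ball, so nothing survives the peeling, whereas the grid keeps the balls centred well inside the square.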

The following theorem is the main result of this section. It relates, in statistical terms, the emptiness of $\mathring{\mathcal{M}}$ with ${\rm peel}(\hat{S}_{n})$ .

Theorem 1.

Let $\mathcal{M}\subset\mathbb{R}^{d}$ be a compact non-empty set. Then under the model and notations stated in the two previous paragraphs we have,

(i) if $\mathring{\mathcal{M}}=\emptyset$ , and $\mathcal{M}$ fulfills the outside rolling condition for some $r>0$ , then ${\rm peel}(\hat{S}_{n}(r^{\prime}))=\emptyset$ for any set $\hat{S}_{n}(r^{\prime})$ of type (6) with $r^{\prime}<r$ .

(ii) In the case $\mathring{\mathcal{M}}\neq\emptyset$, assume that there exists a ball $\mathcal{B}(x_{0},\rho_{0})\subset\mathring{\mathcal{M}}$ such that $\mathcal{B}(x_{0},\rho_{0})$ is standard w.r.t. $P_{X}$, with constants $\delta$ and $\lambda=\rho_{0}$ (see Definition 4). Then ${\rm peel}(\hat{S}_{n}(r_{n}))\neq\emptyset$ eventually, a.s., where $r_{n}$ is a radius sequence such that $(\kappa\frac{\log(n)}{n})^{1/d}\leq r_{n}\leq\rho_{0}/2$ for a given $\kappa>(\delta\omega_{d})^{-1}$.

Proof.

(i) To prove that ${\rm peel}(\hat{S}_{n}(r^{\prime}))=\emptyset$ for all $r^{\prime}<r$ it is enough to prove that for all $r^{\prime}<r$ and for all $i=1,\dots,n$ there exists a point $y_{i}\in\partial\mathcal{B}(X_{i},r^{\prime})$ such that $y_{i}\notin\mathcal{B}(X_{j},r^{\prime})$ for all $X_{j}\neq X_{i}$. Since $\mathcal{M}$ is closed and $\mathring{\mathcal{M}}=\emptyset$, $\partial\mathcal{M}=\mathcal{M}$. The outside rolling ball property implies that for all $X_{i}\in\mathcal{M}$ there exists $z_{i}\in\mathcal{M}^{c}$ such that $\mathcal{B}(z_{i},r)\cap\mathcal{M}=\{X_{i}\}$. Let us denote $u_{i}=(z_{i}-X_{i})/r$ and $y_{i}=X_{i}+r^{\prime}u_{i}$ (see Figure 1). Clearly $y_{i}\in\partial\mathcal{B}(X_{i},r^{\prime})$. From $\mathcal{B}(y_{i},r^{\prime})\subset\mathcal{B}(z_{i},r)$ and the outside rolling ball property we get that $\{X_{i}\}\subset\mathcal{B}(y_{i},r^{\prime})\cap\mathcal{X}_{n}\subset\mathcal{B}(z_{i},r)\cap\mathcal{M}\subset\{X_{i}\}$ so that, for all $X_{j}\neq X_{i}$, $X_{j}\notin\mathcal{B}(y_{i},r^{\prime})$ and thus $y_{i}\notin\mathcal{B}(X_{j},r^{\prime})$.

(ii) First we are going to prove that

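$$\text{if }\Big(\frac{C\log(n)}{\delta\omega_{d}n}\Big)^{1/d}\leq r_{n}\leq\rho_{0}/2\text{ for a given }C>1,\text{ then eventually a.s., for all }y\in\mathcal{B}(x_{0},2r_{n}),\text{ we have }\mathring{\mathcal{B}}(y,r_{n})\cap\mathcal{X}_{n}\neq\emptyset. \tag{7}$$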

Consider only $n\geq 3$ and let $\varepsilon_{n}=(\log(n))^{-1}$. There is a positive constant $\tau_{d}$ such that we can cover $\mathcal{B}(x_{0},2r_{n})$ with $\nu_{n}=\tau_{d}\varepsilon_{n}^{-d}$ balls of radius $r_{n}\varepsilon_{n}$ centred at $\{t_{1},\dots,t_{\nu_{n}}\}$. Let us define

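$$p_{n}=P_{X}\Big(\exists y\in\mathcal{B}(x_{0},2r_{n}),\ \mathring{\mathcal{B}}(y,r_{n})\cap\mathcal{X}_{n}=\emptyset\Big),$$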

then,

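$$p_{n}\leq\sum_{i=1}^{\nu_{n}}P_{X}\Big(\mathcal{B}\big(t_{i},r_{n}(1-\varepsilon_{n})\big)\cap\mathcal{X}_{n}=\emptyset\Big). \tag{8}$$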

Notice that for any given $i$ ,

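$$P_{X}\Big(\mathcal{B}\big(t_{i},r_{n}(1-\varepsilon_{n})\big)\cap\mathcal{X}_{n}=\emptyset\Big)=\Big(1-P_{X}\big(\mathcal{B}(t_{i},r_{n}(1-\varepsilon_{n}))\big)\Big)^{n}.$$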

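Since $r_{n}\leq\rho_{0}/2$, we have $t_{i}\in\mathcal{B}(x_{0},\rho_{0})$; then, using that $\mathcal{B}(x_{0},\rho_{0})$ is standard with the same $\delta$,

$$P_{X}\Big(\mathcal{B}\big(t_{i},r_{n}(1-\varepsilon_{n})\big)\cap\mathcal{X}_{n}=\emptyset\Big)\leq\Big(1-\delta\omega_{d}r_{n}^{d}(1-\varepsilon_{n})^{d}\Big)^{n}\leq\Big(1-C\frac{\log(n)}{n}(1-\varepsilon_{n})^{d}\Big)^{n},$$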

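which, according to (8), provides

$$p_{n}\leq\tau_{d}\varepsilon_{n}^{-d}\Big(1-C\frac{\log(n)}{n}\big(1-\varepsilon_{n}\big)^{d}\Big)^{n}\leq\tau_{d}\varepsilon_{n}^{-d}n^{-C(1-\varepsilon_{n})^{d}},$$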

where we have used that $(1-x)^{n}\leq\exp(-nx)$. Since $C>1$, we can choose $\beta>1$ such that $p_{n}/n^{-\beta}\rightarrow 0$, and hence $\sum p_{n}<\infty$. Finally, (7) follows as a direct application of the Borel-Cantelli lemma. Observe that (7) implies that $x_{0}\in\hat{S}_{n}(r_{n})$ eventually a.s. (see Figure 2), so there exists $X_{i}$ such that $x_{0}\in\mathcal{B}(X_{i},r_{n})$ eventually a.s. Again by (7) we get that, eventually a.s., for all $z\in\partial\mathcal{B}(X_{i},r_{n})$ there exists $X_{j}$ such that $z\in\mathring{\mathcal{B}}(X_{j},r_{n})$ and so $z\notin\partial\hat{S}_{n}(r_{n})$, which implies that, eventually a.s., $\mathcal{B}(X_{i},r_{n})$ is not removed by the peeling process and so ${\rm peel}(\hat{S}_{n}(r_{n}))\neq\emptyset$ eventually, a.s.

∎

Remark 1.

Some comments on Theorem 1 are in order, regarding the intuitive meaning of the result itself, the required assumptions and the involved parameters. First note that the outside rolling condition imposed in part (i) is nothing but a geometric smoothness property ruling out the existence of very sharp inward peaks in the boundary of the set. It is close, but not equivalent, to the positive reach condition, as stated in Definition 3. Clearly, the value of the parameter $r$ in Theorem 1 is a regularity condition on $\mathcal{M}$ : the larger $r$ , the more regular $\mathcal{M}$ . In general, if we want to obtain, using statistical methods, some meaningful results on the dimensionality or the interior of $\mathcal{M}$ , we will need to impose some regularity property. The advantage of the rolling condition is its simple intuitive, almost “visual”, interpretation. See Walther (1999) and Cuevas, Fraiman and Pateiro-López (2012) for further insights on the rolling condition and related properties.

Regarding part (ii): if $\mathring{\mathcal{M}}\neq\emptyset$ there must be some ball ${\mathcal{B}}(x_{0},\rho_{0})$ included in $\mathring{\mathcal{M}}$ . The standardness assumption imposed in the theorem only asks that the probability $P_{X}$ is not “too far from uniformity” on that ball. To be more specific, the probability of the intersection with ${\mathcal{B}}(x_{0},\rho_{0})$ of any small enough ball $B$ centered at a point of ${\mathcal{B}}(x_{0},\rho_{0})$ must be at least $\delta$ times the volume of $B$ . Observe that this mild condition holds, in particular, whenever $P_{X}$ has a density $f$ bounded from below by a positive constant. More insights on the meaning and use of this standardness property can be found, for example, in Cuevas and Fraiman (1997) and Rinaldo and Wasserman (2010).

Finally, about the interpretation of parts (i) and (ii) in the theorem: statement (i) is simple. It just establishes that the property $\mathring{\mathcal{M}}=\emptyset$ can be identified, with probability one, whatever the sample size, using the offset estimator (6) with any radius smaller than the assumed rolling parameter $r$ . As for part (ii), let us note that the only relevant parameter is the standardness constant $\delta$ . A conservative choice of $\delta$ would also do the job asymptotically. In this case, the identification of $\mathring{\mathcal{M}}\neq\emptyset$ is done asymptotically (eventually, a.s.) by taking the offset estimator with balls of radii $r_{n}$ depending only on $\delta$ and $n$ . The order $(\log n/n)^{1/d}$ of such radii appears typically in the convergence rates of many set estimators (see Cuevas and Fraiman (1997), Rodríguez-Casal (2007)) as well as in the theory of multivariate spacings, Janson (1987).

Hence, in summary, the method to identify whether or not $\mathring{\mathcal{M}}=\emptyset$ is completely “algorithmic” and works, under some regularity conditions on $\mathcal{M}$ , with probability one. While the situation $\mathring{\mathcal{M}}=\emptyset$ is easy to identify, the identification of $\mathring{\mathcal{M}}\neq\emptyset$ only works asymptotically.
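The test described above can be sketched numerically. The following is a minimal Monte Carlo approximation, not the authors' implementation: a ball $\mathcal{B}(X_{i},r)$ is flagged as a boundary ball when some sampled point of its sphere is not strictly inside any other ball, and the peeling is declared empty when every ball is flagged. The function names, the number of sampled directions `n_dirs` and the random discretisation of the sphere are assumptions introduced for illustration; a finite set of directions can miss a small uncovered patch, so the check is only approximate.

```python
import numpy as np


def boundary_ball_mask(X, r, n_dirs=256, seed=0):
    """Approximate boundary-ball test by sampling directions on the sphere.

    Ball i is flagged when at least one sampled point of the sphere of
    B(X_i, r) is not strictly inside any other ball B(X_j, r), j != i.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Random unit directions: a hypothetical discretisation of the sphere.
    U = rng.normal(size=(n_dirs, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    mask = np.zeros(n, dtype=bool)
    for i in range(n):
        others = np.delete(X, i, axis=0)
        if len(others) == 0:
            mask[i] = True  # a single ball is trivially a boundary ball
            continue
        sphere = X[i] + r * U  # sampled points on the sphere of ball i
        dist = np.linalg.norm(sphere[:, None, :] - others[None, :, :], axis=2)
        covered = (dist < r).any(axis=1)
        mask[i] = not covered.all()
    return mask


def peel_is_empty(X, r, **kwargs):
    """True when every ball is flagged, i.e. peeling would remove all balls."""
    return bool(boundary_ball_mask(X, r, **kwargs).all())
```

For points on a curve with a small radius every ball is flagged, whereas for a dense full-dimensional sample the balls centred well inside the set are not.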

The manifold case. If $\mathcal{M}$ is assumed to be a manifold, then, under some mild additional assumptions, the identification of low dimensionality can be done in a completely automatic (data-driven) way, with no resort to extra parameters. In other words, the radius of the balls in the auxiliary Devroye-Wise estimator can be chosen as a function of the data in such a way that it is (asymptotically) small enough to identify the situation $\mathring{\mathcal{M}}=\emptyset$ and large enough to eventually detect $\mathring{\mathcal{M}}\neq\emptyset$ , when this is the case.

Theorem 2.

Let $\mathcal{M}$ be a $d^{\prime}$ -dimensional compact manifold in ${\mathbb{R}}^{d}$ . Suppose that the sample points $X_{1},\ldots,X_{n}$ are drawn from a probability measure $P_{X}$ with support $\mathcal{M}$ which has a density $f$ with respect to the $d^{\prime}$ -dimensional Hausdorff measure on $\mathcal{M}$ , continuous on $\mathcal{M}$ and such that $f_{0}=\min_{x\in\mathcal{M}}f(x)>0$ . Let us define, for any $\beta>6^{1/d}$ , $r_{n}=\beta\max_{i}\min_{j\neq i}\|X_{j}-X_{i}\|$ . Then,

$i)$

if $d^{\prime}=d$ and $\partial\mathcal{M}$ is a $\mathcal{C}^{2}$ manifold then ${\rm peel}(\hat{S}_{n}(r_{n}))\neq\emptyset$ eventually, a.s.

$ii)$

if $d^{\prime}<d$ and $\mathcal{M}$ is a $\mathcal{C}^{2}$ manifold without boundary, then ${\rm peel}(\hat{S}_{n}(r_{n}))=\emptyset$ eventually, a.s.
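The data-driven radius of Theorem 2 is simple to compute. The sketch below assumes the sample is stored as an `(n, d)` NumPy array and uses a brute-force pairwise distance matrix, which is only reasonable for moderate $n$; the function name `data_driven_radius` is hypothetical.

```python
import numpy as np


def data_driven_radius(X, beta):
    """r_n = beta * max_i min_{j != i} ||X_j - X_i||, as in Theorem 2."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if n < 2:
        raise ValueError("at least two sample points are needed")
    if beta <= 6.0 ** (1.0 / d):
        raise ValueError("Theorem 2 requires beta > 6**(1/d)")
    # Pairwise Euclidean distances; the diagonal is excluded from the minimum.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.min(axis=1)  # nearest-neighbour distance of each point
    return beta * nearest.max()
```

The resulting radius can then be plugged into the offset estimator $\hat{S}_{n}(r_{n})$ before peeling.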

Proof.

$i)$

We will use Theorem 1 (ii). In order to do that, we will prove first that the set is standard. As $d^{\prime}=d$ , $\partial\mathcal{M}$ is a compact $\mathcal{C}^{2}$ $(d-1)$ -manifold. Then we can use the following result, due to Walther (1999).

Theorem (Walther, 1999, Th.1).- Let $S\subset\mathbb{R}^{d}$ be a compact path-connected set with $\mathring{S}\neq\emptyset$ and let $r_{0}>0$ . Then, the following conditions are equivalent

1

A ball of radius $r$ rolls freely inside $S$ and inside $\overline{S^{c}}$ for all $0\leq r\leq r_{0}$ .

2

$\partial S$ is a $(d-1)$ -dimensional $C^{1}$ submanifold in $\mathbb{R}^{d}$ with the outward pointing unit normal vector $n(s)$ at $s\in\partial S$ satisfying the Lipschitz condition

[TABLE]

In fact, the author points out that the result is also valid if the path-connectedness condition is dropped and we only assume that every path-connected component of $\mathcal{M}$ has non-empty interior. Hence, note that this result can be applied in our case for $S=\mathcal{M}$ since the ${\mathcal{C}}^{2}$ assumption on the compact hypersurface $\partial\mathcal{M}$ implies the Lipschitz condition for the outward normal vector and the assumption $\mathring{\mathcal{M}}_{1}\neq\emptyset$ for every path-connected component $\mathcal{M}_{1}$ of $\mathcal{M}$ is guaranteed by the fact that every point in $\mathcal{M}$ has a neighborhood homeomorphic to an open set in ${\mathbb{R}}_{+}^{d}$ . Thus, we may use the implication 2 $\Rightarrow$ 1 in the above theorem to conclude that $\mathcal{M}$ fulfills both the inside and outside rolling ball property for a small enough radius $r>0$ . Then, by Proposition 2, $\text{reach}(\partial\mathcal{M})\geq r$ . So, by Proposition 1, $\mathcal{M}$ satisfies the standardness condition established in Definition 4 with $\nu=P_{X}$ , $\delta=f_{0}/3$ and $\lambda<r$ . Now, in order to prove that $r_{n}$ fulfils all the conditions in Theorem 1 (ii), observe that in the full-dimensional case $d^{\prime}=d$ the intrinsic volume in $\mathcal{M}$ coincides with the restricted Lebesgue measure; see (Taylor, 2006, Prop. 12.6). As a consequence, $f$ is equal to the density of $P_{X}$ w.r.t. the Lebesgue measure restricted to $\mathcal{M}$ . Let us denote $f_{1}=\min_{x\in\partial\mathcal{M}}f(x)$ . Note that $r_{n}/\beta$ is in fact the “connectivity statistic”, that is, the minimum value of $r$ such that $\cup_{i}{\mathcal{B}}(X_{i},r)$ is a connected set. Then, as $f$ is continuous and bounded away from zero on the compact set $\mathcal{M}$ with smooth boundary, we are under the assumptions of Theorem 1.1 in Penrose (1999) so that, using this result, we can conclude that, with probability one,

[TABLE]

Then for $n$ large enough,

[TABLE]

now if we denote $\kappa=\beta^{d}/(\omega_{d}2f_{0})$ , it fulfills that $\kappa>(\delta\omega_{d})^{-1}$ , so we are in the hypotheses of Theorem 1 (ii) and then we can conclude ${\rm peel}(\hat{S}_{n}(r_{n}))\neq\emptyset$ eventually, with probability 1.

$ii)$

Notice that we can indeed use Theorem 1 (i): as $\mathcal{M}$ is a $\mathcal{C}^{2}$ compact manifold of ${\mathbb{R}}^{d}$ , by (Thäle, 2008, Prop. 14) it has positive reach and, thus, it satisfies the outside rolling ball condition (for some radius $r>0$ ). Then it remains to be proved that $r_{n}\leq r$ for $n$ large enough. Let us endow $\mathcal{M}$ with the standard Riemannian structure, where a local metric is defined on every tangent space just by restricting to it the standard inner product on ${\mathbb{R}}^{d}$ . Under smoothness assumptions, the Riemannian measure induced by such a metric on the manifold $\mathcal{M}$ agrees with the $d^{\prime}$ -dimensional Hausdorff measure on $\mathcal{M}$ (this is just a particular case of the Area Formula; see (Federer, 1969, 3.2.46)). So we may use Theorem 5.1 in Penrose (1999). As a consequence of that result

[TABLE]

where $\gamma$ denotes the geodesic distance on $\mathcal{M}$ associated with the Riemannian structure. Now, since the Euclidean distance is smaller than the geodesic distance, we have for all $i,j$ , $\|X_{j}-X_{i}\|\leq\gamma(X_{i},X_{j})$ and $\min_{j}\gamma(X_{i},X_{j})=\gamma(X_{i},X_{i^{\prime}})\geq\|X_{i}-X_{i^{\prime}}\|\geq\min_{j}\|X_{i}-X_{j}\|$ , and therefore $\max_{i}\min_{j\neq i}\gamma(X_{i},X_{j})\geq\max_{i}\min_{j\neq i}\|X_{i}-X_{j}\|$ . Finally, from (9) we have $\max_{i}\min_{j\neq i}\|X_{j}-X_{i}\|\stackrel{a.s.}{\longrightarrow}0$ , which concludes the proof.

∎

3.2 The case of noisy data: the “parallel” model

The following two theorems are meaningful in at least two ways. On the one hand, if we know the amount of noise ( $R_{1}$ in the notation introduced before), these results can be used to detect whether or not the support $\mathcal{M}$ of the original sample is full dimensional (see (11) and (15)).

On the other hand, in the lower dimensional setting, they give an easy-to-implement way to estimate $R_{1}$ (see (10) and (14)).

Observe that when $\mathring{\mathcal{M}}=\emptyset$ , then $R_{1}=\max_{x\in S}d(x,\partial S)$ . If $\widehat{\partial S}_{n}$ denotes a consistent estimator of $\partial B(\mathcal{M},R_{1})$ , a natural plug-in estimator for $R_{1}$ is $\max_{Y_{i}\in\mathcal{Y}_{n}}d(Y_{i},\widehat{\partial S}_{n})$ .

In Theorem 3 $\widehat{\partial S}_{n}$ is constructed in terms of the set of the centers of the boundary balls, while in Theorem 4 we use the boundary of the $r$ -convex hull. The second theorem is stronger than the first one in several aspects: the parameter choice is easier and the convergence rate is better (and does not depend on the parameter). The price to pay is computational since the corresponding statistic is much more difficult to implement; see Section 6.

Theorem 3.

Let $\mathcal{M}\subset\mathbb{R}^{d}$ be a compact set such that $\text{reach}(\mathcal{M})=R_{0}>0$ . Let $R_{1}$ be a constant with $0<R_{1}<R_{0}$ and let $\mathcal{Y}_{n}=\{Y_{1},\dots,Y_{n}\}$ be an iid sample of a distribution $P_{Y}$ with support $S=B(\mathcal{M},R_{1})$ , absolutely continuous with respect to the Lebesgue measure, whose density $f$ is bounded from below by $f_{0}>0$ . Let $\varepsilon_{n}=c(\log(n)/n)^{1/d}$ , with $c>(6/(f_{0}\omega_{d}))^{1/d}$ , and let us denote $\hat{R}_{n}=\max_{Y_{i}\in\mathcal{Y}_{n}}\min_{j\in I_{bb}}\|Y_{i}-Y_{j}\|$ where $I_{bb}=\{j:\mathcal{B}(Y_{j},\varepsilon_{n})\text{ is a boundary ball}\}$ .

i)

if $\mathring{\mathcal{M}}=\emptyset$ then, with probability one,

[TABLE]

ii)

if $\mathring{\mathcal{M}}\neq\emptyset$ then there exists $C>0$ such that, with probability one

[TABLE]
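The quantities in the statement of Theorem 3 can be sketched as follows. The code assumes that $\omega_{d}$ denotes the Lebesgue measure of the unit ball in $\mathbb{R}^{d}$ (an assumption, since the symbol is defined outside this section) and that the index set $I_{bb}$ has already been obtained by some boundary-ball test and is passed as a boolean mask. The names `epsilon_n`, `r_hat` and the multiplicative `slack` used to satisfy the strict inequality on $c$ are hypothetical.

```python
import math

import numpy as np


def unit_ball_volume(d):
    """Lebesgue measure of the unit ball in R^d (assumed meaning of omega_d)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def epsilon_n(n, d, f0, slack=1.01):
    """epsilon_n = c (log n / n)^(1/d) with c slightly above (6/(f0 omega_d))^(1/d)."""
    c = slack * (6.0 / (f0 * unit_ball_volume(d))) ** (1.0 / d)
    return c * (math.log(n) / n) ** (1.0 / d)


def r_hat(Y, boundary_mask):
    """max over sample points of the distance to the nearest boundary-ball centre."""
    Y = np.asarray(Y, dtype=float)
    centres = Y[np.asarray(boundary_mask, dtype=bool)]
    if len(centres) == 0:
        raise ValueError("no boundary balls were supplied")
    dist = np.linalg.norm(Y[:, None, :] - centres[None, :, :], axis=2)
    return dist.min(axis=1).max()
```

In practice the lower bound $f_{0}$ is unknown, so any value passed to `epsilon_n` is itself a modelling choice.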

Proof.

$i)$

Observe that, since $\mathring{\mathcal{M}}=\emptyset$ , $R_{1}=\max_{x\in S}d(x,\partial S)$ . Then, the proposed estimator $\hat{R}_{n}$ is quite natural: roughly speaking, we may consider that the set of centres of the boundary balls is an estimator of the boundary of $S$ so that the maximum distance from the sample points to these centres is a natural estimator of the parameter $R_{1}$ that measures the “thickness” of $S$ . We will now use Corollary 4.9 in Federer (1959); this result establishes that for $r>0$ , the $r$ -parallel set of a non-empty closed set $A$ fulfills $\mbox{reach}(B(A,r))\geq\mbox{reach}(A)-r$ . Also, $\mbox{reach}\{x:d(x,A)\geq r\}\geq r$ . Then, in our case, for $S=B(\mathcal{M},R_{1})$ , this result yields $\text{reach}(S)\geq R_{0}-R_{1}>0$ and $\text{reach}(\overline{S^{c}})\geq R_{1}$ . By Propositions 1 and 2 in Cuevas, Fraiman and Pateiro-López (2012), $S$ fulfils the inner and outer rolling condition.

Another consequence of the positive reach of $S$ is that it has a Lebesgue null boundary and thus, with probability one for all $i$ , $Y_{i}\in\mathring{S}$ and then, with probability one

[TABLE]

Since $\text{reach}(\overline{S^{c}})>0$ , by Proposition 1 $S$ is standard with respect to $P_{Y}$ for any constant $\delta<f_{0}/3$ (see Definition 4).

Now, we will use Theorem 4 and Proposition 1 in Cuevas and Rodriguez-Casal (2004); according to these results, if $S$ is partially expandable and it is standard with respect to $P_{Y}$ (both conditions are satisfied in our case) we have for large enough $n$ , with probability one,

[TABLE]

for a choice of $\varepsilon_{n}$ as that indicated in the above statement of the theorem.

For all $x\in S$ let us consider $z\in\partial S$ a point such that $\|x-z\|=d(x,\partial S)$ and $t=z+\varepsilon_{n}\eta$ where $\eta=\eta(z)$ is a normal vector to $\partial S$ at $z$ that points outside $S$ ( $\eta$ can be defined according to Definition 4.4 and Theorem 4.8 (12) in Federer (1959)). Notice that the metric projection of $t$ on $S$ is $z$ , thus $d(t,S)=\varepsilon_{n}$ so, according to (12), with probability one $t\notin\hat{S}_{n}(\varepsilon_{n})$ . The point $z$ belongs to $S$ so, by (13), with probability one for $n$ large enough $z\in\hat{S}_{n}(\varepsilon_{n})$ . We thus conclude $[t,z]\cap\partial\hat{S}_{n}(\varepsilon_{n})\neq\emptyset$ , with probability one, for $n$ large enough. Let us then consider $y\in[t,z]\cap\partial\hat{S}_{n}(\varepsilon_{n})$ . As $y\in\partial\hat{S}_{n}(\varepsilon_{n})$ there exists $i\in I_{bb}$ such that $y\in\partial\mathcal{B}(Y_{i},\varepsilon_{n})$ and, as $y\in[t,z]$ , $\|y-z\|\leq\varepsilon_{n}$ ; thus $\|x-Y_{i}\|\leq\|x-z\|+\|z-y\|+\|y-Y_{i}\|\leq d(x,\partial S)+2\varepsilon_{n}$ . To summarize, we have just proved that for all $x\in S$ there exists $i\in I_{bb}$ such that $\|x-Y_{i}\|\leq d(x,\partial S)+2\varepsilon_{n}$ ; thus for all $x\in S$ : $\min_{i\in I_{bb}}\|x-Y_{i}\|\leq d(x,\partial S)+2\varepsilon_{n}$ . To conclude, $\max_{j}\min_{i\in I_{bb}}\|Y_{j}-Y_{i}\|\leq\max_{j}d(Y_{j},\partial S)+2\varepsilon_{n}\leq\max_{x\in S}d(x,\partial S)+2\varepsilon_{n}=R_{1}+2\varepsilon_{n}$ (with probability one for $n$ large enough).

The reverse inequality is easier to prove. Let us consider $x_{0}\in S$ such that $d(x_{0},\partial S)=R_{1}$ ; notice that, by (13) (with probability one for $n$ large enough), there exists $i_{0}$ such that $\|x_{0}-Y_{i_{0}}\|\leq\varepsilon_{n}$ . By the triangle inequality $\mathcal{B}(Y_{i_{0}},R_{1}-\varepsilon_{n})\subset S$ and by (13) we also have $\mathcal{B}(Y_{i_{0}},R_{1}-\varepsilon_{n})\subset\hat{S}_{n}(\varepsilon_{n})$ , thus $\min_{i\in I_{bb}}\{\|Y_{i_{0}}-Y_{i}\|\}\geq R_{1}-2\varepsilon_{n}$ . Then we have proved $\max_{j}\min_{i\in I_{bb}}\{\|Y_{i}-Y_{j}\|\}\geq R_{1}-2\varepsilon_{n}$ . This concludes the proof of (10).

$ii)$

Observe that to prove $i)$ we proved that $|\hat{R}_{n}-\max_{x\in S}d(x,\partial S)|<2\varepsilon_{n}$ . Then, with probability one, for $n$ large enough, $|\hat{R}_{n}-R_{1}|>|c_{1}-R_{1}|/2=C>0$ , where $c_{1}=\max_{x\in S}d(x,\partial S)$ .

∎

Theorem 4.

Let $\mathcal{M}\subset\mathbb{R}^{d}$ be a compact set such that $\text{reach}(\mathcal{M})=R_{0}>0$ . Suppose that the sample $\mathcal{Y}_{n}=\{Y_{1},\ldots,Y_{n}\}$ has a distribution with support $S=B({\mathcal{M}},R_{1})$ for some $R_{1}<R_{0}$ with a density bounded from below by a constant $f_{0}>0$ . Let us denote $\tilde{R}_{n}=\max_{i}d(Y_{i},\partial C_{r}(\mathcal{Y}_{n}))$ where $C_{r}(\mathcal{Y}_{n})$ denotes the $r$ -convex hull of the sample, as defined in (2) for $r\leq\min(R_{1},R_{0}-R_{1})$ .

i)

If $\mathring{\mathcal{M}}=\emptyset$ and for some $d^{\prime}<d$ $\mathcal{M}$ has a finite, strictly positive $d^{\prime}$ -dimensional Minkowski content, then, with probability one,

[TABLE]

ii)

if $\mathring{\mathcal{M}}\neq\emptyset$ , then there exists $C>0$ such that, with probability one

[TABLE]

Proof.

Again, as shown in the proof of Theorem 3, $\mbox{reach}(B({\mathcal{M}},R_{1}))\geq\mbox{reach}({\mathcal{M}})-R_{1}=R_{0}-R_{1}$ ; also $\mbox{reach}(\overline{B({\mathcal{M}},R_{1})^{c}})\geq R_{1}$ . We now use Proposition 1 in Cuevas, Fraiman and Pateiro-López (2012); this result establishes that $\mbox{reach}(S)\geq r$ implies that $S$ is $r$ -convex. According to this result we may conclude that $B({\mathcal{M}},R_{1})$ and $\overline{B({\mathcal{M}},R_{1})^{c}}$ are both $r$ -convex for $r=\min(R_{1},R_{0}-R_{1})>0$ . Note, in addition, that by construction of $S=B({\mathcal{M}},R_{1})$ we have that $\mathring{S_{i}}\neq\emptyset$ for every path-connected component $S_{i}\subset S$ . So, we can use Theorem 3 in Rodríguez-Casal (2007) (which establishes the rates of convergence in the estimation of an $r$ -convex set using the $r$ -convex hull of the sample) to conclude

[TABLE]

Let us now prove that, with probability one, for $n$ large enough,

[TABLE]

Proceeding by contradiction, let $x_{n}\in B\big{(}\mathcal{M},R_{1}-d_{H}(\partial C_{r}(\mathcal{Y}_{n}),\partial S)\big{)}$ such that $x_{n}\notin C_{r}(\mathcal{Y}_{n})$ , let $y_{n}$ be the projection of $x_{n}$ onto $\mathcal{M}$ . It is easy to see that, for $n$ large enough, with probability one, $\mathcal{M}\subset C_{r}(\mathcal{Y}_{n})$ then $y_{n}\in C_{r}(\mathcal{Y}_{n})$ . Observe that, from the definition of parallel set,

[TABLE]

then, there exists $z_{n}\in\partial C_{r}(\mathcal{Y}_{n})\cap(x_{n},y_{n})$ , $(x_{n},y_{n})$ being the open segment joining $x_{n}$ and $y_{n}$ , but then by (18), $d(z_{n},\partial S)>d_{H}(\partial C_{r}(\mathcal{Y}_{n}),\partial S)$ which is a contradiction; this concludes the proof of (17).

Now we can prove $i)$ . Suppose that $\mathring{\mathcal{M}}=\emptyset$ . Then $R_{1}=\max_{x\in S}d(x,\partial S)=\max_{x\in\mathcal{M}}d(x,\partial S)=d_{H}(\mathcal{M},\partial S)$ . Also, since $C_{r}({\mathcal{Y}}_{n})\subset S$ , we have

[TABLE]

For every observation $Y_{i}$ let $m_{i}$ denote its projection on $\mathcal{M}$ ; by (17) we have $d(m_{i},\partial C_{r}(\mathcal{Y}_{n}))\geq R_{1}-d_{H}(\partial C_{r}(\mathcal{Y}_{n}),\partial S)$ so that, by the triangle inequality,

$d(m_{i},Y_{i})+d(Y_{i},\partial C_{r}(\mathcal{Y}_{n}))\geq R_{1}-d_{H}(\partial C_{r}(\mathcal{Y}_{n}),\partial S)$ . Thus

[TABLE]

We now analyze the order of the last term in (20). From the assumption of finiteness of the Minkowski content of $\mathcal{M}$ , given a constant $A>0$ there exists a constant $c_{\mathcal{M}}>0$ such that for $n$ large enough,

[TABLE]

Thus,

[TABLE]

If we take $A>1/(f_{0}c_{\mathcal{M}})$ we obtain, from Borel-Cantelli lemma,

[TABLE]

Finally, (14) is a direct consequence of (16), (19), (20) and (21).

The proof of $ii)$ is obtained as in Theorem 3 part $ii)$ . ∎

Remark 2.

The assumption imposed on $\mathcal{M}$ in part (i) can be seen as a statement of $d^{\prime}$ -dimensionality. For example, if we assume that $\mathcal{M}$ is rectifiable then, from Theorem 3.2.39 in Federer (1969), the $d^{\prime}$ -dimensional Hausdorff measure of $\mathcal{M}$ , ${\mathcal{H}}^{d^{\prime}}(\mathcal{M})$ , coincides with the corresponding Minkowski content. Hence $0<{\mathcal{H}}^{d^{\prime}}(\mathcal{M})<\infty$ and, according to expression (4), this entails $\mbox{dim}_{H}(\mathcal{M})=d^{\prime}$ .

3.3 An index of closeness to lower dimensionality

According to Theorem 3 in the case $R_{1}=0$ , the value $2\hat{R}_{n}/\widehat{{\rm diam}}(\mathcal{M})$ (where $\widehat{{\rm diam}}(\mathcal{M})=\max_{i\neq j}\|X_{i}-X_{j}\|$ ) can be seen as an index of departure from low-dimensionality. Observe that if $\mathcal{M}=\overline{\mathring{\mathcal{M}}}$ we get $2\hat{R}_{n}/\widehat{{\rm diam}}(\mathcal{M})\to 1$ , a.s. and if $\mathcal{M}$ has empty interior, $2\hat{R}_{n}/\widehat{{\rm diam}}(\mathcal{M})\to 0$ a.s.
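As an illustration, the index can be computed from a sample once a value of $\hat{R}_{n}$ is available. The sketch below assumes that value is supplied by the caller (for instance from a boundary-ball computation) and estimates the diameter by brute force; the function name `low_dim_index` is hypothetical.

```python
import numpy as np


def low_dim_index(X, r_hat_value):
    """Index 2 * R_hat / diam_hat, with diam_hat = max_{i != j} ||X_i - X_j||."""
    X = np.asarray(X, dtype=float)
    # Brute-force sample diameter from the pairwise distance matrix.
    diam = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2).max()
    if diam == 0.0:
        raise ValueError("the sample diameter is zero")
    return 2.0 * r_hat_value / diam
```

Values near zero point towards a set with empty interior, in line with the limits stated above.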

4 A method to partially denoise the sample data

There are several situations in which we may speak of “noise in the data”: we could first mention the “outlier model” in which the noise is given by a certain amount of outlying observations, far away from the central core of the data. Also, we might have a situation in which every observation is perturbed with a small amount of noise. We will present in this section a denoising proposal, dealing with the latter case and related to the models considered in the previous sections. Before presenting this proposal, let us briefly comment on some references that, from different points of view, deal with the problem of noisy samples in geometric/statistical contexts.

Sometimes the term “denoise” is replaced with “declutter” in the literature on stochastic geometry. A general “declutter algorithm”, depending on a single parameter, has been recently proposed in Buchet et al. (2015). This paper also includes a short, interesting overview of the literature on the topic. In particular, the authors mention two main general declutter methodologies, namely procedures based on deconvolution (where the distribution generating the noise appears convolved with the “true” underlying model), see Caillerie et al. (2013), and those based on thresholding, Ozertem and Erdogmus (2011), where the data are “cleaned” using an auxiliary density estimator.

Another interesting approach to the denoising idea, different from that followed in this paper, is given in Chazal et al. (2011). These authors tackle the identification of some geometric or topological features from samples that could include outliers. Again, they use the $r$ -offsets (that is, the $r$ -parallel sets of the sample data and the target set $S$ ) as a fundamental tool. Such $r$ -offsets are represented in terms of sublevel sets of appropriate functions, defined as a sort of distance between a point and a set. The main contribution in the mentioned paper is to robustify (against outliers) such distance functions, and the corresponding sublevel sets, by replacing them with a new function that can be seen as a distance between a point and a probability distribution. A recent related approach, based on the use of kernel density estimates, can be found in Phillips et al. (2015).

The denoising idea is also akin to that of identifying (from a sample of points on the set $S$ ) the “central part” of the set, often called “skeleton” or “medial axis” of $S$ . See Cuevas et al. (2014) and references therein. In fact, the possible idea of defining a denoising procedure in terms of distance to the medial axis could be seen as a sort of “dual” version of the method proposed in the present paper, based on the distance to the estimated boundary.

Closely related ideas, ultimately relying on the notion of medial axis, are considered in Dey et al. (2015), where a method for “sparsification” of a sample is proposed. The aim is also (as in the denoising case) to retain a subset of the original sample, which is assumed to be drawn on a manifold. In the authors’ words: “We sparsify the data so that the resulting set is locally uniform and is still good for homology inference”. The proposed method is based on the “lean feature size” distance, which is intermediate between the well-known “local feature size” (defined in terms of the medial axis) and the “weak local feature size”.

4.1 The algorithm

Let $\mathcal{M}\subset\mathbb{R}^{d}$ be a compact set with ${\rm reach}(\mathcal{M})=R_{0}>0$ . Let $\mathcal{Y}_{n}=\{Y_{1},\dots,Y_{n}\}$ be an iid sample of a random variable $Y$ , with absolutely continuous distribution whose support is the parallel set $S=B(\mathcal{M},R_{1})$ for some $0<R_{1}<R_{0}$ . We now propose an algorithm to get from $\mathcal{Y}_{n}$ a “partially de-noised” sample of points ${\mathcal{Z}}_{m}$ that allows us to estimate the target set $\mathcal{M}$ , as established in Theorem 5.

The procedure works as follows:

1. Take suitable auxiliary estimators for $S$ and $R_{1}$ . Let $\hat{S}_{n}$ be an estimator of $S$ (based on $\mathcal{Y}_{n}$ ) such that $d_{H}(\partial\hat{S}_{n},\partial S)<a_{n}$ eventually a.s., for some $a_{n}\rightarrow 0$ . Let $\hat{R}_{n}$ be an estimator of $R_{1}$ such that $|\hat{R}_{n}-R_{1}|\leq e_{n}$ eventually a.s. for some $e_{n}\rightarrow 0$ .

2. Select a $\lambda$ -subsample far from the estimated boundary of $S$ . Take $\lambda\in(0,1)$ and define $\mathcal{Y}^{\lambda}_{m}=\{Y^{\lambda}_{1},\dots,Y^{\lambda}_{m}\}\subset\mathcal{Y}_{n}$ where $Y^{\lambda}_{i}\in\mathcal{Y}_{m}^{\lambda}$ if and only if $d(Y^{\lambda}_{i},\partial\hat{S}_{n})>\lambda\hat{R}_{n}$ .

3. The projection + translation stage. For every $Y_{i}^{\lambda}\in\mathcal{Y}_{m}^{\lambda}$ , we define $\mathcal{Z}_{m}=\{Z_{1},\dots,Z_{m}\}$ as follows,

[TABLE]

where $\pi_{\partial\hat{S}_{n}}(Y_{i}^{\lambda})$ denotes the metric projection of $Y_{i}^{\lambda}$ on $\partial\hat{S}_{n}$ .
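The three stages can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not the authors' code: the estimated boundary $\partial\hat{S}_{n}$ is represented by a finite cloud of points `boundary_pts` (for example the centres of the boundary balls), so the metric projection becomes a nearest-neighbour search, and $Z_{i}$ is taken as $\pi_{\partial\hat{S}_{n}}(Y_{i}^{\lambda})+\hat{R}_{n}\hat{\eta}_{i}$ with $\hat{\eta}_{i}$ the unit vector from the projection to $Y_{i}^{\lambda}$ , which is the expression used in the proof of Theorem 5. The function name `denoise` is hypothetical.

```python
import numpy as np


def denoise(Y, boundary_pts, r_hat_value, lam):
    """Select points far from the estimated boundary and push them inwards.

    Y            : (n, d) noisy sample.
    boundary_pts : (k, d) point cloud standing in for the estimated boundary.
    r_hat_value  : estimate of the noise level R_1 (assumed positive).
    lam          : selection parameter in (0, 1).
    """
    Y = np.asarray(Y, dtype=float)
    B = np.asarray(boundary_pts, dtype=float)
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie in (0, 1)")
    dist = np.linalg.norm(Y[:, None, :] - B[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    l_hat = dist[np.arange(len(Y)), nearest]  # distance to the estimated boundary
    keep = l_hat > lam * r_hat_value          # stage 2: lambda-subsample
    proj = B[nearest[keep]]                   # approximate metric projection
    eta_hat = (Y[keep] - proj) / l_hat[keep, None]
    return proj + r_hat_value * eta_hat       # stage 3: projection + translation
```

With a band of half-width 1 around a segment, a kept point at height 0.5 is mapped back onto the segment, while a point closer than $\lambda\hat{R}_{n}$ to the boundary is discarded.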

4.2 Asymptotics

The following result shows that the above de-noising procedure allows us to asymptotically recover the “inner set” $\mathcal{M}$ .

Theorem 5.

Let $\mathcal{M}\subset\mathbb{R}^{d}$ be a compact set with ${\rm reach}(\mathcal{M})=R_{0}>0$ . Let $\mathcal{Y}_{n}=\{Y_{1},\dots,Y_{n}\}$ be an iid sample of $Y$ , with support $S=B(\mathcal{M},R_{1})$ for some $0<R_{1}<R_{0}$ , and distribution $P_{Y}$ , absolutely continuous with respect to the Lebesgue measure, whose density $f$ is bounded from below by $f_{0}>0$ . Let $a_{n}$ and $e_{n}$ be, respectively, the convergence rates in the estimation of $\partial S$ and $R_{1}$ , as defined in the algorithm of Subsection 4.1. Then, there exists $b_{n}=\mathcal{O}\left(\max(a_{n}^{1/3},e_{n},\varepsilon_{n})\right)$ such that, with probability one, for $n$ large enough,

[TABLE]

where $\varepsilon_{n}=c(\log(n)/n)^{1/d}$ with $c>(6/(f_{0}\omega_{d}))^{1/d}$ and $\mathcal{Z}_{m}$ denotes the denoised sample defined in the algorithm.

Proof.

First let us prove that $d_{H}(\mathcal{Y}_{n},S)\leq\varepsilon_{n}$ eventually a.s. To do that, we will use Theorem 4 in Cuevas and Rodriguez-Casal (2004) as was done in Theorem 3. By Corollary 4.9 in Federer (1959), $\text{reach}(\overline{S^{c}})>0$ and then, by Proposition 1, $S$ is standard. Again by Corollary 4.9 in Federer (1959), $\text{reach}(\partial S)>0$ , which entails, by Proposition 2, that $S$ fulfils the outside rolling condition. Using Theorem 4 and Proposition 1 in Cuevas and Rodriguez-Casal (2004) we conclude that $d_{H}(\mathcal{Y}_{n},S)\leq\varepsilon_{n}$ eventually a.s.

Let us fix $Y_{i}^{\lambda}\in\mathcal{Y}_{m}^{\lambda}$ .

Let us denote $l=\|Y_{i}^{\lambda}-\pi_{\partial S}(Y_{i}^{\lambda})\|$ and $\eta_{i}=(Y_{i}^{\lambda}-\pi_{\partial S}(Y_{i}^{\lambda}))/l$ , and let us introduce two estimators $\hat{l}=\|Y_{i}^{\lambda}-\pi_{\partial\hat{S}_{n}}(Y_{i}^{\lambda})\|$ and $\hat{\eta}_{i}=(Y_{i}^{\lambda}-\pi_{\partial\hat{S}_{n}}(Y_{i}^{\lambda}))/\hat{l}$ . With this notation $Z_{i}=\pi_{\partial\hat{S}_{n}}(Y_{i}^{\lambda})+\hat{R}_{n}\hat{\eta}_{i}$ . Recall that since ${\rm reach}(\mathcal{M})>R_{1}$ we have (by Corollary 4.9 in Federer (1959)) that $\pi_{\mathcal{M}}(Y_{i}^{\lambda})=\pi_{\partial S}(Y_{i}^{\lambda})+R_{1}\eta_{i}$ .

For all $Y_{i}^{\lambda}$ there exists a point $x\in\partial\hat{S}_{n}$ with $\|x-\pi_{\partial S}(Y_{i}^{\lambda})\|\leq a_{n}$ so that, by the triangle inequality, $d(Y_{i}^{\lambda},\partial\hat{S}_{n})\leq l+a_{n}$ , that is,

[TABLE]

Now let us prove that

[TABLE]

Suppose by contradiction that $\pi_{\partial\hat{S}_{n}}(Y_{i}^{\lambda})\in\mathcal{B}(Y_{i}^{\lambda},l-a_{n})$ . Since $d_{H}(\partial\hat{S}_{n},\partial S)<a_{n}$ there exists $t\in\partial S$ such that $\|t-\pi_{\partial\hat{S}_{n}}(Y_{i}^{\lambda})\|<a_{n}$ , but then $l=d(Y_{i}^{\lambda},\partial S)\leq\|Y_{i}^{\lambda}-\pi_{\partial\hat{S}_{n}}(Y_{i}^{\lambda})\|+\|\pi_{\partial\hat{S}_{n}}(Y_{i}^{\lambda})-t\|<l$ . That concludes the proof of (24).

By (23) and (24) we have:

[TABLE]

In the same way it can be proved that

[TABLE]

Let us prove that there exists $C_{0}>0$ such that

[TABLE]

First consider the case $0\leq R_{1}-l\leq a_{n}^{1/3}$ , which implies that $\|Y_{i}^{\lambda}-\pi_{\mathcal{M}}(Y_{i}^{\lambda})\|\leq a_{n}^{1/3}$ . Notice that, by (25), $\|Y_{i}^{\lambda}-Z_{i}\|=|\hat{R}_{n}-\hat{l}|\leq a_{n}^{1/3}+e_{n}+a_{n}$ , finally we get

[TABLE]

Now we consider the case $R_{1}-l\geq a_{n}^{1/3}$ ; recall that by (23) and (26) we have

[TABLE]

Figure 3 represents the case in which $\|\pi_{\partial\hat{S}_{n}}(Y_{i}^{\lambda})-\pi_{\partial S}(Y_{i}^{\lambda})\|$ takes its largest possible value.

To find an upper bound for such value, let us first note that the points $\pi_{\partial\hat{S}_{n}}(Y_{i}^{\lambda})$ , $Y_{i}^{\lambda}$ and $\pi_{\partial\hat{S}_{n}}(Y_{i}^{\lambda})+R_{1}\hat{\eta}_{i}$ are aligned, and so are the points $\pi_{\partial S}(Y_{i}^{\lambda})$ , $Y_{i}^{\lambda}$ and $\pi_{\mathcal{M}}(Y_{i}^{\lambda})$ . So all of them are in the same plane $\Pi$ . Let us now apply a translation $T$ in order to get $T(\pi_{\partial S}(Y_{i}^{\lambda}))=0$ . Let us consider in $\Pi$ a coordinate system $(x,y)$ such that $\pi_{\mathcal{M}}(Y_{i}^{\lambda})=(0,-R_{1})$ .

Let $(x_{1},y_{1})$ be the coordinates of the point $\pi_{\partial\hat{S}_{n}}(Y_{i}^{\lambda})$ . From (29) we get

[TABLE]

If we multiply (31) by $-l$ , we get $-l(x_{1}^{2}+y_{1}^{2})-2y_{1}lR_{1}\leq-la_{n}^{2}+2a_{n}lR_{1}$ and if we multiply (30) by $R_{1}$ we get $R_{1}(x_{1}^{2}+y_{1}^{2})+2y_{1}lR_{1}\leq 2R_{1}a_{n}l+a_{n}^{2}R_{1}$ . Then, if we sum these two inequalities we get,

[TABLE]

Notice that $Z_{i}\in\Pi$ , let us denote $(x,y)$ the coordinates of $Z_{i}$ in $\Pi$ , then

[TABLE]

and

[TABLE]

Since the coordinates of $\pi_{\mathcal{M}}(Y^{\lambda}_{i})$ are $(0,-R_{1})$ we get that

[TABLE]

Observe that $|l-\hat{l}|\leq a_{n}$ and $|R_{1}-\hat{R}_{n}|\leq e_{n}$ eventually almost surely. We can bound $|\frac{\hat{l}-\hat{R}_{n}}{\hat{l}}|\leq 2$ and $\hat{l}\geq\lambda R_{1}/2$ , then

[TABLE]

Finally by equations (32), (33) and (34), if $R_{1}-l\geq a_{n}^{1/3}$ (note that this is used in the proof of (32)), there exists $C_{0}$ such that

[TABLE]

where (see (32)) we are using $y_{1}=\mathcal{O}(a_{n}^{1/3})$ here. That concludes the proof of (27).

Let us finally prove that $\mathcal{M}\subset B(\mathcal{Z}_{m},a_{n}+e_{n}+2\varepsilon_{n})$ eventually, a.s. As indicated at the beginning of the proof, we have $d_{H}(\mathcal{Y}_{n},S)\leq\varepsilon_{n}$ eventually a.s., thus for all $x\in\mathcal{M}$ , there exists $Y_{i}\in\mathcal{Y}_{n}$ such that $\|x-Y_{i}\|\leq\varepsilon_{n}$ . For $n$ large enough we have $Y_{i}\in\mathcal{Y}_{m}^{\lambda}$ . Following the same ideas used to prove (28) we obtain $\|Z_{i}-Y_{i}\|\leq\varepsilon_{n}+a_{n}+e_{n}$ . By the triangle inequality we get

[TABLE]

Combining (28), (35) and (36) we obtain,

[TABLE]

∎

Remark 3.

Note that, when $\mathring{\mathcal{M}}=\emptyset$ , the result simplifies since, according to Theorem 3, we can take $e_{n}=2\varepsilon_{n}$ and, according to Cuevas and Rodriguez-Casal (2004) (Prop. 1 and Th. 4), $a_{n}=\varepsilon_{n}$ . Therefore, in this case $b_{n}=a_{n}^{1/3}$ .
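The reason why the term $a_{n}^{1/3}$ dominates in this remark can be spelled out in one line. With $a_{n}=\varepsilon_{n}$ and $e_{n}=2\varepsilon_{n}$ , the bound of Theorem 5 reads $\max(\varepsilon_{n}^{1/3},2\varepsilon_{n},\varepsilon_{n})$ . Since $\varepsilon_{n}\to 0$ , we eventually have $\varepsilon_{n}\leq 2^{-3/2}$ , which is equivalent to $2\varepsilon_{n}\leq\varepsilon_{n}^{1/3}$ , so that $\max(\varepsilon_{n}^{1/3},2\varepsilon_{n},\varepsilon_{n})=\varepsilon_{n}^{1/3}$ for $n$ large enough.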

The two following corollaries give the exact convergence rate for the denoising process introduced before, using the centers of the boundary balls (Corollary 1), and the boundary of the $r$ -convex hull (Corollary 2), as estimators of the boundary of the support.

Corollary 1.

Let $\mathcal{M}\subset\mathbb{R}^{d}$ be a compact set such that $\text{reach}(\mathcal{M})=R_{0}>0$ . Let $\mathcal{Y}_{n}=\{Y_{1},\dots,Y_{n}\}$ be an iid sample of a distribution $P_{Y}$ with support $B(\mathcal{M},R_{1})$ for some $0<R_{1}<R_{0}$ . Assume that $P_{Y}$ is absolutely continuous with respect to the Lebesgue measure and the density $f$ is bounded from below by a constant $f_{0}>0$ . Let $\varepsilon_{n}=c(\log(n)/n)^{1/d}$ with $c>(6/(f_{0}\omega_{d}))^{1/d}$ .

Given $\lambda\in(0,1)$ , let $\mathcal{Z}_{n}$ be the points obtained after the denoising process using $\hat{R}_{n}$ to estimate $R_{1}$ and $\{Y_{i},i\in I_{bb}\}$ as an estimator of $\partial S$ where $I_{bb}=\{j:\mathcal{B}(Y_{j},\varepsilon_{n})\text{ is a boundary ball}\}$ . Then,

[TABLE]

Using the assumption of $r$ -convexity for $\mathcal{M}$ (see Definitions 2 and 3 and the subsequent comments) in the construction of the set estimator, we can replace $\hat{R}_{n}$ with $\tilde{R}_{n}$ (see Theorem 4). Then, at the cost of some additional complexity in the numerical implementation, a faster convergence rate can be obtained. This is made explicit in the following result.

Corollary 2.

Let $\mathcal{M}\subset\mathbb{R}^{d}$ be a compact $d^{\prime}$-dimensional set (in the sense of Theorem 4, i) such that $\text{reach}(\mathcal{M})=R_{0}>0$. Let $\mathcal{Y}_{n}=\{Y_{1},\dots,Y_{n}\}$ be an iid sample of a distribution $P_{Y}$ with support $B(\mathcal{M},R_{1})$ for some $0<R_{1}<R_{0}$. Assume that $P_{Y}$ is absolutely continuous with respect to the Lebesgue measure and that its density $f$ is bounded from below by a constant $f_{0}>0$.

For a given $\lambda\in(0,1)$ , let $\mathcal{Z}_{n}$ be the set of the points obtained after the denoising process, based on the estimator $\partial C_{r}(\mathcal{Y}_{n})$ of $\partial S$ (for some $r$ with $0<r<\min(R_{0}-R_{1},R_{1})$ ) and the estimator $\tilde{R}_{n}$ of $R_{1}$ .

Then,

[TABLE]

5 Estimation of lower-dimensional measures

5.1 Noiseless model

In this section, we go back to the noiseless model, that is, we assume that the sample points $X_{1},\ldots,X_{n}$ are drawn according to a distribution whose support is $\mathcal{M}$ . The target is to estimate the $d^{\prime}$ -dimensional Minkowski content of $\mathcal{M}$ , as given by

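$$L_{0}(\mathcal{M})=\lim_{\epsilon\to 0}\frac{\mu_{d}(B(\mathcal{M},\epsilon))}{\omega_{d-d^{\prime}}\,\epsilon^{d-d^{\prime}}}.\qquad(37)$$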

This is just one of the possible ways (alongside the Hausdorff measure, among others) to measure lower-dimensional sets; see Mattila (1995) for background.

In recent years, the problem of estimating the $d^{\prime}$-dimensional measures of a compact set from a random sample has received some attention in the literature. The simplest situation corresponds to the full-dimensional case $d^{\prime}=d$. Any estimator $\mathcal{M}_{n}$ of $\mathcal{M}$ consistent with respect to the distance in measure, that is $\mu_{d}(\mathcal{M}_{n}\Delta\mathcal{M})\to 0$ (in prob. or a.s., where $\Delta$ stands for the symmetric difference), will provide a consistent estimator for $\mu_{d}(\mathcal{M})$. In fact, as a consequence of Th. 1 in Devroye and Wise (1980) (recall that $S$ is compact here) this will always be the case (in probability) when $\mathcal{M}_{n}$ is the offset estimator (6), provided that $\mu_{d}$ is absolutely continuous (on $\mathcal{M}$) with respect to $P_{X}$, together with $r_{n}\to 0$ and $nr_{n}^{d}\to\infty$.

Other more specific estimators of $\mu_{d}(\mathcal{M})$ can be obtained by imposing some shape assumptions on $\mathcal{M}$, such as convexity or $r$-convexity, which are incorporated into the estimator $\mathcal{M}_{n}$; see Arias-Castro et al. (2016), Baldin and Reiss (2016), Pardon (2011).

Regarding the estimation of lower-dimensional measures, with $d^{\prime}<d$ , the available literature mostly concerns the problem of estimating $L_{0}(\mathcal{M})$ , $\mathcal{M}$ being the boundary of some compact support $S$ . The sample model is also a bit different, as it is assumed that we have sample points inside and outside $S$ . Here, typically, $d^{\prime}=d-1$ ; see Cuevas et al. (2007), Cuevas et al. (2013), Jiménez and Yukich (2011).

Again, in the case $\mathcal{M}=\partial S$ with $d=2$, under the extra assumption of $r$-convexity for $S$, the consistency of the plug-in estimator $L_{0}(\partial C_{r}({\mathcal{X}}_{n}))$ of $L_{0}(\partial S)$ is proved in Cuevas, Fraiman and Pateiro-López (2012) under the usual inside model (points taken on $S$). Finally, in Berrendero et al. (2014), assuming an outside model (points drawn in $B(S,R)\setminus S$), estimators of $\mu_{d}(S)$ and $L_{0}(\partial S)$ are proposed, under the condition of polynomial volume for $S$.

From the perspective of the above references, our contribution here (Th. 6 below) could be seen as a sort of lower-dimensional extension of the mentioned results of type $\mu_{d}(\mathcal{M}_{n})\to\mu_{d}(\mathcal{M})$ regarding volume estimation. But, obviously, in this case the Lebesgue measure $\mu_{d}$ must be replaced with a lower-dimensional counterpart, such as the Minkowski content (37). We will also need the following lower-dimensional version of the standardness property given in Definition 3.

Definition 7.

A Borel probability measure defined on a $d^{\prime}$ -dimensional set $\mathcal{M}\subset\mathbb{R}^{d}$ (considered with the topology induced by $\mathbb{R}^{d}$ ) is said to be standard with respect to the $d^{\prime}$ -dimensional Lebesgue measure $\mu_{d^{\prime}}$ if there exist $\lambda$ and $\delta$ such that, for all $x\in\mathcal{M}$ ,

[TABLE]

Remark 4.

Observe that, by Lemma 5.3 in Niyogi, Smale and Weinberger (2008), this condition is fulfilled if $P_{X}$ has a density $f$ bounded from below and $\mathcal{M}$ is a manifold with positive condition number (also known as positive reach). Standardness of the distribution has also been used in Cuevas and Rodriguez-Casal (2004), Chazal et al. (2015), Aamari and Levrard (2015).

Theorem 6.

Let $\mathcal{X}_{n}=\{X_{1},\dots,X_{n}\}$ be an iid sample drawn according to a distribution $P_{X}$ on a set $\mathcal{M}\subset\mathbb{R}^{d}$. Let us assume that the distribution $P_{X}$ is standard with respect to the $d^{\prime}$-dimensional Lebesgue measure (see Definition 7) and that the $d^{\prime}$-dimensional Minkowski content $L_{0}(\mathcal{M})$ of $\mathcal{M}$, given by (37), exists and is finite. Let us take $r_{n}$ such that $r_{n}\rightarrow 0$ and $(\log(n)/n)^{1/d^{\prime}}=o(r_{n})$. Then

(i)

[TABLE]

(ii)

If $\text{reach}(\mathcal{M})=R_{0}>0$, then

[TABLE]

where $\beta_{n}=\mathcal{O}\big((\log(n)/n)^{1/d^{\prime}}\big)$.

Proof.

(i) First we will see that, following the same ideas as in Theorem 3 in Cuevas and Rodriguez-Casal (2004) it can be readily proved that, with probability one, for $n$ large enough,

[TABLE]

for some large enough constant $C>0$ . In order to see (39), let us consider $M_{\Delta}$ a minimal covering of $\mathcal{M}$ , with balls of radius $\Delta$ centred in $N_{\Delta}$ points belonging to $\mathcal{M}$ . Let us prove that $N_{\Delta}=\mathcal{O}(\Delta^{-d^{\prime}})$ . Indeed, since $M_{\Delta}$ is a minimal covering it is clear that $\mu_{d}(B(\mathcal{M},\Delta))\geq N_{\Delta}\omega_{d}(\Delta/2)^{d}$ , and then

[TABLE]

$c_{1}$ being a positive constant. Since there exists $L_{0}(\mathcal{M})$ it follows that $N_{\Delta}=\mathcal{O}(\Delta^{-d^{\prime}})$ . Then the proof of (39) follows easily from the standardness of $P_{X}$ and $N_{\Delta}=\mathcal{O}(\Delta^{-d^{\prime}})$ , so we will omit it.
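The covering bound can be spelled out as a worked rearrangement. This is only a sketch: it takes the Minkowski content (37) in its usual normalised form $\lim_{\epsilon\to 0}\mu_{d}(B(\mathcal{M},\epsilon))/(\omega_{d-d^{\prime}}\epsilon^{d-d^{\prime}})$, and the grouping of constants is ours.

$$N_{\Delta}\leq\frac{\mu_{d}(B(\mathcal{M},\Delta))}{\omega_{d}(\Delta/2)^{d}}=\frac{2^{d}\,\omega_{d-d^{\prime}}}{\omega_{d}}\cdot\frac{\mu_{d}(B(\mathcal{M},\Delta))}{\omega_{d-d^{\prime}}\,\Delta^{d-d^{\prime}}}\cdot\Delta^{-d^{\prime}}.$$

The middle factor converges to $L_{0}(\mathcal{M})<\infty$ as $\Delta\to 0$, so it is bounded for small $\Delta$, which gives $N_{\Delta}=\mathcal{O}(\Delta^{-d^{\prime}})$.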

Now, in order to prove (38), let us first prove that, if we take $\alpha_{n}=1-C\beta_{n}/r_{n}$ ,

[TABLE]

To prove this, consider $x_{n}\in B(\mathcal{M},\alpha_{n}r_{n})$ , then there exists $t_{n}\in\mathcal{M}$ such that $x_{n}\in\mathcal{B}(t_{n},\alpha_{n}r_{n})$ . Since $d_{H}(\mathcal{X}_{n},\mathcal{M})\leq C\beta_{n}$ there exists $y_{n}\in\mathcal{B}(t_{n},C\beta_{n})$ , $y_{n}\in\mathcal{X}_{n}$ . It is enough to prove that $x_{n}\in\mathcal{B}(y_{n},r_{n})$ . But this follows from the fact that, eventually a.s.,

[TABLE]
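As a worked check of this step, using only $x_{n}\in\mathcal{B}(t_{n},\alpha_{n}r_{n})$, $y_{n}\in\mathcal{B}(t_{n},C\beta_{n})$ and the choice $\alpha_{n}=1-C\beta_{n}/r_{n}$, the triangle inequality gives

$$\|x_{n}-y_{n}\|\leq\|x_{n}-t_{n}\|+\|t_{n}-y_{n}\|\leq\alpha_{n}r_{n}+C\beta_{n}=\Big(1-\frac{C\beta_{n}}{r_{n}}\Big)r_{n}+C\beta_{n}=r_{n}.$$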

Then, from (40)

[TABLE]

Since there exists $L_{0}(\mathcal{M})$ , the right hand side of (41) goes to zero. To prove that the left hand side of (41) goes to zero, let us observe that, as $\alpha_{n}=1-C\beta_{n}/r_{n}$ , and $\alpha_{n}^{d-d^{\prime}}=1-\mathcal{O}(\beta_{n}/r_{n})$ , then

[TABLE]

since $\alpha_{n}\rightarrow 1$ and $\beta_{n}/r_{n}\rightarrow 0$ we get

[TABLE]

(ii) The assumption $\text{reach}(\mathcal{M})=R_{0}>0$ allows us to ensure that $\mathcal{M}$ has a polynomial volume in the interval $[0,R_{0})$. This means that, for $r<R_{0}$, $\mu_{d}\big(B(\mathcal{M},r)\big)=P_{d}(r)$ where $P_{d}(r)$ is a polynomial of degree at most $d$; this is a classical result due to Federer (1959, Th. 5.6). Since we assume that the $d^{\prime}$-Minkowski content $L_{0}(\mathcal{M})$ is finite, this polynomial volume condition entails that the coefficient of the term of degree $d-d^{\prime}$ is $\omega_{d-d^{\prime}}L_{0}(\mathcal{M})$. Then,

[TABLE]

for some constant $A(\mathcal{M})$ . Now the proof follows from (41) and (42). ∎
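The role of the polynomial volume condition in part (ii) can be illustrated with a short worked computation. It is a sketch under assumptions: we write $P_{d}(r)=\sum_{k=0}^{d}c_{k}r^{k}$ (the coefficients $c_{k}$ are our notation), and we use that a finite limit of $\mu_{d}(B(\mathcal{M},r))/(\omega_{d-d^{\prime}}r^{d-d^{\prime}})$ forces $c_{k}=0$ for $k<d-d^{\prime}$ and $c_{d-d^{\prime}}=\omega_{d-d^{\prime}}L_{0}(\mathcal{M})$. Then, for $0<r<\min(1,R_{0})$,

$$\Big|\frac{\mu_{d}(B(\mathcal{M},r))}{\omega_{d-d^{\prime}}\,r^{d-d^{\prime}}}-L_{0}(\mathcal{M})\Big|=\Big|\sum_{k=d-d^{\prime}+1}^{d}\frac{c_{k}}{\omega_{d-d^{\prime}}}\,r^{k-(d-d^{\prime})}\Big|\leq\Big(\sum_{k=d-d^{\prime}+1}^{d}\frac{|c_{k}|}{\omega_{d-d^{\prime}}}\Big)\,r,$$

so the deterministic approximation error is at most linear in $r$, with a constant depending only on $\mathcal{M}$.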

Remark 5.

In the case of sets with positive reach, part (ii) suggests taking $r_{n}^{2}=\max_{i}\min_{j\neq i}\|X_{i}-X_{j}\|$, since we know by Theorem 1 in Penrose (1999) that $r_{n}^{2}=\mathcal{O}\big((\log(n)/n)^{1/d^{\prime}}\big)$, which gives the optimal convergence rate.

5.2 Noisy Model

The estimation of the Minkowski content in the noisy model has been tackled in Berrendero et al. (2014), where the random sample is assumed to have uniform distribution in the parallel set $U$. In this section we will see that, even if the sample is not uniformly distributed on $B(\mathcal{M},R_{1})$ for some $0<R_{1}<R_{0}={\rm reach}(\mathcal{M})$, it is still possible to estimate $L_{0}(\mathcal{M})$ by first applying the de-noising algorithm introduced in Section 4. Following the notation in Section 4, let $\mathcal{Y}_{n}$ be an iid sample of a random variable $Y$ with support $B(\mathcal{M},R_{1})$, and let us denote by $\mathcal{Z}_{m}$ the de-noised sample defined by (22). The estimator is defined as in (38) but replacing $\mathcal{X}_{n}$ with $\mathcal{Z}_{m}$. Although the subset $\mathcal{Z}_{m}$ is not an iid sample (since the random variables $Z_{i}$ are not independent), the consistency is based on the fact that $\mathcal{Z}_{m}$ converges in Hausdorff distance to $\mathcal{M}$, as we will prove in the following theorem.

Theorem 7.

With the hypotheses and notation of Theorem 5, assume that $\max(a_{n}^{1/3},e_{n},\varepsilon_{n})=o(r_{n})$, where $\varepsilon_{n}=c(\log(n)/n)^{1/d}$ with $c>(6/(f_{0}\omega_{d}))^{1/d}$. Then,

[TABLE]

Proof.

The proof is analogous to the one in Theorem 6. Observe that in Theorem 5 we proved that $d_{H}(\mathcal{Z}_{m},\mathcal{M})\leq b_{n}$, for some $b_{n}=\mathcal{O}(\max(a_{n}^{1/3},e_{n},\varepsilon_{n}))$, so that $b_{n}/r_{n}\rightarrow 0$. As we did in Theorem 6, if we take $\alpha_{n}=1-b_{n}/r_{n}$, then, with probability one,

[TABLE]

then we get

[TABLE]

from where it follows

[TABLE]

Since $\alpha_{n}\rightarrow 1$ and $b_{n}/r_{n}\rightarrow 0$ we get (43). ∎

6 Computational aspects and simulations

We discuss here some theoretical and practical aspects regarding the implementation of the algorithms. We present also some simulations and numerical examples.

6.1 Identifying the boundary balls

The cornerstone of the practical use of Theorem 1 is the effective identification of the boundary balls. The following proposition provides the basis for such identification, in terms of the Voronoi cells of the sample points. Recall that, given a finite set $\{x_{1},\dots,x_{n}\}$, the Voronoi cell associated with the point $x_{i}$ is defined by $\text{Vor}(x_{i})=\{x:d(x,x_{i})\leq d(x,x_{j})\text{ for all }j\neq i\}$.

Proposition 3.

Let $\mathcal{X}_{n}=\{X_{1},\ldots,X_{n}\}$ be an iid sample of points in $\mathbb{R}^{d}$, drawn according to a distribution $P_{X}$, absolutely continuous with respect to the Lebesgue measure. Then, with probability one, for all $i=1,\dots,n$ and all $r>0$, $\sup\{\|z-X_{i}\|,z\in\text{Vor}(X_{i})\}\geq r$ if and only if $\mathcal{B}(X_{i},r)$ is a boundary ball for the Devroye-Wise estimator (6).

Proof.

Let us take $r>0$ and $X_{i}$ such that $\partial\mathcal{B}(X_{i},r)\cap\mbox{Vor}(X_{i})\neq\emptyset$, and let $z$ be a point in this intersection; let us prove that $z\in\partial\hat{S}_{n}(r)$. Observe that, since $z\in\mbox{Vor}(X_{i})$, $d(z,\mathcal{X}_{n}\setminus X_{i})\geq r$ and thus $d(z,\mathcal{X}_{n})=r$. Reasoning by contradiction, suppose that $z\in\mathring{\hat{S}}_{n}(r)$; then, with probability one, there exists $j_{0}$ such that $z\in\mathring{\mathcal{B}}(X_{j_{0}},r)$ and so $\|z-X_{j_{0}}\|<r$, which is a contradiction.

Now, to prove the converse implication, let us assume that $\mathcal{B}(X_{i},r)$ is a boundary ball; then there exists $z\in\partial\mathcal{B}(X_{i},r)$ such that $z\in\partial\hat{S}_{n}(r)$. Let us prove that $d(z,\mathcal{X}_{n}\setminus X_{i})\geq r$ (from where it follows that $z\in\mbox{Vor}(X_{i})$). Suppose that $d(z,\mathcal{X}_{n}\setminus X_{i})<r$; then there exists $X_{j}\neq X_{i}$ such that $d(z,X_{j})<r$ and then $\mathcal{B}\big(z,r-d(z,X_{j})\big)\subset\mathring{\hat{S}}_{n}(r)$, which contradicts $z\in\partial\hat{S}_{n}(r)$. ∎

6.2 An algorithm to detect empty interior in the noiseless case using Theorem 1

In order to use Theorem 1 in practice to detect lower-dimensionality in the noiseless case, we need to fix a sequence $r_{n}\downarrow 0$ under the conditions indicated in Theorem 1 (ii). Note that this requires assuming lower bounds for the “thickness” constant $\rho({\mathcal{M}})=\sup_{x\in\mathcal{M}}d(x,\partial{\mathcal{M}})$ and the standardness constant $\delta$ (see Definition 4), as well as an upper bound for the radius of the outer rolling ball.

Now, according to Theorem 1, and Proposition 3, we will use the following algorithm.

1)

For $i=1,\dots,n$, let $V^{i}=\{V_{1}^{i},\ldots,V_{k_{i}}^{i}\}$ be the vertices of $\text{Vor}(X_{i})$.

2)

Let $\delta_{i}=\sup\{\|z-X_{i}\|,z\in\text{Vor}(X_{i})\}=\max\{\|X_{i}-V_{k}^{i}\|,1\leq k\leq k_{i}\}$ , since $\text{Vor}(X_{i})$ is a convex polyhedron. In the case that $\text{Vor}(X_{i})$ is an unbounded cell we put $\delta_{i}=\infty$ . Define $\delta_{0}=\min_{i}\delta_{i}$ .

3)

Decide $\mathring{\mathcal{M}}=\emptyset$ if and only if $\delta_{0}\geq r_{n}$.
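The boundary-ball test behind these steps can be illustrated with a small planar sketch. It does not compute Voronoi vertices. As a stand-in technique for $d=2$ only, it checks directly whether the circle $\partial\mathcal{B}(X_{i},r)$ is entirely covered by the open balls of radius $r$ around the other sample points, which by Proposition 3 is the negation of $\delta_{i}\geq r$. The function names are hypothetical, and the decision rule coded here (empty interior exactly when every ball is a boundary ball) is our reading of Theorem 1, to be checked against its statement.

```python
import math

TWO_PI = 2.0 * math.pi


def is_boundary_ball(i, points, r):
    """Planar check: True if some point of the circle of radius r around
    points[i] lies in no open ball of radius r centred at another point."""
    xi, yi = points[i]
    arcs = []
    for j, (xj, yj) in enumerate(points):
        dist = math.hypot(xj - xi, yj - yi)
        if j == i or dist == 0.0 or dist >= 2.0 * r:
            continue  # this ball covers no part of the circle
        # The open ball around points[j] covers the open arc of half-width
        # acos(dist / (2r)) centred at the direction from points[i] to points[j].
        half = math.acos(dist / (2.0 * r))
        start = (math.atan2(yj - yi, xj - xi) - half) % TWO_PI
        arcs.append((start, start + 2.0 * half))
        arcs.append((start - TWO_PI, start + 2.0 * half - TWO_PI))  # wrapped copy
    reach = 0.0  # every angle in [0, reach) is known to be covered
    for start, end in sorted(arcs):
        if start >= reach:
            return True  # the angle `reach` is left uncovered
        reach = max(reach, end)
        if reach >= TWO_PI:
            return False  # the whole circle is covered
    return True


def decide_empty_interior(points, r):
    """Assumed decision rule: report an empty interior exactly when every
    ball B(X_i, r) is a boundary ball."""
    return all(is_boundary_ball(i, points, r) for i in range(len(points)))
```

For instance, equally spaced points on the unit circle make every ball a boundary ball, whereas a regular grid filling the unit square contains balls that are completely surrounded. In higher dimensions the Voronoi-vertex computation described in the steps above would be needed instead.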

6.3 On the estimation of the maximum distance to the boundary

Theorems 3 and 4 involve the calculation of quantities such as $d(x,\partial\hat{S}_{n}(\epsilon_{n}))$ and $d(x,\partial C_{r}({\mathcal{Y}}_{n}))$ , where $\hat{S}_{n}(\epsilon_{n})$ is a Devroye-Wise estimator of type (6) and $C_{r}({\mathcal{Y}}_{n})$ is the $r$ -convex hull (2) of ${\mathcal{Y}}_{n}$ .

It is somewhat surprising to note that, in spite of the much simpler structure of $\hat{S}_{n}(\epsilon_{n})$ when compared to $C_{r}({\mathcal{Y}}_{n})$, the distance to the boundary $d(x,\partial C_{r}({\mathcal{Y}}_{n}))$ can be calculated in a simpler, more accurate way than the analogous quantity $d(x,\partial\hat{S}_{n}(\epsilon_{n}))$ for the Devroye-Wise estimator $\hat{S}_{n}(\epsilon_{n})$.

Indeed, note that $d(x,\partial C_{r}(\mathcal{Y}_{n}))$ is relatively simple to calculate; this is done in Berrendero, Cuevas and Pateiro-López (2012) in the two-dimensional case, although the method can in fact be used in any dimension. Observe first that $\partial C_{r}(\mathcal{Y}_{n})$ is included in a finite union of spheres of radius $r$, with centres in $Z=\{z_{1},\dots,z_{m}\}$. Then $d(x,\partial C_{r}(\mathcal{Y}_{n}))=\min_{z_{i}\in Z}\|x-z_{i}\|-r$. In order to find $Z$ we need to compute the Delaunay triangulation. Recall that the Delaunay triangulation, $\text{Del}(\mathcal{Y}_{n})$, is defined as follows. Let $\tau\subset\mathcal{Y}_{n}$,

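$$\tau\in\text{Del}(\mathcal{Y}_{n})\quad\text{if and only if}\quad\bigcap_{Y_{i}\in\tau}\text{Vor}(Y_{i})\neq\emptyset.$$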

Observe finally that, in any dimension, the intersection $\bigcap_{Y_{i}\in\tau}\text{Vor}(Y_{i})$, when it is non-empty, is a segment or a half line. If $\tau_{i}$ is the $d$ -dimensional simplex with vertices $\{Y_{i_{1}},\ldots,Y_{i_{d}}\}\subset\partial\mathcal{B}(z_{i},r)$ , the point $z_{i}$ can be obtained as $\bigcap_{Y^{i}_{j}\in\tau_{i}}\text{Vor}(Y_{i})\cap\mathcal{B}(Y^{i}_{1},r)$ .
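The formula $d(x,\partial C_{r}(\mathcal{Y}_{n}))=\min_{z_{i}\in Z}\|x-z_{i}\|-r$ can be tried out on small planar examples with the following sketch. It deliberately avoids the Delaunay triangulation: as a brute-force substitute for $d=2$, it enumerates the two circles of radius $r$ through each pair of sample points and keeps those whose open disc contains no sample point. The function names and the tolerance `tol` are hypothetical, and the cost is cubic in the sample size.

```python
import math
from itertools import combinations


def empty_ball_centres(points, r, tol=1e-9):
    """Centres of radius-r circles through two sample points whose open
    disc contains no sample point (brute force, planar case only)."""
    centres = []
    for (px, py), (qx, qy) in combinations(points, 2):
        dx, dy = qx - px, qy - py
        dist = math.hypot(dx, dy)
        if dist == 0.0 or dist >= 2.0 * r:
            continue  # no circle of radius r passes through both points
        mx, my = (px + qx) / 2.0, (py + qy) / 2.0
        h = math.sqrt(r * r - (dist / 2.0) ** 2)  # midpoint-to-centre distance
        ux, uy = -dy / dist, dx / dist  # unit normal to the segment pq
        for sign in (1.0, -1.0):
            c = (mx + sign * h * ux, my + sign * h * uy)
            # Keep the centre only if no sample point is strictly inside.
            if all(math.dist(c, s) >= r - tol for s in points):
                centres.append(c)
    return centres


def distance_to_rhull_boundary(x, centres, r):
    """Evaluate min_i ||x - z_i|| - r over the given centres."""
    return min(math.dist(x, c) for c in centres) - r
```

With the four corners of the unit square and $r=1$, the sketch keeps four centres, one beyond each side, and returns $\sqrt{3}/2-1/2\approx 0.366$ for the point $(0.5,0.5)$, which is the distance from the centre of the square to the inward-bulging arcs.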

6.4 Experiments

The general aim of these experiments is not to make an extensive, systematic empirical study. We are just trying to show that the methods and algorithm proposed here can be implemented in practice.

Detection of full dimensionality. We consider here a simple illustration of the use of Theorem 1 and the associated algorithm. In each case, we draw 200 samples of sizes $n=$ 50, 100, 200, 300, 400, 500, 1000, 2000, 5000, 10000 on the $A$-parallel set around the unit sphere, $\partial\mathcal{B}(0,1)\subset\mathbb{R}^{d}$; that is, the sample data are selected on $\mathcal{B}(0,1+A)\setminus\mathring{\mathcal{B}}(0,1-A)$. The width parameter $A$ takes the values $A=0,0.01,0.05,0.1,\ldots,0.05$. Table 1 provides the minimum sample sizes needed to “safely decide” the correct answer. This means deciding correctly, in at least 190 of the 200 samples considered, that the support is lower dimensional (in the case $A=0$) or that it is full dimensional (cases with $A>0$).

We have used the boundary balls procedure (here and in the denoising experiment below for $A=0$ ) with $r=2\max_{i}(\min_{j\neq i}\|X_{j}-X_{i}\|)$ .
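As an illustration, the data-driven radius $r=2\max_{i}(\min_{j\neq i}\|X_{j}-X_{i}\|)$ can be computed by brute force as in the following sketch. The function name and the `factor` argument are hypothetical, and the quadratic-time loop is only meant for moderate sample sizes.

```python
import math


def boundary_ball_radius(points, factor=2.0):
    """Return factor * max_i min_{j != i} ||X_j - X_i|| (needs >= 2 points)."""
    largest_nn = 0.0
    for i, p in enumerate(points):
        # Distance from p to its nearest neighbour among the other points.
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        largest_nn = max(largest_nn, nearest)
    return factor * largest_nn
```

For the three collinear points $(0,0)$, $(1,0)$, $(3,0)$ the nearest-neighbour distances are $1$, $1$ and $2$, so the sketch returns $r=4$.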

The results look quite reasonable: the larger the dimension $d$ and the smaller the width parameter $A$ , the harder the detection problem.

Denoising. We draw points on $\mathcal{B}(0,1.3)\setminus\mathring{\mathcal{B}}(0,0.7)$ in $\mathbb{R}^{2}$ and $\mathbb{R}^{3}$ .

In order to evaluate the effectiveness of the denoising procedure we define the random variable $e=\|Y\|-1$ from the denoised data $Y$ and also from the original data. Note that the “perfect” denoising would correspond to $e=0$. Figure 4 shows the kernel estimators of both densities of $e$ for the case $d=2$ (left panel) and for $d=3$ (right panel). These estimators for the denoised case are based on $m=100$ values of $e$ extracted from samples of sizes $n=$ 100, 1000, 10000. The density estimators for the initial distribution are based on samples of size 100. Clearly, when the denoised sample of size $m=100$ is based on a very large sample, with $n=10000$, the denoising process is better, as suggested by the fact that the corresponding density estimators are strongly concentrated around 0. The slight asymmetry in the three-dimensional case is explained by the fact that the “external” volume $\mathcal{B}(0,1.3)\setminus\mathcal{B}(0,1)$ is larger than the “internal” one $\mathcal{B}(0,1)\setminus\mathcal{B}(0,0.7)$.
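The volume comparison behind this asymmetry is a one-line computation. The sketch below assumes, for illustration only, that the original points are uniformly distributed on the shell $\mathcal{B}(0,1.3)\setminus\mathring{\mathcal{B}}(0,0.7)$, so that the probability of $e>0$ equals the share of the shell volume lying outside the unit ball. The function name is hypothetical.

```python
def outer_shell_fraction(d, inner=0.7, mid=1.0, outer=1.3):
    """Share of the volume of the shell between radii inner and outer, in
    dimension d, that lies outside the ball of radius mid."""
    # Ball volumes scale as radius**d, so the unit-ball constant cancels.
    return (outer ** d - mid ** d) / (outer ** d - inner ** d)
```

Under that assumption the share is $0.69/1.2=0.575$ for $d=2$ and $1.197/1.854\approx 0.646$ for $d=3$, so the imbalance towards positive values of $e$ grows with the dimension.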

Figures 5 and 6 provide a more visual idea on the result of the denoising algorithm. They correspond, respectively, to the set $\mathcal{B}(S_{L_{3}},0.3)$ (where $S_{L_{3}}=\{(x,y),|x|^{3}+|y|^{3}=1\}$ ) and to $\mathcal{B}(T,0.3)$ , where $T$ is the so-called Trefoil Knot, a well-known curve with interesting topological and geometric properties.

Minkowski contents estimation. Finally, in Table 3 we show, just as a tentative experiment, some results about the Minkowski contents estimation, again in the case of noiseless data ($R_{1}=0$) and noisy points (with $R_{1}=0.2$) drawn around a sphere for different values of $n$ and different dimensions.

For every $R_{1},n,d$ we estimate the Minkowski contents using a radius $r=0.5\sqrt{\max_{i}(\min_{j\neq i}\|X_{i}-X_{j}\|)}$ (see Theorem 2) when $R_{1}=0$ and with a deterministic radius $r=r_{0}(n,d)$ (slowly decreasing with the dimension, see Table 2) when $R_{1}=0.2$. The values of the estimators have been calculated via a Monte Carlo method based on $10^{5}$ points uniformly drawn on $\mathcal{B}(0,1+2r)\setminus\mathring{\mathcal{B}}(0,1-2r)$. For every $R_{1},n,d$ the experiment has been done $100$ times. Table 3 entries provide the average relative error (in percentage) in the estimation of the boundary Minkowski contents $L$. That is, the entries are $100\cdot err(R_{1},d)$ where $err(R_{1},d)=\frac{1}{L}\sqrt{\sum_{i}(L_{i}(R_{1},d)-L)^{2}/100}$, $L$ being the correct value of the boundary length in each case, that is $L=2\pi$, $4\pi$, $2\pi^{2}$, for $d=2,3,4$, respectively.

Even if we disregard the intrinsic difficulties associated with the Monte Carlo approximation, the outputs of Table 3 suggest that the denoising-based methodology for the estimation of the Minkowski content from noisy observations is not accurate for large dimensions. Note however that the problem is intrinsically difficult, as shown by the convergence rates obtained in the noiseless case. Note also that the noise level $R_{1}=0.2$ is quite large, especially for $d=3,4$. In any case, the results displayed in Figure 6 suggest a quite reasonable performance of the denoising procedure, for other descriptive or image analysis purposes. Clearly, more research would be needed to reach more definitive conclusions.
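The Monte Carlo step and the error summary can be sketched for the planar case $d=2$, $d^{\prime}=1$ as follows. This rests on an assumption of ours: that the estimator has the form $\mu_{d}(B(\mathcal{X}_{n},r))/(\omega_{d-d^{\prime}}r^{d-d^{\prime}})$, which for $d=2$ is the area of the union of discs divided by $2r$. The function names, the default of 5000 Monte Carlo points and the seed are hypothetical choices, not the settings of the experiment, and the sketch requires $r<1/2$.

```python
import math
import random


def minkowski_content_circle_mc(points, r, n_mc=5000, seed=0):
    """Monte Carlo value of area(B(points, r)) / (2 r) for planar points
    lying near the unit circle (assumed estimator form, r < 1/2)."""
    rng = random.Random(seed)
    lo, hi = 1.0 - 2.0 * r, 1.0 + 2.0 * r
    shell_area = math.pi * (hi * hi - lo * lo)
    hits = 0
    for _ in range(n_mc):
        # Uniform point in the shell: squared radius uniform, angle uniform.
        rho = math.sqrt(rng.uniform(lo * lo, hi * hi))
        theta = rng.uniform(0.0, 2.0 * math.pi)
        u = (rho * math.cos(theta), rho * math.sin(theta))
        if any(math.dist(u, p) <= r for p in points):
            hits += 1
    area = shell_area * hits / n_mc  # estimated area of the union of discs
    return area / (2.0 * r)  # omega_1 = 2 is the length of [-1, 1]


def relative_rmse(estimates, true_value):
    """Relative root mean squared error: (1/L) * sqrt(mean((L_i - L)^2))."""
    mse = sum((est - true_value) ** 2 for est in estimates) / len(estimates)
    return math.sqrt(mse) / true_value
```

With 100 equally spaced points on the unit circle and $r=0.1$, the returned value should be close to $2\pi$, up to Monte Carlo noise and a small negative bias due to the scalloped boundary of the union of discs.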

Acknowledgements

This research has been partially supported by MATH-AmSud grant 16-MATH-05 SM-HCD-HDD (C. Aaron and A. Cholaquidis) and Spanish grant MTM2016-78751-P (A. Cuevas). We are grateful to Luis Guijarro and Jesús Gonzalo (Dept. Mathematics, UAM, Madrid) for useful conversations and advice.

Bibliography

Aamari and Levrard (2015) Aamari, E. and Levrard, C. (2015). Stability and minimax optimality of tangential Delaunay complexes for manifold reconstruction. arXiv preprint arXiv:1512.02857v1.
Adler et al. (2016) Adler, R.J., Krishnan, S.R., Taylor, J.E. and Weinberger, S. (2015). Convergence of the reach for a sequence of Gaussian-embedded manifolds. arXiv preprint arXiv:1503.01733.
Amenta et al. (2002) Amenta, N., Choi, S., Dey, T.K. and Leekha, N. (2002). A simple algorithm for homeomorphic surface reconstruction. Internat. J. Comput. Geom. Appl. 12, 125–141.
Ambrosio, Colesanti and Villa (2008) Ambrosio, L., Colesanti, A. and Villa, E. (2008). Outer Minkowski content for some classes of closed sets. Math. Ann. 342, 727–748.
Arias-Castro et al. (2016) Arias-Castro, E., Pateiro-López, B. and Rodríguez-Casal, A. (2016). Minimax estimation of the volume of a set with smooth boundary. arXiv preprint arXiv:1605.01333v1.
Avila and Lyubich (2007) Avila, A. and Lyubich, M. (2007). Hausdorff dimension and conformal measures of Feigenbaum Julia sets. J. Am. Math. Soc. 21, 305–363.
Baldin and Reiss (2016) Baldin, N. and Reiss, M. (2016). Unbiased estimation of the volume of a convex body. Stochastic Process. Appl. 126, 3716–3732.
Berrendero, Cuevas and Pateiro-López (2012) Berrendero, J.R., Cuevas, A. and Pateiro-López, B. (2012). A multivariate uniformity test for the case of unknown support. Stat. Comput. 22, 259–271.
