Concentration inequalities for bounded functionals via generalized log-Sobolev inequalities

Friedrich Götze; Holger Sambale; Arthur Sinulis

arXiv:1812.01092·math.PR·June 16, 2020


TL;DR

This paper establishes multilevel concentration inequalities for bounded functionals of random variables, extending to dependent cases and providing applications in empirical processes, chaos, U-statistics, and random graphs.

Contribution

It introduces new concentration inequalities based on generalized log-Sobolev inequalities, applicable to both independent and dependent variables, with explicit constants involving higher order differences.

Findings

1. Derived tail bounds for empirical processes and chaos.

2. Provided concentration inequalities for U-statistics with bounded kernels.

3. Extended results to dependent variables and random graph models.

Abstract

In this paper we prove multilevel concentration inequalities for bounded functionals $f=f(X_{1},\ldots,X_{n})$ of random variables $X_{1},\ldots,X_{n}$ that are either independent or satisfy certain logarithmic Sobolev inequalities. The constants in the tail estimates depend on the operator norms of $k$-tensors of higher order differences of $f$. We provide applications for both dependent and independent random variables. This includes deviation inequalities for empirical processes $f(X)=\sup_{g\in\mathcal{F}}\lvert g(X)\rvert$ and suprema of homogeneous chaos in bounded random variables in the Banach space case given by $f(X)=\sup_{t}\lVert\sum_{i_{1}\neq\ldots\neq i_{d}}t_{i_{1}\ldots i_{d}}X_{i_{1}}\cdots X_{i_{d}}\rVert_{\mathcal{B}}$. The latter application is comparable to earlier results of Boucheron–Bousquet–Lugosi–Massart and provides the upper tail bounds of Talagrand. In…

Equations

\operatorname{Ent}_{\nu}(f^{2})\leq 2\int\lvert\nabla f\rvert^{2}\,d\nu,

\operatorname{Ent}_{\mu}(f^{2})\leq 2\sigma^{2}\int\Gamma(f)^{2}\,d\mu.

\operatorname{\mathbb{P}}\Big{(}\lvert f(X)-\operatorname{\mathbb{E}}f(X)\rvert\geq t\Big{)}\leq 2\exp\Big{(}-\frac{1}{C}\min_{k=1,\ldots,d}f_{k}(t)\Big{)}


\lvert A\rvert_{\mathrm{op}}\coloneqq\sup_{\begin{subarray}{c}v^{1},\ldots,v^{d}\in\operatorname{\mathbb{R}}^{n}\\ \lvert v^{j}\rvert\leq 1\end{subarray}}\langle v^{1}\cdots v^{d},A\rangle=\sup_{\begin{subarray}{c}v^{1},\ldots,v^{d}\\ \lvert v^{j}\rvert\leq 1\end{subarray}}\sum_{i_{1},\ldots,i_{d}}v_{i_{1}}^{1}\cdots v_{i_{d}}^{d}A_{i_{1}\ldots i_{d}},

T_{i}f\coloneqq T_{i}f(X)\coloneqq f(X_{i^{c}},X_{i}^{\prime})=f(X_{1},\ldots,X_{i-1},X_{i}^{\prime},X_{i+1},\ldots,X_{n})

\mathfrak{h}_{i}f(X)=\lVert f(X)-T_{i}f(X)\rVert_{i,\infty},\qquad\mathfrak{h}f(X)=(\mathfrak{h}_{1}f(X),\ldots,\mathfrak{h}_{n}f(X)),

\displaystyle\begin{split}\mathfrak{h}_{i_{1}\ldots i_{d}}f(X)=\;&\Big{\lVert}\,\prod_{s=1}^{d}\,(\mathrm{Id}-T_{i_{s}})f(X)\Big{\rVert}_{i_{1},\ldots,i_{d},\infty}\\ =\;&\Big{\lVert}\,f(X)+\sum_{k=1}^{d}\,(-1)^{k}\sum_{1\leq s_{1}<\ldots<s_{k}\leq d}T_{i_{s_{1}}\ldots i_{s_{k}}}f(X)\,\Big{\rVert}_{i_{1},\ldots,i_{d},\infty}\end{split}


\mathfrak{h}_{ij}f(X)=\lVert f(X)-T_{i}f(X)-T_{j}f(X)+T_{ij}f(X)\rVert_{i,j,\infty}.

\displaystyle\big{(}\mathfrak{h}^{(d)}f(X)\big{)}_{i_{1}\ldots i_{d}}=\begin{cases}\mathfrak{h}_{i_{1}\ldots i_{d}}f(X),&\text{if $i_{1},\ldots,i_{d}$ are distinct},\\ 0,&\text{else}.\end{cases}


\displaystyle\operatorname{\mathbb{P}}\left(\lvert f-\operatorname{\mathbb{E}}f\rvert\geq t\right)\leq 2\exp\Big{(}-\frac{1}{C}\min_{k=1,\ldots,d-1}\Big{(}\frac{t}{\lVert\mathfrak{h}^{(k)}f\rVert_{\mathrm{op},1}}\Big{)}^{2/k}\wedge\Big{(}\frac{t}{\lVert\mathfrak{h}^{(d)}f\rVert_{\mathrm{op},\infty}}\Big{)}^{2/d}\Big{)}.


\operatorname{\mathbb{P}}\left(\lvert f-\operatorname{\mathbb{E}}f\rvert\geq t\right)\leq 2\exp\Big{(}-\frac{1}{CM^{2}}\min\Big{(}\frac{t^{2}}{\lvert A\rvert_{\mathrm{HS}}^{2}},\frac{t}{\lvert A^{\mathrm{abs}}\rvert_{\mathrm{op}}}\Big{)}\Big{)}.


\lvert\mathfrak{d}f\rvert^{2}(x)=\sum_{i=1}^{n}\frac{1}{2}\iint(f(x_{i^{c}},y)-f(x_{i^{c}},y^{\prime}))^{2}\,d\mu(y\mid x_{i^{c}})\,d\mu(y^{\prime}\mid x_{i^{c}}).

\mathcal{E}(f,f)\coloneqq\sum_{i=1}^{n}\int\operatorname{Var}_{\mu(\cdot\mid x_{i^{c}})}(f(x_{i^{c}},\cdot))\,d\mu_{i^{c}}(x_{i^{c}})=\int\lvert\mathfrak{d}f\rvert^{2}\,d\mu.

\displaystyle\operatorname{\mathbb{P}}\left(\lvert f-\operatorname{\mathbb{E}}f\rvert\geq t\right)\leq 2\exp\Big{(}-\frac{1}{C}\min_{k=1,\ldots,d-1}\Big{(}\frac{t}{\lVert\mathfrak{h}^{(k)}f\rVert_{\mathrm{op},1}}\Big{)}^{2/k}\wedge\Big{(}\frac{t}{\lVert\mathfrak{h}^{(d)}f\rVert_{\mathrm{op},\infty}}\Big{)}^{2/d}\Big{)}.


g(X)\coloneqq g_{\mathcal{F}}(X)\coloneqq\sup_{f\in\mathcal{F}}\lvert f(X)\rvert.

\operatorname{\mathbb{P}}(g-\operatorname{\mathbb{E}}g\geq t)\leq 2\exp\Big{(}-\frac{1}{C}\min\Big{(}\min_{j=1,\ldots,d-1}\Big{(}\frac{t}{\operatorname{\mathbb{E}}W_{j}}\Big{)}^{2/j},\frac{t^{2/d}}{\lVert W_{d}\rVert_{\infty}}\Big{)}\Big{)}

g(X)\coloneqq\sup_{f\in\mathcal{F}}\Big{\lvert}\sum_{j=1}^{n}f(X_{j})\Big{\rvert}.


\operatorname{\mathbb{P}}\Big{(}g\geq\operatorname{\mathbb{E}}g+t\Big{)}\leq 2\exp\Big{(}-\frac{t^{2}}{15\sigma^{2}n\sup_{f\in\mathcal{F}}c(f)^{2}}\Big{)}.


\displaystyle f(X)\coloneqq f_{\mathcal{T}}(X)\coloneqq\sup_{t\in\mathcal{T}}\Big{\lVert}\sum_{I\in\mathcal{I}_{n,d}}X_{I}t_{I}\Big{\rVert},


\displaystyle\begin{split}W_{k}&\coloneqq\sup_{t\in\mathcal{T}}\sup_{v^{*}\in\mathcal{B}_{1}^{*}}\sup_{\begin{subarray}{c}\alpha^{1},\ldots,\alpha^{k}\in\operatorname{\mathbb{R}}^{n}\\ \lvert\alpha^{i}\rvert\leq 1\end{subarray}}v^{*}\Big{(}\sum_{\begin{subarray}{c}i_{1},\ldots,i_{k}\\ \mathrm{distinct}\end{subarray}}\alpha_{i_{1}}^{1}\cdots\alpha_{i_{k}}^{k}\sum_{\begin{subarray}{c}I\in\mathcal{I}_{n,d-k}\\ i_{1},\ldots,i_{k}\notin I\end{subarray}}X_{I}t_{I\cup\{i_{1},\ldots,i_{k}\}}\Big{)}\\ &=\sup_{t\in\mathcal{T}}\sup_{\begin{subarray}{c}\alpha^{1},\ldots,\alpha^{k}\in\operatorname{\mathbb{R}}^{n}\\ \lvert\alpha^{i}\rvert\leq 1\end{subarray}}\Big{\lVert}\sum_{\begin{subarray}{c}i_{1},\ldots,i_{k}\\ \mathrm{distinct}\end{subarray}}\alpha_{i_{1}}^{1}\cdots\alpha_{i_{k}}^{k}\sum_{\begin{subarray}{c}I\in\mathcal{I}_{n,d-k}\\ i_{1},\ldots,i_{k}\notin I\end{subarray}}X_{I}t_{I\cup\{i_{1},\ldots,i_{k}\}}\Big{\rVert},\end{split}


\displaystyle W_{k}=\sup_{\begin{subarray}{c}\alpha^{1},\ldots,\alpha^{k}\in\operatorname{\mathbb{R}}^{n}\\ \lvert\alpha^{i}\rvert\leq 1\end{subarray}}\sum_{\begin{subarray}{c}i_{1},\ldots,i_{k}\\ \mathrm{distinct}\end{subarray}}\alpha_{i_{1}}^{1}\cdots\alpha_{i_{k}}^{k}\sup_{t\in\mathcal{T}}\Big{\lVert}\sum_{\begin{subarray}{c}I\in\mathcal{I}_{n,d-k}\\ i_{1},\ldots,i_{k}\notin I\end{subarray}}X_{I}t_{I\cup\{i_{1},\ldots,i_{k}\}}\Big{\rVert}.

\lVert(f-\operatorname{\mathbb{E}}f)_{+}\rVert_{p}

\lVert f-\operatorname{\mathbb{E}}f\rVert_{p}

\begin{split}\operatorname{\mathbb{P}}\left(f-\operatorname{\mathbb{E}}f\geq t\right)&\leq 2\exp\Big{(}-\frac{1}{2\sigma^{2}(b-a)^{2}}\min_{k=1,\ldots,d}\Big{(}\frac{t}{de\operatorname{\mathbb{E}}W_{k}}\Big{)}^{2/k}\Big{)}\\ &\leq 2\exp\Big{(}-\frac{1}{2e^{2}\sigma^{2}(b-a)^{2}d^{2}}\min_{k=1,\ldots,d}\Big{(}\frac{t}{\operatorname{\mathbb{E}}W_{k}}\Big{)}^{2/k}\Big{)}\end{split},


T_{1}


T_{2}

\displaystyle\operatorname{\mathbb{P}}\left(f_{\mathcal{T}}(X)-\operatorname{\mathbb{E}}f_{\mathcal{T}}(X)\geq t\right)\leq 2\exp\Big{(}-\frac{1}{60(b-a)^{2}\sigma^{2}}\min\Big{(}\frac{t^{2}}{T_{1}^{2}},\frac{t}{T_{2}}\Big{)}\Big{)}.


T_{1}=\operatorname{\mathbb{E}}\sup_{t\in\mathcal{T}}\Big{(}\sum_{i=1}^{n}\Big{(}\sum_{j=1}^{n}t_{ij}X_{j}\Big{)}^{2}\Big{)}^{1/2},\qquad T_{2}=\sup_{t\in\mathcal{T}}\lvert T\rvert_{\mathrm{op}},


f(x)=\sum_{S\subset[n]}\hat{f}_{S}x_{S}=\sum_{j\in[n]}\sum_{S\subseteq[n]:\lvert S\rvert=j}\hat{f}_{S}x_{S},

\operatorname{\mathbb{P}}(\lvert f(X)-\operatorname{\mathbb{E}}f(X)\rvert\geq t)\leq\exp\Big{(}1-\min_{j=1,\ldots,d}\Big{(}\frac{t}{deW_{j}(f)^{1/2}}\Big{)}^{2/j}\Big{)}.



Full text


Friedrich Götze, Holger Sambale, Arthur Sinulis

Fakultät für Mathematik, Universität Bielefeld, Postfach 10 01 31, 33501 Bielefeld, Germany

Email: [email protected]

Concentration inequalities for bounded functionals via log-Sobolev-type inequalities

This research was supported by the German Research Foundation (DFG) via CRC 1283 “Taming uncertainty and profiting from randomness and low regularity in analysis, stochastics and their applications”.

Friedrich Götze

Holger Sambale

Arthur Sinulis∗


Abstract

In this paper we prove multilevel concentration inequalities for bounded functionals $f=f(X_{1},\ldots,X_{n})$ of random variables $X_{1},\ldots,X_{n}$ that are either independent or satisfy certain logarithmic Sobolev inequalities. The constants in the tail estimates depend on the operator norms of $k$-tensors of higher order differences of $f$.

We provide applications for both dependent and independent random variables. This includes deviation inequalities for empirical processes $f(X)=\sup_{g\in\mathcal{F}}\lvert g(X)\rvert$ and suprema of homogeneous chaos in bounded random variables in the Banach space case $f(X)=\sup_{t}\lVert\sum_{i_{1}\neq\ldots\neq i_{d}}t_{i_{1}\ldots i_{d}}X_{i_{1}}\cdots X_{i_{d}}\rVert_{\mathcal{B}}$. The latter application is comparable to earlier results of Boucheron–Bousquet–Lugosi–Massart and provides the upper tail bounds of Talagrand. In the case of Rademacher random variables, we give an interpretation of the results in terms of quantities familiar in Boolean analysis. Further applications are concentration inequalities for $U$-statistics with bounded kernels $h$ and for the number of triangles in an exponential random graph model.

MSC: 60E15, 05C80

1 Introduction

During the last forty years, the concentration of measure phenomenon has become an established part of probability theory with applications in numerous fields, as is witnessed by the monographs MS86 ; Led01 ; BLM13 ; RS14 ; vH16 . One way to prove concentration of measure is by using functional inequalities, more specifically the entropy method. It has emerged as a way to prove several groundbreaking concentration inequalities in product spaces by Talagrand Tal91 ; Tal96a , mainly in the works Led97 and BL97 , and further developed in Ma00 .

To convey the idea, let us recall that the logarithmic Sobolev inequality for the standard Gaussian measure $\nu$ in $\operatorname{\mathbb{R}}^{n}$ (see Gr75 ) states that for any $f\in C_{c}^{\infty}(\operatorname{\mathbb{R}}^{n})$ we have

$$\operatorname{Ent}_{\nu}(f^{2})\leq 2\int\lvert\nabla f\rvert^{2}\,d\nu, \tag{1}$$

where $\operatorname{Ent}_{\nu}(f^{2})=\int f^{2}\log f^{2}d\nu-\int f^{2}d\nu\log\int f^{2}d\nu$ is the entropy functional. Informally, it bounds the disorder of a function $f$ (under $\nu$ ) by its average local fluctuations, measured in terms of the length of the gradient. It is by now standard that (1) implies subgaussian tail decay for Lipschitz functions (e. g. by means of the Herbst argument). In particular, if $f\colon\mathbb{R}^{n}\to\mathbb{R}$ is a $\mathcal{C}^{1}$ function such that $|\nabla f|\leq L$ a.s., we have $\nu(|f-\int fd\nu|\geq t)\leq 2\exp(-t^{2}/(2L^{2}))$ for any $t\geq 0$ .
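As a quick numerical sanity check of the Gaussian tail bound above, the following Python sketch looks at the linear function $f(x)=\langle a,x\rangle$, which is Lipschitz with constant $L=\lvert a\rvert$ and satisfies $f\sim N(0,L^{2})$ under $\nu$, so that its exact two-sided tail equals $\operatorname{erfc}(t/(L\sqrt{2}))$. The function names are illustrative choices and do not come from the paper.

```python
import math

def exact_linear_tail(t, L):
    """Exact nu(|f - E f| >= t) for f(x) = <a, x> with |a| = L, i.e. f ~ N(0, L^2)."""
    return math.erfc(t / (L * math.sqrt(2.0)))

def subgaussian_bound(t, L):
    """Right-hand side 2 * exp(-t^2 / (2 L^2)) of the Lipschitz concentration bound."""
    return 2.0 * math.exp(-t * t / (2.0 * L * L))

if __name__ == "__main__":
    # Compare the exact tail with the bound for a few thresholds (L = 1.5).
    for t in (0.5, 1.0, 2.0, 4.0):
        print(t, exact_linear_tail(t, 1.5), subgaussian_bound(t, 1.5))
```

For linear functions the bound is not tight at small $t$ (it equals $2$ at $t=0$), but it captures the correct Gaussian decay rate in the exponent.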

If $\mu$ is a probability measure on a discrete set $\mathcal{X}$ (or a more abstract set not allowing for an immediate replacement for $\lvert\nabla f\rvert$ ), then there are several ways to reformulate equation (1), see e. g. DSC96 or BT06 . We continue these ideas by working in the framework of difference operators. Given a probability space $(\mathcal{Y},\mathcal{A},\mu)$ , we call any operator $\Gamma:L^{\infty}(\mu)\to L^{\infty}(\mu)$ satisfying $|\Gamma(af+b)|=a\,|\Gamma f|$ for all $a>0$ , $b\in\mathbb{R}$ a difference operator. Accordingly, we say that $\mu$ satisfies a $\Gamma\mathrm{-LSI}(\sigma^{2})$ , if for all bounded measurable functions $f$ we have

$$\operatorname{Ent}_{\mu}(f^{2})\leq 2\sigma^{2}\int\Gamma(f)^{2}\,d\mu. \tag{2}$$

Apart from the domain of $\Gamma$, it is clear that (2) can be seen as a generalization of (1) by defining $\Gamma(f)=\lvert\nabla f\rvert$ on $\operatorname{\mathbb{R}}^{n}$.

Another route to obtain concentration inequalities is to modify the entropy method, which was done in the framework of so-called $\varphi$ -entropies. The idea is to replace the function $\varphi_{0}(x)\coloneqq x\log x$ in the definition of the entropy $\operatorname{Ent}^{\varphi_{0}}_{\mu}(f)=\operatorname{\mathbb{E}}_{\mu}\varphi_{0}(f)-\varphi_{0}(\operatorname{\mathbb{E}}_{\mu}f)$ by other functions $\varphi$ . This has been studied in LO00 ; BLM03 ; Cha04 . In the seminal work BBLM05 the authors proved inequalities for $\varphi$ -entropies for power functions $\varphi(x)=\lvert x\rvert^{\alpha},\alpha\in(1,2]$ , leading to moment inequalities for independent random variables.

Originally, the entropy method was primarily used to prove sub-Gaussian concentration inequalities for Lipschitz-type functions. However, there are many situations of interest in which the functions under consideration are not Lipschitz or have Lipschitz constants which grow as the dimension increases even after a renormalization which asymptotically stabilizes the variance. Among the simplest examples are polynomial-type functions. Here, the boundedness of the gradient typically has to be replaced by more elaborate conditions on higher order derivatives (up to some order $d$ ). Moreover, we cannot have subgaussian tail decay anymore. This is already obvious if we consider the product of two independent standard normal random variables, which leads to subexponential tails. We refer to this topic as higher order concentration.

The earliest higher order concentration results date back to the late 1960s. Already in Bo68; Bo70 and Ne73, the growth of $L^{p}$ norms and hypercontractive estimates of polynomial-type functions in Rademacher or Gaussian random variables respectively have been studied. The question of estimating the growth of $L^{p}$ norms of multilinear polynomials in Gaussian random variables was considered in Bor84, AG93 and La06. In the context of Erdős–Rényi graphs and the triangle problem, concentration inequalities for polynomial functions gained considerable attention, in papers such as KV00.

More recently, multilevel concentration inequalities have been proven in Ad06 ; Wo13 ; AW15 for many classes of functions. These included $U$ -statistics in independent random variables, functions of random vectors satisfying Sobolev-type inequalities and polynomials in sub-Gaussian random variables respectively. We refer to inequalities of the type

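$$\operatorname{\mathbb{P}}\Big(\lvert f(X)-\operatorname{\mathbb{E}}f(X)\rvert\geq t\Big)\leq 2\exp\Big(-\frac{1}{C}\min_{k=1,\ldots,d}f_{k}(t)\Big) \tag{3}$$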

as multilevel or higher order ( $d$ -th order) concentration inequalities. This means that the tails might have different decay properties in some regimes of $[0,\infty)$ . Usually, we have $f_{k}(t)=(t/C_{k})^{2/k}$ for some constant $C_{k}$ which typically depends on the $k$ -th order derivatives.

To convey the basic idea of multilevel concentration inequalities, let us once again consider the case $d=2$ , e. g. a quadratic form of independent, say, Gaussian random variables. As sketched above, in this case the tails decay subexponentially in general. By means of a multilevel concentration inequality (the so-called Hanson–Wright inequality, which we address in more detail at a later point), we can show that while for $t$ large, subexponential tail decay holds, for small $t$ we even get subgaussian decay. In this sense, multilevel concentration inequalities provide refined tail estimates which do not only cover the behavior for large $t$ .
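The two regimes described above can be made concrete with a small Python sketch. It evaluates the rate $\min_{k}(t/C_{k})^{2/k}$ and reports which level $k$ attains the minimum. For $d=2$ the subgaussian level is active exactly when $t\leq C_{1}^{2}/C_{2}$. The constants passed in and the default `C=1.0` are placeholders for illustration, not values from the paper.

```python
import math

def level_rates(t, constants):
    """List of (t / C_k)^(2/k) for k = 1..d, where constants = [C_1, ..., C_d]."""
    return [(t / c) ** (2.0 / k) for k, c in enumerate(constants, start=1)]

def multilevel_rate(t, constants):
    """The exponent min_k (t / C_k)^(2/k) of a multilevel tail bound."""
    return min(level_rates(t, constants))

def active_level(t, constants):
    """Smallest k attaining the minimum, i.e. the regime governing the tail at t."""
    rates = level_rates(t, constants)
    return 1 + rates.index(min(rates))

def multilevel_tail_bound(t, constants, C=1.0):
    """2 * exp(-rate / C); C stands in for the absolute constant (placeholder)."""
    return 2.0 * math.exp(-multilevel_rate(t, constants) / C)

if __name__ == "__main__":
    consts = [2.0, 1.0]  # hypothetical C_1, C_2; crossover at t = C_1^2 / C_2 = 4
    for t in (1.0, 4.0, 10.0):
        print(t, active_level(t, consts), multilevel_tail_bound(t, consts))
```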

Our own work started with a second order concentration inequality on the sphere in BCG17 and was continued in BGS18 for bounded functionals of various classes of random variables (e. g. independent random variables or in the presence of a logarithmic Sobolev inequality (1)), and in GSS18 for weakly dependent random variables (e. g. the Ising model). In these papers, we studied higher order concentration, arriving at multilevel tail inequalities of type (3). If the underlying measure $\mu$ satisfies a logarithmic Sobolev inequality, (BGS18, Corollary 1.11) yields $f_{k}(t)=(t/C_{k})^{2/k}$ with $C_{k}=(\int|f^{(k)}|_{\mathrm{op}}^{2}d\mu)^{1/2}$ for $k=1,\ldots,d-1$ and $C_{d}=\sup|f^{(d)}|_{\mathrm{op}}$, where $\lvert f^{(k)}\rvert_{\mathrm{op}}$ denotes the operator norm of the respective tensors of $k$-th order partial derivatives. A downside of both BGS18 and GSS18 is that for functions of independent or weakly dependent random variables, comparable estimates involve Hilbert–Schmidt instead of operator norms, leading to weaker estimates in general.

A central aspect of the present article is to fix this drawback by a slightly more elaborate approach. Here, we consider both independent and dependent random variables. In either case, we prove multilevel concentration inequalities of the same type, and apply them to different forms of functionals. We provide improvements of earlier higher order concentration results like (BGS18, Theorem 1.1) or (GSS18, Theorem 1.5), replacing the Hilbert–Schmidt norms appearing therein by operator norms. This leads to sharper bounds and a wider range of applicability.

A special emphasis is placed on providing uniform versions of the higher order concentration inequalities. By this, we mean that we consider functionals of supremum type $f(X)=\sup_{f\in\mathcal{F}}\lvert f(X)\rvert$ , which includes suprema of polynomial chaoses, or empirical processes. Two more applications are given by $U$ -statistics in independent and weakly dependent random variables as well as a triangle counting statistic in some models of random graphs, for which we prove concentration inequalities.

Notations. Throughout this note, $X=(X_{1},\ldots,X_{n})$ is a random vector taking values in some product space $\mathcal{Y}=\otimes_{i=1}^{n}\mathcal{X}_{i}$ (equipped with the product $\sigma$-algebra) with law $\mu$, defined on a probability space $(\Omega,\mathcal{A},\mathbb{P})$. By abuse of language, we say that $X$ satisfies a $\Gamma\mathrm{-LSI}(\sigma^{2})$ if its distribution does. In any finite-dimensional vector space, we let $\lvert\cdot\rvert$ be the Euclidean norm, and for brevity, we write $[q]\coloneqq\{1,\ldots,q\}$ for any $q\in\operatorname{\mathbb{N}}$. Given a vector $x=(x_{j})_{j=1,\ldots,n}$ we write $x_{i^{c}}=(x_{j})_{j\neq i}$. For any $d$-tensor $A$ we define the Hilbert–Schmidt norm $\lvert A\rvert_{\mathrm{HS}}\coloneqq(\sum_{i_{1},\ldots,i_{d}}A_{i_{1}\ldots i_{d}}^{2})^{1/2}$ and the operator norm

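$$\lvert A\rvert_{\mathrm{op}}\coloneqq\sup_{\begin{subarray}{c}v^{1},\ldots,v^{d}\in\operatorname{\mathbb{R}}^{n}\\ \lvert v^{j}\rvert\leq 1\end{subarray}}\langle v^{1}\cdots v^{d},A\rangle=\sup_{\begin{subarray}{c}v^{1},\ldots,v^{d}\\ \lvert v^{j}\rvert\leq 1\end{subarray}}\sum_{i_{1},\ldots,i_{d}}v_{i_{1}}^{1}\cdots v_{i_{d}}^{d}A_{i_{1}\ldots i_{d}},$$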

using the outer product $(v^{1}\cdots v^{d})_{i_{1}\ldots i_{d}}=\prod_{j=1}^{d}v^{j}_{i_{j}}$ . For brevity, for any random $k$ -tensor $A$ and any $p\in(0,\infty]$ we abbreviate $\lVert A\rVert_{\mathrm{HS},p}=(\operatorname{\mathbb{E}}\lvert A\rvert_{\mathrm{HS}}^{p})^{1/p}$ as well as $\lVert A\rVert_{\mathrm{op},p}=(\operatorname{\mathbb{E}}\lvert A\rvert_{\mathrm{op}}^{p})^{1/p}$ . Lastly, we ignore any measurability issues that may arise. Thus, we assume that all the suprema used in this work are either countable or defined as $\sup_{t\in T}=\sup_{F\subset T:F\text{ finite}}\sup_{t\in F}$ .
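To illustrate the two tensor norms in the simplest case $d=2$, here is a minimal Python sketch for matrices given as lists of rows. The operator norm is approximated by alternately maximising $\langle v\,w,A\rangle$ over unit vectors $v$ and $w$, which for matrices amounts to power iteration on $A^{T}A$. The function names, the start vector and the iteration count are assumptions made for this example only.

```python
import math

def hs_norm(A):
    """Hilbert-Schmidt norm: square root of the sum of squared entries."""
    return math.sqrt(sum(a * a for row in A for a in row))

def op_norm(A, iters=200):
    """Approximate operator norm of a matrix (d = 2) by alternating maximisation
    of <v w, A> over unit vectors v, w (power iteration on A^T A)."""
    n_rows, n_cols = len(A), len(A[0])
    # Generic deterministic start vector; if it happened to be orthogonal to the
    # top right singular vector, this sketch would return a value that is too small.
    w = [math.sqrt(j + 2.0) for j in range(n_cols)]
    scale = math.sqrt(sum(x * x for x in w))
    w = [x / scale for x in w]
    value = 0.0
    for _ in range(iters):
        # Best unit v for fixed w is A w / |A w|.
        v = [sum(A[i][j] * w[j] for j in range(n_cols)) for i in range(n_rows)]
        nv = math.sqrt(sum(x * x for x in v))
        if nv == 0.0:
            return value
        v = [x / nv for x in v]
        # Best unit w for fixed v is A^T v / |A^T v|; its length is the current value.
        u = [sum(A[i][j] * v[i] for i in range(n_rows)) for j in range(n_cols)]
        value = math.sqrt(sum(x * x for x in u))
        if value == 0.0:
            return 0.0
        w = [x / value for x in u]
    return value
```

Since the operator norm never exceeds the Hilbert–Schmidt norm, bounds stated in terms of $\lvert\cdot\rvert_{\mathrm{op}}$ are at least as strong as their Hilbert–Schmidt counterparts.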

1.1 Main results

To formulate our main results, we introduce a difference operator labeled $\lvert\mathfrak{h}f\rvert$ which is frequently used in the method of bounded differences. Let $X^{\prime}=(X_{1}^{\prime},\ldots,X^{\prime}_{n})$ be an independent copy of $X$ , defined on the same probability space. Given $f(X)\in L^{\infty}(\mathbb{P})$ , define for each $i\in[n]$

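$$T_{i}f\coloneqq T_{i}f(X)\coloneqq f(X_{i^{c}},X_{i}^{\prime})=f(X_{1},\ldots,X_{i-1},X_{i}^{\prime},X_{i+1},\ldots,X_{n})$$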

and

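$$\mathfrak{h}_{i}f(X)=\lVert f(X)-T_{i}f(X)\rVert_{i,\infty},\qquad\mathfrak{h}f(X)=(\mathfrak{h}_{1}f(X),\ldots,\mathfrak{h}_{n}f(X)),$$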

where $\lVert\cdot\rVert_{i,\infty}$ denotes the $L^{\infty}$ -norm with respect to $(X_{i},X_{i}^{\prime})$ . The difference operator $\lvert\mathfrak{h}f\rvert$ is given as the Euclidean norm of the vector $\mathfrak{h}f$ .

We shall also need higher order versions of $\mathfrak{h}$ , denoted by $\mathfrak{h}^{(d)}f$ . They can be thought of as analogues of the $d$ -tensors of all partial derivatives of order $d$ in an abstract setting. To define the $d$ -tensor $\mathfrak{h}^{(d)}f$ , we specify it on its “coordinates”. That is, given distinct indices $i_{1},\ldots,i_{d}$ , we set

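$$\begin{split}\mathfrak{h}_{i_{1}\ldots i_{d}}f(X)=\;&\Big\lVert\,\prod_{s=1}^{d}\,(\mathrm{Id}-T_{i_{s}})f(X)\Big\rVert_{i_{1},\ldots,i_{d},\infty}\\ =\;&\Big\lVert\,f(X)+\sum_{k=1}^{d}\,(-1)^{k}\sum_{1\leq s_{1}<\ldots<s_{k}\leq d}T_{i_{s_{1}}\ldots i_{s_{k}}}f(X)\,\Big\rVert_{i_{1},\ldots,i_{d},\infty}\end{split} \tag{4}$$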

where $T_{i_{1}\ldots i_{d}}=T_{i_{1}}\circ\ldots\circ T_{i_{d}}$ exchanges the random variables $X_{i_{1}},\ldots,X_{i_{d}}$ by $X^{\prime}_{i_{1}},\ldots,X^{\prime}_{i_{d}}$ , and $\lVert\cdot\rVert_{i_{1},\ldots,i_{d},\infty}$ denotes the $L^{\infty}$ -norm with respect to the random variables $X_{i_{1}},\ldots,X_{i_{d}}$ and $X_{i_{1}}^{\prime},\ldots,X_{i_{d}}^{\prime}$ . For instance, for $i\neq j$ ,

$$\mathfrak{h}_{ij}f(X)=\lVert f(X)-T_{i}f(X)-T_{j}f(X)+T_{ij}f(X)\rVert_{i,j,\infty}.$$

Using the definition (4), we define tensors of $d$ -th order differences as follows:

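$$\big(\mathfrak{h}^{(d)}f(X)\big)_{i_{1}\ldots i_{d}}=\begin{cases}\mathfrak{h}_{i_{1}\ldots i_{d}}f(X),&\text{if $i_{1},\ldots,i_{d}$ are distinct},\\ 0,&\text{else}.\end{cases}$$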

Whenever no confusion is possible, we omit writing the random vector $X$ , i. e. we freely write $f$ instead of $f(X)$ and $\mathfrak{h}^{(d)}f$ instead of $\mathfrak{h}^{(d)}f(X)$ .
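For intuition, the first and second order differences can be computed by brute force when the coordinates take finitely many values. The Python sketch below assumes, purely for illustration, that the components are independent and that every $X_{i}$ attains each value of a common finite set `support` with positive probability, so that the $L^{\infty}$-norms reduce to maxima over that set. Indices are zero-based and all names are hypothetical.

```python
from itertools import product

def replace_coords(x, idx, vals):
    """Copy of the tuple x with x[i] replaced by the given value for each i in idx."""
    y = list(x)
    for i, v in zip(idx, vals):
        y[i] = v
    return tuple(y)

def h_first(f, x, i, support):
    """h_i f(x): max over values a, b of (X_i, X_i') of |f(x; x_i=a) - f(x; x_i=b)|."""
    return max(
        abs(f(replace_coords(x, (i,), (a,))) - f(replace_coords(x, (i,), (b,))))
        for a in support
        for b in support
    )

def h_second(f, x, i, j, support):
    """h_ij f(x): max of |f - T_i f - T_j f + T_ij f| over values of X_i, X_j, X_i', X_j'."""
    best = 0.0
    for a_i, a_j, b_i, b_j in product(support, repeat=4):
        val = (
            f(replace_coords(x, (i, j), (a_i, a_j)))
            - f(replace_coords(x, (i, j), (b_i, a_j)))
            - f(replace_coords(x, (i, j), (a_i, b_j)))
            + f(replace_coords(x, (i, j), (b_i, b_j)))
        )
        best = max(best, abs(val))
    return best
```

For example, with `support = (-1, 1)` the product $f(x)=x_{0}x_{1}$ has a nonzero second order difference, whereas any function that is additive in the coordinates has vanishing second order differences, mirroring the behaviour of mixed partial derivatives.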

Our first main theorem is a concentration inequality for general, bounded functionals of independent random variables $X_{1},\ldots,X_{n}$ .

Theorem 1.1

Let $X$ be a random vector with independent components, $f:\mathcal{Y}\to\operatorname{\mathbb{R}}$ a measurable function satisfying $f=f(X)\in L^{\infty}(\operatorname{\mathbb{P}})$ , $d\in\operatorname{\mathbb{N}}$ and define $C\coloneqq 217d^{2}$ . We have for any $t\geq 0$

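$$\operatorname{\mathbb{P}}\left(\lvert f-\operatorname{\mathbb{E}}f\rvert\geq t\right)\leq 2\exp\Big(-\frac{1}{C}\min_{k=1,\ldots,d-1}\Big(\frac{t}{\lVert\mathfrak{h}^{(k)}f\rVert_{\mathrm{op},1}}\Big)^{2/k}\wedge\Big(\frac{t}{\lVert\mathfrak{h}^{(d)}f\rVert_{\mathrm{op},\infty}}\Big)^{2/d}\Big).$$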

For the sake of illustration, let us consider the case of $d=2$ . Assuming that $X_{1},\ldots,X_{n}$ satisfy $\mathbb{E}X_{i}=0$ , $\mathbb{E}X_{i}^{2}=1$ and $\lvert X_{i}\rvert\leq M$ a.s., let $f(X)$ be the quadratic form $f(X)=\sum_{i<j}a_{ij}X_{i}X_{j}=X^{T}AX$ . Here, $a_{ij}\in\mathbb{R}$ for all $i<j$ , and $A$ is the symmetric matrix with zero diagonal and entries $A_{ij}=a_{ij}/2$ if $i<j$ . In this case, it is easy to see that $\lVert\mathfrak{h}f\rVert_{\mathrm{op},1}\leq\lVert\mathfrak{h}f\rVert_{\mathrm{op},2}\leq 4M\lvert A\rvert_{\mathrm{HS}}$ and $\lVert\mathfrak{h}^{(2)}f\rVert_{\mathrm{op},\infty}\leq 8M^{2}\lvert A^{\mathrm{abs}}\rvert_{\mathrm{op}}$ , where $A^{\mathrm{abs}}$ is the matrix given by $(A^{\mathrm{abs}})_{ij}=\lvert a_{ij}\rvert$ . As a result,

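$$\operatorname{\mathbb{P}}\left(\lvert f-\operatorname{\mathbb{E}}f\rvert\geq t\right)\leq 2\exp\Big(-\frac{1}{CM^{2}}\min\Big(\frac{t^{2}}{\lvert A\rvert_{\mathrm{HS}}^{2}},\frac{t}{\lvert A^{\mathrm{abs}}\rvert_{\mathrm{op}}}\Big)\Big).$$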

This is a version of the famous Hanson–Wright inequality. For the various forms of the Hanson–Wright inequality we refer to HW71 ; W73 ; HKZ12 ; RV13 ; VW15 ; Ad15 ; ALM18 .

Note that by a modification of our proofs (using arguments especially adapted to polynomials), it is possible to replace $\lvert A^{\mathrm{abs}}\rvert_{\mathrm{op}}$ by $\lvert A\rvert_{\mathrm{op}}$, thus avoiding the drawback of switching to a matrix with a possibly larger operator norm. See Sections 2.1 and 2.4 for details. On the other hand, Theorem 1.1 allows for any function $f$, not just quadratic forms, and the case of $d=2$ can in this sense be considered as a generalization of the Hanson–Wright inequality.

For a certain class of weakly dependent random variables $X_{1},\ldots,X_{n}$ , we can prove similar estimates as in Theorem 1.1. To this end, we introduce another difference operator, which is more familiar in the context of logarithmic Sobolev inequalities for Markov chains as developed in DSC96 . Assume that $\mathcal{Y}=\otimes_{i=1}^{n}\mathcal{X}_{i}$ for some finite sets $\mathcal{X}_{1},\ldots,\mathcal{X}_{n}$ , equipped with a probability measure $\mu$ and let $\mu(\cdot\mid x_{i^{c}})$ denote the conditional measure (interpreted as a measure on $\mathcal{X}_{i}$ ) and $\mu_{i^{c}}$ the marginal on $\otimes_{j\neq i}\mathcal{X}_{j}$ . Finally, set

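$$\lvert\mathfrak{d}f\rvert^{2}(x)=\sum_{i=1}^{n}\frac{1}{2}\iint(f(x_{i^{c}},y)-f(x_{i^{c}},y^{\prime}))^{2}\,d\mu(y\mid x_{i^{c}})\,d\mu(y^{\prime}\mid x_{i^{c}}).$$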

This difference operator appears naturally in the Dirichlet form associated to the Glauber dynamics of $\mu$, given by

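$$\mathcal{E}(f,f)\coloneqq\sum_{i=1}^{n}\int\operatorname{Var}_{\mu(\cdot\mid x_{i^{c}})}(f(x_{i^{c}},\cdot))\,d\mu_{i^{c}}(x_{i^{c}})=\int\lvert\mathfrak{d}f\rvert^{2}\,d\mu.$$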
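On a small finite space, the operator $\lvert\mathfrak{d}f\rvert^{2}$ and the Dirichlet form can be evaluated directly from the definitions. The following Python sketch assumes that $\mu$ is stored as a dictionary mapping configurations (tuples) to positive probabilities. This representation and the function names are illustrative assumptions, not part of the paper.

```python
def conditional(mu, x, i):
    """Conditional law mu(. | x_{i^c}) of coordinate i, as a dict value -> probability."""
    weights = {}
    for y, p in mu.items():
        if all(y[j] == x[j] for j in range(len(x)) if j != i):
            weights[y[i]] = weights.get(y[i], 0.0) + p
    total = sum(weights.values())
    return {v: p / total for v, p in weights.items()}

def d_squared(f, mu, x):
    """|d f|^2(x): sum over i of one half of the double integral of
    (f(x_{i^c}, y) - f(x_{i^c}, y'))^2 against mu(. | x_{i^c}) in y and y'."""
    total = 0.0
    for i in range(len(x)):
        cond = conditional(mu, x, i)
        for y, p in cond.items():
            for y_prime, q in cond.items():
                diff = f(x[:i] + (y,) + x[i + 1:]) - f(x[:i] + (y_prime,) + x[i + 1:])
                total += 0.5 * diff * diff * p * q
    return total

def dirichlet_form(f, mu):
    """E(f, f) computed as the integral of |d f|^2 with respect to mu."""
    return sum(p * d_squared(f, mu, x) for x, p in mu.items())
```

For the uniform measure on $\{-1,1\}^{2}$ and $f(x)=x_{0}+x_{1}$, each conditional variance equals $1$, so both `d_squared` at any point and `dirichlet_form` come out as $2$.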

In the next theorem, we require a $\mathfrak{d}$ –LSI for the underlying random variables $X_{1},\ldots,X_{n}$ . A number of models which satisfy this assumption will be discussed below.

Theorem 1.2

Let $X=(X_{1},\ldots,X_{n})$ be a random vector satisfying a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ and $f:\mathcal{Y}\to\operatorname{\mathbb{R}}$ a measurable function with $f=f(X)\in L^{\infty}(\operatorname{\mathbb{P}})$ . With the constant $C=15\sigma^{2}d^{2}>0$ we have for any $t\geq 0$

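$$\operatorname{\mathbb{P}}\left(\lvert f-\operatorname{\mathbb{E}}f\rvert\geq t\right)\leq 2\exp\Big(-\frac{1}{C}\min_{k=1,\ldots,d-1}\Big(\frac{t}{\lVert\mathfrak{h}^{(k)}f\rVert_{\mathrm{op},1}}\Big)^{2/k}\wedge\Big(\frac{t}{\lVert\mathfrak{h}^{(d)}f\rVert_{\mathrm{op},\infty}}\Big)^{2/d}\Big).$$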

Again, if $d=2$ , assuming that $\mathbb{E}X_{i}=0$ , $\mathbb{E}X_{i}^{2}=1$ , $\lvert X_{i}\rvert\leq M$ a.s. and $\mathbb{E}X_{i}X_{j}=0$ if $i\neq j$ , we arrive at a Hanson–Wright type inequality, this time including dependent situations. Similar results still hold if we remove the uncorrelatedness condition.

Let us discuss the $\mathfrak{d}$–LSI condition in more detail. First, any collection of independent random variables $X_{1},\ldots,X_{n}$ with finitely many values satisfies a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ with $\sigma^{2}$ depending on the minimal non-zero probability of the $X_{i}$ (cf. Proposition 6). In this situation, Theorem 1.1 and Theorem 1.2 only differ by constants.

However, the $\mathfrak{d}$–LSI condition is also satisfied by numerous models of dependent random variables, as shown in (GSS18, Proposition 1.1) (the Ising model) or (SS18, Theorem 3.1) (various other models). Let us recall some of them. The Ising model is the probability measure on $\{\pm 1\}^{n}$ defined by normalizing $\pi(\sigma)=\exp(\frac{1}{2}\sum_{i,j}J_{ij}\sigma_{i}\sigma_{j}+\sum_{i=1}^{n}h_{i}\sigma_{i})$ for a symmetric matrix $J=(J_{ij})$ with zero diagonal and some $h\in\mathbb{R}^{n}$. In (GSS18, Proposition 1.1), we have shown that if $\max_{i=1,\ldots,n}\sum_{j=1}^{n}\lvert J_{ij}\rvert\leq 1-\alpha$ and $\max_{i\in[n]}\lvert h_{i}\rvert\leq\widetilde{\alpha}$, the Ising model satisfies a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ with $\sigma^{2}$ depending on $\alpha$ and $\widetilde{\alpha}$ only. For the special case of $h=0$ and $J_{ij}=\beta$ for all $i\neq j$, we obtain the Curie–Weiss model. Here, the two conditions required above reduce to $\beta<1$.
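For very small $n$, the Ising measure can be written down explicitly by enumerating all $2^{n}$ spin configurations and normalizing $\pi$. The Python sketch below does this and also computes the quantity $\max_{i}\sum_{j}\lvert J_{ij}\rvert$ that enters the condition quoted above. The function names are hypothetical, and brute-force enumeration is only feasible for a handful of spins.

```python
import math
from itertools import product

def ising_measure(J, h):
    """Normalized Ising probabilities on {-1, +1}^n as a dict configuration -> probability."""
    n = len(h)
    weights = {}
    for s in product((-1, 1), repeat=n):
        exponent = 0.5 * sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
        exponent += sum(h[i] * s[i] for i in range(n))
        weights[s] = math.exp(exponent)
    z = sum(weights.values())  # normalizing constant (partition function)
    return {s: w / z for s, w in weights.items()}

def max_row_sum(J):
    """max_i sum_j |J_ij|, to be compared with 1 - alpha in the stated condition."""
    return max(sum(abs(a) for a in row) for row in J)

if __name__ == "__main__":
    beta = 0.2  # hypothetical coupling strength
    J = [[0.0, beta, beta], [beta, 0.0, beta], [beta, beta, 0.0]]
    mu = ising_measure(J, [0.0, 0.0, 0.0])
    print(max_row_sum(J), mu[(1, 1, 1)])
```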

Another simple model in which a $\mathfrak{d}$–LSI holds is the random coloring model. If $G=(V,E)$ is a finite graph and $C=\{1,\ldots,k\}$ is a set of colors, we denote by $\Omega_{0}\subset C^{V}$ the set of all proper colorings, i. e. the set of all $\omega\in C^{V}$ such that $\{v,w\}\in E\Rightarrow\omega_{v}\neq\omega_{w}$. In (SS18, Theorem 3.1), we have shown that the uniform distribution on $\Omega_{0}$ satisfies a $\mathfrak{d}$–LSI if the maximum degree $\Delta$ is uniformly bounded and $k\geq 2\Delta+1$ (strictly speaking, we consider sequences of graphs here). In (SS18, Theorem 3.1), we moreover prove $\mathfrak{d}$–LSIs for the (vertex-weighted) exponential random graph model and the hard-core model. We will further discuss the exponential random graph model in Section 2.4.
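The random coloring model is easy to explore on toy graphs. The following Python sketch enumerates the set $\Omega_{0}$ of proper colorings by brute force and checks the condition $k\geq 2\Delta+1$. The graph encoding (a list of vertices and a list of edges as pairs) and the function names are assumptions made for this illustration.

```python
from itertools import product

def proper_colorings(vertices, edges, k):
    """All maps omega: V -> {1, ..., k} with omega[v] != omega[w] for every edge {v, w}."""
    result = []
    for assignment in product(range(1, k + 1), repeat=len(vertices)):
        omega = dict(zip(vertices, assignment))
        if all(omega[v] != omega[w] for v, w in edges):
            result.append(omega)
    return result

def max_degree(vertices, edges):
    """Maximum degree Delta of the graph."""
    degree = {v: 0 for v in vertices}
    for v, w in edges:
        degree[v] += 1
        degree[w] += 1
    return max(degree.values(), default=0)

def enough_colors(vertices, edges, k):
    """Check the condition k >= 2 * Delta + 1 quoted in the text."""
    return k >= 2 * max_degree(vertices, edges) + 1
```

The uniform distribution on $\Omega_{0}$ then assigns probability `1 / len(proper_colorings(vertices, edges, k))` to each proper coloring, provided that at least one exists.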

The common feature in all these models is that the dependencies which appear can be controlled (e. g. by means of a coupling matrix which measures the interactions between the particles of the system under consideration, cf. (GSS18, Theorem 4.2)) in such a way that the model is not “too far” from a product measure. For instance, in the Curie–Weiss model, this just translates to $\beta<1$.

As a final remark, we discuss the LSI property with respect to various difference operators in Section 5. In particular, we show that the restriction to finite spaces which is implicit in Theorem 1.2 is natural since the $\mathfrak{d}\mathrm{-LSI}$ property requires the underlying space to be finite. By contrast, we prove that any set of independent random variables $X_{1},\ldots,X_{n}$ satisfies an $\mathfrak{h}$ –LSI $(1)$ . However, it seems that it is not possible to use the entropy method based on $\mathfrak{h}$ –LSIs.

The upper bound in Theorem 1.2 admits a “uniform version”, i. e. we can prove deviation inequalities for suprema of functions, in the following sense. Let $\mathcal{F}$ be a family of uniformly bounded, real-valued, measurable functions and set

$$g(X)\coloneqq\sup_{f\in\mathcal{F}}\lvert f(X)\rvert.\tag{7}$$

For any $d\in\operatorname{\mathbb{N}}$ and $j=1,\ldots,d$ let $W_{j}=W_{j}(X)\coloneqq\sup_{f\in\mathcal{F}}\lvert\mathfrak{h}^{(j)}f(X)\rvert_{\mathrm{op}}$ .

Theorem 1.3

Assume that either $X_{1},\ldots,X_{n}$ are independent or $X$ satisfies a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ and let $g=g(X)$ be as in (7). With the same constant $C$ as in Theorem 1.1 or 1.2, respectively, we have for any $t\geq 0$ the deviation inequality

[TABLE]

As mentioned before, Theorem 1.3 yields bounds for the upper tail only. The background is that the entropy method has certain limitations when it is applied to suprema of functions, cf. also Proposition 1 or Theorem 2.1 below. Roughly sketched, the reason is that when evaluating difference operators of suprema, if a positive part is involved we may typically choose a coordinate-independent maximizer of the terms involved. Without a positive part, this is no longer possible. See in particular the proof of Theorem 2.1, where we provide some further details.

Functionals of the form (7) have been considered in various works, starting from the first results in (Tal96a, Theorem 1.4), and continued in (Rio02, Théorème 1.1), (Ma00, Theorem 3) and (Bo02, Theorem 2.3) in the special case of

$$g(X)=\sup_{f\in\mathcal{F}}\Big\lvert\sum_{j=1}^{n}f(X_{j})\Big\rvert.\tag{8}$$

Further research has been done in KR05, (Sam07, Section 3) and more recently (Mar18, Proposition 5.4). In these works, Bennett-type inequalities have been proven for general independent random variables. Furthermore, (BBLM05, Theorem 10) treats the case $g(X)=\sup_{t\in\mathcal{T}}\sum_{i=1}^{n}t_{i}X_{i}$ for Rademacher random variables $X_{i}$ and a compact set of vectors $\mathcal{T}\subset\operatorname{\mathbb{R}}^{n}$. As a byproduct of our method, we prove a deviation inequality for $g$ which can be regarded as a uniform bounded differences inequality.

Proposition 1

Assume that $X=(X_{1},\ldots,X_{n})$ satisfies a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$, let $g=g(X)$ be as in (8), and let $c(f)$ be such that $\lvert f(x)-f(y)\rvert\leq c(f)$ for all $x,y$. For any $t\geq 0$ we have

[TABLE]

Let us put Proposition 1 into context. In the above mentioned works, the authors derive Bennett-type inequalities for independent random variables $X_{1},\ldots,X_{n}$ , whereas in our case the concentration inequalities have sub-Gaussian tails. It might be compared to the sub-Gaussian tail estimates for Bernoulli processes, see e. g. (Tal14, , Theorem 5.3.2). However, the $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ property is both more and less general. On the one hand, it is possible to include possibly dependent random vectors, but on the other hand for independent random variables it is only applicable if the $X_{i}$ take finitely many values.

1.2 Outline

In Section 2, we present a number of applications and refinements of our main results. Section 3 contains the proofs of our main theorems. The proofs of the results from Section 2 are deferred to Section 4. We close the paper by discussing different forms of logarithmic Sobolev inequalities with respect to various difference operators in Section 5.

2 Applications

In the sequel, we consider various situations in which our results can be applied. Some of them can be regarded as sharpenings of our main theorems for functions which have a special structure.

2.1 Uniform bounds

If the functions under consideration are of polynomial type, we may somewhat refine the results from the previous section. Here we focus on uniform bounds as discussed in Theorem 1.3.

Let $\mathcal{I}_{n,d}$ denote the family of subsets of $[n]$ with $d$ elements, fix a Banach space $(\mathcal{B},\lVert\cdot\rVert)$ with its dual space $(\mathcal{B}^{*},\lVert\cdot\rVert_{*})$ , a compact subset $\mathcal{T}\subset\mathcal{B}^{\mathcal{I}_{n,d}}$ and let $\mathcal{B}_{1}^{*}$ be the $1$ -ball in $\mathcal{B}^{*}$ with respect to $\lVert\cdot\rVert_{*}$ . Let $X=(X_{1},\ldots,X_{n})$ be a random vector with support in $[a,b]^{n}$ for some real numbers $a<b$ and define

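$$f(X)\coloneqq f_{\mathcal{T}}(X)\coloneqq\sup_{t\in\mathcal{T}}\Big\lVert\sum_{I\in\mathcal{I}_{n,d}}X_{I}t_{I}\Big\rVert,\tag{9}$$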

where $X_{I}\coloneqq\prod_{i\in I}X_{i}$ . For any $k\in[d]$ we let

[TABLE]

where for $k=d$ we use the convention $\mathcal{I}_{n,0}=\{\emptyset\}$ and $X_{\emptyset}\coloneqq 1$ .

One can interpret the quantities $W_{k}$ as follows: If $f_{t}(x)=\sum_{I\in\mathcal{I}_{n,d}}x_{I}t_{I}$ is the corresponding polynomial in $n$ variables, and $\nabla^{(k)}f_{t}(x)$ is the $k$ -tensor of all partial derivatives of order $k$ , then $W_{k}=\sup_{t\in\mathcal{T}}\lvert\nabla^{(k)}f_{t}(X)\rvert_{\mathrm{op}}$ . In this sense, we are considering the same quantities as in Theorem 1.3 but replace the difference operator $\mathfrak{h}$ by formal derivatives of the polynomial under consideration.
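As a hedged illustration of this interpretation, take $d=2$, $\mathcal{B}=\mathbb{R}$ and a set $\mathcal{T}$ consisting of a single coefficient family $t$ (all of these are simplifying assumptions). Then $\nabla f_{t}(x)=Tx$ for the symmetric matrix $T$ with zero diagonal built from $t$, and $\lvert\nabla f_{t}(x)\rvert_{\mathrm{op}}$ is the Euclidean norm of this vector. The names below are illustrative.

```python
import math

def f_t(t, x):
    """Polynomial sum over 2-element subsets {i, j}, i < j, of t_ij * x_i * x_j."""
    return sum(c * x[i] * x[j] for (i, j), c in t.items())

def symmetric_matrix(t, n):
    """Symmetric matrix T with zero diagonal and T_ij = t_ij for i < j."""
    T = [[0.0] * n for _ in range(n)]
    for (i, j), c in t.items():
        T[i][j] = c
        T[j][i] = c
    return T

def gradient(t, x):
    """Formal gradient of f_t at x, which equals T x."""
    n = len(x)
    T = symmetric_matrix(t, n)
    return [sum(T[i][j] * x[j] for j in range(n)) for i in range(n)]

def euclidean_norm(v):
    return math.sqrt(sum(c * c for c in v))

# Hypothetical coefficients and evaluation point.
t = {(0, 1): 1.0, (0, 2): -2.0, (1, 2): 0.5}
x = [1.0, -1.0, 1.0]
grad = gradient(t, x)
w1 = euclidean_norm(grad)  # first-order quantity for this single t
```

Because $f_{t}$ is multilinear, the unit increment $f_{t}(x+e_{i})-f_{t}(x)$ coincides with the $i$-th partial derivative, which gives a quick consistency check of `gradient`.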

Furthermore, the concentration inequalities are phrased with the help of the quantities

[TABLE]

Clearly $\widetilde{W}_{k}\geq W_{k}$ holds for all $k\in[d]$ .

Concentration properties for functionals as in (9) have been studied for independent Rademacher variables $X_{1},\ldots,X_{n}$ (i. e. $\operatorname{\mathbb{P}}(X_{i}=+1)=\operatorname{\mathbb{P}}(X_{i}=-1)=1/2$ ) and $\mathcal{B}=\mathbb{R}$ in (BBLM05, , Theorem 14) for all $d\geq 2$ , and under certain technical assumptions in Ad15 . We prove deviation inequalities in the weakly dependent setting, and afterwards discuss how these compare to the particular result in BBLM05 . It is easily possible to derive a similar result for functions of independent random variables (in the spirit of Theorem 1.1). As the corresponding proof is easily done by generalizing the proof of (BBLM05, , Theorem 14), we omit it.

Theorem 2.1

Let $X=(X_{1},\ldots,X_{n})$ be a random vector in $\operatorname{\mathbb{R}}^{n}$ with support in $[a,b]^{n}$ satisfying a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ . For $f=f(X)$ as in (9) and all $p\geq 2$ we have

[TABLE]

Consequently, for any $t\geq 0$

[TABLE]

and the same concentration inequalities hold with $\operatorname{\mathbb{E}}W_{k}$ replaced by $\operatorname{\mathbb{E}}\widetilde{W}_{k}$ .

Note that independent Rademacher random variables satisfy a $\mathfrak{d}\mathrm{-LSI}(1)$ (see e. g. (Gr75, Theorem 3) or (DSC96, Example 3.1)). Therefore, we get back (BBLM05, Theorem 14) from Theorem 2.1 (with slightly different constants). However, Theorem 2.1 moreover includes many models with dependencies like those discussed in the introduction. Therefore, it may be considered as an extension of (BBLM05, Theorem 14) to dependent situations and moreover to coefficients from any Banach space $\mathcal{B}$. For instance, we may consider an Ising chaos as a natural generalization of a Rademacher chaos to a dependent situation. In this case, Theorem 2.1 yields that we still obtain basically the same concentration properties if the dependencies are sufficiently weak (which is guaranteed by the conditions outlined in the introduction).

To illustrate our results further, let us consider the case of $d=2$ separately. Here we write

[TABLE]

The following corollary follows directly from Theorem 2.1.

Corollary 1

Assume that $X=(X_{1},\ldots,X_{n})$ satisfies a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ and is supported in $[a,b]^{n}$ and let $f_{\mathcal{T}}=f_{\mathcal{T}}(X)$ be as in (9) with $d=2$ . We have for all $t\geq 0$

[TABLE]

For the case of independent Rademacher variables, this recovers the upper tail in a famous result by Talagrand (Tal96a, , Theorem 1.2) on concentration properties of quadratic forms in Banach spaces, which has also been done in BBLM05 . Note that for $\mathcal{B}=\mathbb{R}$ , we have

[TABLE]

where $T$ is the symmetric matrix with zero diagonal and entries $T_{ij}=t_{ij}$ if $i<j$ . If $\mathcal{T}$ consists of a single element only, we have $T_{1}\leq\lvert T\rvert_{\mathrm{HS}}$ . Hence, Corollary 1 can be regarded as a generalized Hanson–Wright inequality.

2.2 The Boolean hypercube

The case of independent Rademacher random variables above can be interpreted in terms of quantities from Boolean analysis. Recall that any function $f:\{-1,+1\}^{n}\to\operatorname{\mathbb{R}}$ can be decomposed using the orthonormal Fourier–Walsh basis given by $(x_{S})_{S\subseteq[n]}$ for $x_{S}\coloneqq\prod_{i\in S}x_{i}$ . More precisely, we have

$$f(x)=\sum_{S\subseteq[n]}\hat{f}_{S}x_{S},$$

where the $(\hat{f}_{S})_{S\subset[n]}$ are given by $\hat{f}_{S}=\int x_{S}fd\mu$ and are called the Fourier coefficients of $f$ . For any $j\in[n]$ we define the Fourier weight of order $j$ as $W_{j}(f)\coloneqq\sum_{S\subseteq[n]:\lvert S\rvert=j}\hat{f}_{S}^{2}$ . It is clear that $\lVert f\rVert_{2}^{2}=\sum_{j=0}^{n}W_{j}(f)$ . The following multilevel concentration inequality can now be easily deduced.
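The Fourier coefficients and weights can be computed by direct enumeration for small $n$. The sketch below assumes that $\mu$ is the uniform measure on $\{-1,+1\}^{n}$; the function names and the example function are illustrative.

```python
import itertools
import math

def fourier_coefficients(f, n):
    """Return {S: f_hat_S}, where f_hat_S is the mean of x_S * f(x)
    under the uniform measure on {-1,+1}^n and S is a frozenset."""
    points = list(itertools.product((-1, 1), repeat=n))
    coeffs = {}
    for size in range(n + 1):
        for S in itertools.combinations(range(n), size):
            total = sum(f(x) * math.prod(x[i] for i in S) for x in points)
            coeffs[frozenset(S)] = total / len(points)
    return coeffs

def fourier_weights(coeffs, n):
    """weights[j] is the sum of f_hat_S^2 over all S with |S| = j."""
    weights = [0.0] * (n + 1)
    for S, c in coeffs.items():
        weights[len(S)] += c * c
    return weights

# Hypothetical example: f(x) = x_1 x_2 + x_3 (0-based indices in the code).
def f(x):
    return x[0] * x[1] + x[2]

n = 3
coeffs = fourier_coefficients(f, n)
weights = fourier_weights(coeffs, n)
l2_norm_sq = sum(f(x) ** 2 for x in itertools.product((-1, 1), repeat=n)) / 2 ** n
```

Comparing `l2_norm_sq` with `sum(weights)` checks the identity $\lVert f\rVert_{2}^{2}=\sum_{j}W_{j}(f)$ on this example.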

Proposition 2

Let $X_{1},\ldots,X_{n}$ be independent Rademacher random variables and let $f:\{-1,+1\}^{n}\to\operatorname{\mathbb{R}}$ be a function given in the Fourier–Walsh basis as $f(x)=\sum_{S\subseteq[n]:\lvert S\rvert\leq d}\hat{f}_{S}x_{S}$ for some $d\in\operatorname{\mathbb{N}},d\leq n$. For any $t>0$ we have

[TABLE]

In other words, the event $\lvert f(X)-\operatorname{\mathbb{E}}f(X)\rvert\leq de\max_{j=1,\ldots,d}(W_{j}(f)t^{j})^{1/2}$ holds with probability at least $1-\exp(1-t)$ .
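The probabilistic reformulation above can be checked exactly on a toy example. This sketch enumerates the cube for a degree-2 function; the function `inside_probability` and the chosen $f$ are assumptions made for illustration, and the Fourier weights are hard-coded rather than computed.

```python
import itertools
import math

def inside_probability(f, n, d, weights, t):
    """Exact probability that |f(X) - E f(X)| <= d*e*max_j sqrt(W_j * t^j)
    for X uniform on {-1,+1}^n; weights maps j to W_j(f)."""
    points = list(itertools.product((-1, 1), repeat=n))
    mean = sum(f(x) for x in points) / len(points)
    radius = d * math.e * max(math.sqrt(weights[j] * t ** j)
                              for j in range(1, d + 1))
    inside = sum(1 for x in points if abs(f(x) - mean) <= radius)
    return inside / len(points)

# Hypothetical example: f is a sum of three degree-2 characters with
# coefficient 1, so W_1(f) = 0 and W_2(f) = 3.
def f(x):
    return x[0] * x[1] + x[1] * x[2] + x[0] * x[2]

weights = {1: 0.0, 2: 3.0}
for t in (0.1, 1.0, 2.0, 5.0):
    lower_bound = 1.0 - math.exp(1.0 - t)
    print(t, inside_probability(f, 3, 2, weights, t), lower_bound)
```

On such a small example the bound is far from tight; the point is only to show how the quantities in the statement fit together.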

The literature on Boolean functions is vast, and a modern overview is given in OD14 . Especially for concentration results we may highlight (AW15, , Theorem 1.4) (which in particular holds for Boolean functions), which we discuss further and partially generalize to dependent models in Section 2.4. Proposition 2 may be of interest due to the direct use of quantities from Fourier analysis. Finally, we should add that while many concentration results for Boolean functions like (AW15, , Theorem 1.4) or also Proposition 2 are valid for functions whose Fourier–Walsh decomposition stops at some order $d$ , Theorem 1.1 or Theorem 1.2 work for functions with Fourier–Walsh decomposition possibly up to order $n$ .

2.3 Concentration properties of $U$ -statistics

Another application of Theorems 1.1 and 1.2 concerns the concentration properties of so-called $U$-statistics, which frequently arise in statistical theory. We refer to PG99 for an excellent monograph. More recently, concentration inequalities for $U$-statistics have been considered in Ad06, (AW15, Section 3.1.2) and (BGS18, Corollary 1.3).

Let $\mathcal{Y}=\mathcal{X}^{n}$ and assume that $X_{1},\ldots,X_{n}$ are either independent random variables, or the vector $X=(X_{1},\ldots,X_{n})$ satisfies a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ . Let $h:\mathcal{X}^{d}\to\operatorname{\mathbb{R}}$ be a measurable, symmetric function with $h(X_{i_{1}},\ldots,X_{i_{d}})\in L^{\infty}(\operatorname{\mathbb{P}})$ for any $i_{1},\ldots,i_{d}$ , and define $B\coloneqq\max_{i_{1}\neq\ldots\neq i_{d}}\lVert h(X_{i_{1}},\ldots,X_{i_{d}})\rVert_{L^{\infty}(\operatorname{\mathbb{P}})}$ . We are interested in the concentration properties of the $U$ -statistic with kernel $h$ , i. e. of

[TABLE]
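As an illustration, the following sketch evaluates a $U$-statistic under the assumption that it is the unnormalized sum of $h(X_{i_{1}},\ldots,X_{i_{d}})$ over all tuples of pairwise distinct indices, which is how the index set $i_{1}\neq\ldots\neq i_{d}$ in the definition of $B$ is read here. The function names and the kernel are hypothetical.

```python
import itertools

def u_statistic(h, xs, d):
    """Sum of h over all ordered d-tuples of pairwise distinct indices
    (assumed, unnormalized form of the statistic)."""
    n = len(xs)
    return sum(h(*(xs[i] for i in idx))
               for idx in itertools.permutations(range(n), d))

def kernel_bound(h, support, d):
    """Maximum of |h| over all d-tuples of values from a finite support,
    an upper bound for the constant B."""
    return max(abs(h(*values))
               for values in itertools.product(support, repeat=d))

# Hypothetical symmetric kernel of order d = 2.
def h(x, y):
    return x * y

sample = [1, 2, 3]
value = u_statistic(h, sample, 2)      # 2 * (1*2 + 1*3 + 2*3)
bound = kernel_bound(h, [1, 2, 3], 2)  # attained at x = y = 3
```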

Proposition 3

Let $X=(X_{1},\ldots,X_{n})$ be as above and $f=f(X)$ be as in (14). There exists a constant $C>0$ (the same as in Theorems 1.1 and 1.2) such that for any $t\geq 0$

[TABLE]

and for some $C=C(d)$

[TABLE]

The normalization $n^{1/2-d}$ in (15) is of the right order for $U$ -statistics generated by a non-degenerate kernel $h$ , i. e. $\mathrm{Var}(\operatorname{\mathbb{E}}_{X_{1}}h(X_{1},\ldots,X_{d}))>0$ , see (PG99, , Remarks 4.2.5). In the case of i.i.d. random variables $X_{1},\ldots,X_{n}$ it states that

[TABLE]

whenever $\operatorname{\mathbb{E}}h(X_{1},\ldots,X_{d})^{2}<\infty$ . Actually, (15) shows that for $t\leq n^{1/2}$ we have sub-Gaussian tails for any finite $n\in\operatorname{\mathbb{N}}$ for bounded kernels $h$ .

Proposition 3 improves upon our old result (BGS18, , Corollary 1.3) by providing multilevel tail bounds, thus yielding much finer estimates than the exponential moment bound given in the earlier paper. Moreover, it does not only address independent random variables but also weakly dependent models. As compared to the results from Ad06 and (AW15, , Section 3.1.2), Proposition 3 covers different types of measures, since in Ad06 independent random variables were considered, while in AW15 a Sobolev-type inequality was required, which does not include the various discrete models for which a $\mathfrak{d}$ –LSI holds.

2.4 Polynomials and subgraph counts in exponential random graph models

Lastly, let us once again consider polynomial functions. The case of independent random variables has been treated in (AW15, , Theorem 1.4) under more general conditions, so we omit it and concentrate on weakly dependent random variables.

Let $f_{d}:\operatorname{\mathbb{R}}^{n}\to\operatorname{\mathbb{R}}$ be a multilinear (also called tetrahedral) polynomial of degree $d$ , i. e. of the form

[TABLE]

for symmetric $k$ -tensors $a^{k}$ with vanishing diagonal. Here, a $k$ -tensor $a^{k}$ is called symmetric, if $a^{k}_{i_{1}\ldots i_{k}}=a^{k}_{\sigma(i_{1})\ldots\sigma(i_{k})}$ for any permutation $\sigma\in\mathcal{S}_{k}$ , and the (generalized) diagonal is defined as $\Delta_{k}\coloneqq\{(i_{1},\ldots,i_{k}):\lvert\{i_{1},\ldots,i_{k}\}\rvert<k\}$ . Denote by $\nabla^{(k)}f$ the $k$ -tensor of all partial derivatives of order $k$ of $f$ .

For the next result, given some $d\in\operatorname{\mathbb{N}}$ , we recall a family of norms $\lVert\cdot\rVert_{\mathcal{I}}$ on the space of $d$ -tensors for each partition $\mathcal{I}=\{I_{1},\ldots,I_{k}\}$ of $\{1,\ldots,d\}$ . The family $\lVert\cdot\rVert_{\mathcal{I}}$ has been first introduced in La06 , where it was used to prove two-sided estimates for $L^{p}$ norms of Gaussian chaos, and the definitions given below agree with the ones from La06 as well as AW15 and AKPS18 . For brevity, write $P_{d}$ for the set of all partitions of $\{1,\ldots,d\}$ . For each $l=1,\ldots,k$ we denote by $x^{(l)}$ a vector in $\operatorname{\mathbb{R}}^{n^{I_{l}}}$ , and for a $d$ -tensor $A=(a_{i_{1},\ldots,i_{d}})$ set

[TABLE]

We can regard the $\lVert A\rVert_{\mathcal{I}}$ as a family of operator-type norms. In particular, it is easy to see that $\lVert A\rVert_{\{1,\ldots,d\}}=\lvert A\rvert_{\mathrm{HS}}$ and $\lVert A\rVert_{\{\{1\},\ldots,\{d\}\}}=\lvert A\rvert_{\mathrm{op}}$ .
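For $d=2$ the two extreme partitions give the Hilbert–Schmidt and the operator norm of a matrix. The sketch below computes both without external libraries; approximating the operator norm by power iteration on $A^{T}A$ is a choice made here for self-containedness, not something taken from the cited works.

```python
import math

def hs_norm(A):
    """Hilbert-Schmidt norm: square root of the sum of squared entries."""
    return math.sqrt(sum(a * a for row in A for a in row))

def op_norm(A, iters=500):
    """Approximate the largest singular value by power iteration on A^T A.
    The start vector is generic but fixed; a start vector orthogonal to the
    top singular direction would make the iteration fail."""
    rows, cols = len(A), len(A[0])
    v = [1.0 / (j + 1) for j in range(cols)]
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(A[i][j] * Av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in w]  # v is a unit vector after this step
    Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(c * c for c in Av))

# Hypothetical symmetric matrix with vanishing diagonal.
A = [[0.0, 2.0],
     [2.0, 0.0]]
hs = hs_norm(A)  # sqrt(8)
op = op_norm(A)  # 2
```

In general the operator norm is bounded by the Hilbert–Schmidt norm, which is why bounds involving the whole family $\lVert\cdot\rVert_{\mathcal{I}}$ can be finer than bounds in terms of a single norm.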

The following result has been proven in the context of Ising models (in the Dobrushin uniqueness regime) in AKPS18 , and can easily be extended to any vector $X$ satisfying a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ . By invoking the family of norms $\lVert\cdot\rVert_{\mathcal{I}}$ , it provides a refinement of our general result for the special case of multilinear polynomials.

Theorem 2.2

Let $X$ be a random vector supported in $[-1,+1]^{n}$ and satisfying a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ , and $f_{d}=f_{d}(X)$ be as in (16). There exists a constant $C>0$ depending on $d$ only such that for all $t\geq 0$

[TABLE]

For illustration, let us once again consider the case of $d=2$. In the notation of (16), we take $a^{1}=0$ and $a^{2}=A$, i. e. $f_{2}(x)=x^{T}Ax$ for a symmetric matrix $A$ with vanishing diagonal. In this case, assuming the components of $X$ to be centered (so that the $k=1$ term vanishes), Theorem 2.2 reads

[TABLE]

i. e. we obtain a Hanson–Wright inequality in this situation. For higher orders, we arrive at similar bounds. Altogether, for the class of multilinear polynomials, Theorem 2.2 yields finer bounds than Theorem 1.2 (by virtue of the large class of norms involved), though for $d\geq 3$ explicit calculations of the norms involved can be difficult.

To point out one possible application, Theorem 2.2 can be used in the context of the exponential random graph model (ERGM). Let us briefly recall the definitions. Given $s\in\operatorname{\mathbb{N}}$, real numbers $\beta_{1},\ldots,\beta_{s}$ and simple graphs $G_{1},\ldots,G_{s}$ (with $G_{1}$ being a single edge by convention), the ERGM with parameter $\mathbf{\beta}=(\beta_{1},\ldots,\beta_{s},G_{1},\ldots,G_{s})$ is a probability measure on the space of all graphs on $n\in\operatorname{\mathbb{N}}$ vertices given by the weight function $\exp\left(\sum_{i=1}^{s}\beta_{i}n^{-\lvert V_{i}\rvert+2}N_{G_{i}}(x)\right)$, where $N_{G_{i}}(x)$ is the number of copies of $G_{i}$ in the graph $x$ and $\lvert V_{i}\rvert$ is the number of vertices of $G_{i}=(V_{i},E_{i})$. For details, see CD13 or SS18. One can think of the ERGM as an extension of the famous Erdős–Rényi model (which corresponds to the choice $s=1$) to account for dependencies between the edges.
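The weight function can be evaluated directly for a small graph. The sketch below takes $s=2$ with $G_{1}$ an edge and $G_{2}$ a triangle, and counts unlabeled copies of each; this counting convention is an assumption, since the references may count labeled copies, which would only rescale the parameters. All names are illustrative.

```python
import itertools
import math

def count_triangles(n, edge_set):
    """Number of vertex triples {a, b, c} whose three edges are all present."""
    return sum(
        1 for a, b, c in itertools.combinations(range(n), 3)
        if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= edge_set
    )

def ergm_weight(n, edges, beta_edge, beta_triangle):
    """Unnormalized weight exp(sum_i beta_i * n^(-|V_i| + 2) * N_{G_i}(x))
    for G_1 = single edge (|V_1| = 2) and G_2 = triangle (|V_2| = 3)."""
    edge_set = {frozenset(e) for e in edges}
    exponent = (beta_edge * n ** (-2 + 2) * len(edge_set)
                + beta_triangle * n ** (-3 + 2) * count_triangles(n, edge_set))
    return math.exp(exponent)

# Hypothetical example: the complete graph on three vertices.
n = 3
edges = [(0, 1), (1, 2), (0, 2)]
w = ergm_weight(n, edges, beta_edge=0.1, beta_triangle=0.3)
```

Normalizing these weights over all graphs on $n$ vertices would give the ERGM itself; with `beta_triangle = 0` the weight factorizes over edges, matching the remark about the case $s=1$.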

By way of example we show concentration properties of the number of triangles $T_{3}(X)=\sum_{\{e,f,g\}\in\mathcal{T}_{3}}X_{e}X_{f}X_{g}$ (where $\mathcal{T}_{3}$ denotes the set of all triples of edges forming a triangle). To formulate our results, we need to recall the function $\Phi_{\operatorname{\bm{\beta}}}(x)=\sum_{i=1}^{s}\beta_{i}\lvert E_{i}\rvert x^{\lvert E_{i}\rvert-1}$ which frequently appears in the discussion of the ERGM. Moreover, we set $\lvert\operatorname{\bm{\beta}}\rvert\coloneqq(\lvert\beta_{1}\rvert,\ldots,\lvert\beta_{s}\rvert)$. In the following corollary, the condition $\frac{1}{2}\Phi_{\lvert\operatorname{\bm{\beta}}\rvert}^{\prime}(1)<1$ ensures weak dependence in the sense that a $\mathfrak{d}$–LSI holds. As outlined above, in comparison to earlier results like (SS18, Theorem 3.2), using Theorem 2.2 yields sharper tail estimates.
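The weak dependence condition is easy to evaluate, since $\Phi_{\bm{\beta}}^{\prime}(1)=\sum_{i}\beta_{i}\lvert E_{i}\rvert(\lvert E_{i}\rvert-1)$ by termwise differentiation. A minimal sketch follows; the function names and the parameter values are illustrative assumptions.

```python
def phi(betas, edge_counts, x):
    """Phi_beta(x) = sum_i beta_i * |E_i| * x^(|E_i| - 1)."""
    return sum(b * e * x ** (e - 1) for b, e in zip(betas, edge_counts))

def phi_prime(betas, edge_counts, x):
    """Termwise derivative of Phi_beta; terms with |E_i| = 1 are constant
    and therefore drop out."""
    return sum(b * e * (e - 1) * x ** (e - 2)
               for b, e in zip(betas, edge_counts) if e >= 2)

def weak_dependence_condition(betas, edge_counts):
    """Check 1/2 * Phi'_{|beta|}(1) < 1, with |beta| taken entrywise."""
    abs_betas = [abs(b) for b in betas]
    return 0.5 * phi_prime(abs_betas, edge_counts, 1.0) < 1.0

# Hypothetical edge-triangle model: |E_1| = 1 (edge), |E_2| = 3 (triangle).
edge_counts = (1, 3)
print(weak_dependence_condition((0.5, -0.2), edge_counts))  # 0.5 * 6 * 0.2 = 0.6
print(weak_dependence_condition((0.5, 0.4), edge_counts))   # 0.5 * 6 * 0.4 = 1.2
```

For the edge-triangle model the condition therefore reduces to $\lvert\beta_{2}\rvert<1/3$, independently of $\beta_{1}$.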

Corollary 2

Let $X$ be an exponential random graph model with parameter $\operatorname{\bm{\beta}}=(\beta_{1},\ldots,\beta_{s},G_{1},\ldots,G_{s})$ such that $\frac{1}{2}\Phi_{\lvert\operatorname{\bm{\beta}}\rvert}^{\prime}(1)<1$ . There is a constant $C(\operatorname{\bm{\beta}})$ such that for all $t\geq 0$

[TABLE]

3 Concentration inequalities under logarithmic Sobolev inequalities: Proofs

In this section, we give the proofs of our main results. All of them work by first establishing a growth rate on the $L^{p}$ norms of $f-\operatorname{\mathbb{E}}f$ which will then be iterated. For technical reasons, we need to introduce some auxiliary difference operators which are closely related to $\mathfrak{h}$ . For $i\in[n]$ let

[TABLE]

where $\lVert f\rVert_{X_{i}^{\prime},\infty}$ shall denote the $L^{\infty}$ norm with respect to $X_{i}^{\prime}$ .

The $L^{p}$ norm inequalities which form the core of our proofs can be found in (BGS18, Theorem 2.3, Corollary 2.6) (building upon the earlier results in BBLM05). Note that as compared to BGS18, a different choice of normalization for $\mathfrak{h}^{\pm}$ leads to slightly different constants.

Theorem 3.1

If $X_{1},\ldots,X_{n}$ are independent random variables and $f=f(X)\in L^{\infty}(\operatorname{\mathbb{P}})$ , with the constant $\kappa=\frac{\sqrt{e}}{2\,(\sqrt{e}-1)}$ , we have for any $p\geq 2$ ,

[TABLE]

Consequently, this leads to

[TABLE]

Furthermore, we need an auxiliary statement relating differences of consecutive order. In BGS18, we have proven that $\lvert\mathfrak{h}\lvert\mathfrak{h}^{(d)}f\rvert_{\mathrm{HS}}\rvert\leq\lvert\mathfrak{h}^{(d+1)}f\rvert_{\mathrm{HS}}$. Moreover, we explained that a similar estimate with the Hilbert–Schmidt norms replaced by operator norms cannot be true. As we will see next, the key step in order to be able to invoke operator norms nevertheless is to work with $\mathfrak{h}^{+}$.

Here we need the following simple but crucial observation: if $A$ is a $d$-tensor, the supremum in the definition of $\lvert A\rvert_{\mathrm{op}}$ is attained, and if $A$ is a non-negative tensor (i. e. $A_{i_{1}\ldots i_{d}}\geq 0$ for all $i_{1},\ldots,i_{d}$), the maximizing vectors $\widetilde{v}^{1},\ldots,\widetilde{v}^{d}$ can be chosen to have non-negative entries only. Indeed, since $\widetilde{v}^{1}_{i_{1}}\cdots\widetilde{v}^{d}_{i_{d}}\leq\lvert\widetilde{v}^{1}_{i_{1}}\cdots\widetilde{v}^{d}_{i_{d}}\rvert$, we can replace each $\widetilde{v}^{j}$ by the vector $\lvert\widetilde{v}\rvert^{j}$ obtained by taking the absolute value element-wise, which preserves the Euclidean norm and does not decrease the value of the multilinear form.
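The observation on non-negative tensors can be illustrated numerically for $d=2$: replacing test vectors by their entrywise absolute values never decreases the bilinear form of a non-negative matrix. This is a sketch with randomly drawn, purely hypothetical data.

```python
import random

def bilinear_form(A, u, v):
    """sum_ij A_ij * u_i * v_j for a matrix A (a 2-tensor)."""
    return sum(A[i][j] * u[i] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def entrywise_abs(v):
    """Entrywise absolute value; the Euclidean norm is unchanged."""
    return [abs(c) for c in v]

rng = random.Random(0)  # fixed seed so the example is reproducible
A = [[rng.random() for _ in range(3)] for _ in range(3)]  # non-negative entries
u = [rng.uniform(-1.0, 1.0) for _ in range(3)]
v = [rng.uniform(-1.0, 1.0) for _ in range(3)]

original = bilinear_form(A, u, v)
improved = bilinear_form(A, entrywise_abs(u), entrywise_abs(v))
```

Here `improved >= original` holds termwise, which is exactly the inequality used in the text.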

Lemma 1

For any $d\geq 2$

[TABLE]

Proof

We have

[TABLE]

Here, in the first inequality we insert the vectors $\widetilde{v}^{1},\ldots,\widetilde{v}^{d-1}$ maximizing the supremum and use the monotonicity of $x\mapsto x_{+}$ , and the second and third inequality follow from the triangle inequality. Taking the square root yields the claim.

As a final step, we need to establish a connection between $L^{p}$ norm estimates and multilevel concentration inequalities. This is given by the following proposition, which was proven in (Ad06, Theorem 7) and (AW15, Theorem 3.3). We state it in the form given in (SS18, Proof of Theorem 3.6) with slight modifications.

Proposition 4

Assume that a random variable $f$ satisfies for any $p\geq 2$ and some constants $C_{1},\ldots,C_{d}\geq 0$ $\lVert f-\operatorname{\mathbb{E}}f\rVert_{p}\leq\sum_{k=1}^{d}C_{k}(p-s)^{k/2}$ for some $s\in[0,2)$ , and let $L\coloneqq\lvert\{l:C_{l}>0\}\rvert$ . For any $t\geq 0$ we have

[TABLE]

We will not give a proof of Proposition 4 and refer to the aforementioned works. However, the proof is almost identical to the proof of Proposition 2. The two important cases will be $s=0$ (for independent random variables) as well as $s=3/2$ (in the weakly dependent setting).

The proof of Theorem 1.1 is now easily completed.

Proof (Proof of Theorem 1.1)

Since $X_{1},\ldots,X_{n}$ are independent, Theorem 3.1 yields

[TABLE]

where we have used that for any positive random variable $W$

[TABLE]

The second term on the right hand side can now be estimated using Theorem 3.1 again, which in combination with Lemma 1 gives

[TABLE]

This can be easily iterated to obtain for any $d\in\operatorname{\mathbb{N}}$

[TABLE]

Now it remains to apply Proposition 4.

To prove Theorem 1.2, we shall require the following proposition, which is proven in (GSS18, Proposition 2.4). (Note that the definition of $\mathfrak{h}$ there differed by a factor of $\sqrt{2}$.) The estimate (20) does not appear therein, but is an easy modification of the proof.

Proposition 5

Let $\mu$ be a measure on a product of Polish spaces satisfying a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ . Then, for any $f\in L^{\infty}(\mu)$ and any $p\geq 2$ we have

[TABLE]

and

[TABLE]

Proof (Proof of Theorem 1.2)

The proof is very similar to the proof of Theorem 1.1. In the first step, using (19) leads to

[TABLE]

Equation (20) can be used to estimate the second term on the right hand side. So, for any $d\in\operatorname{\mathbb{N}}$ we have by an iteration

[TABLE]

Again we can apply Proposition 4 to obtain the concentration inequality.

To prove Theorem 1.3 we shall need the following lemma.

Lemma 2

Let $(\mathcal{B},\lVert\cdot\rVert)$ be a Banach space and $\mathcal{F}$ a family of uniformly norm-bounded, $\mathcal{B}$ -valued, measurable functions and set $g(X)=\sup_{f\in\mathcal{F}}\lVert f(X)\rVert$ . We have

[TABLE]

Proof

Fix an $X\in\mathcal{Y}$ and choose for any $\varepsilon>0$ a function $f_{\varepsilon}$ such that $\lVert f_{\varepsilon}(X)\rVert\geq\sup_{f\in\mathcal{F}}\lVert f(X)\rVert-\varepsilon$ . This yields

[TABLE]

where the first inequality follows by monotonicity of $x\mapsto x_{+}$ and the second one is a consequence of $(a+b-c)_{+}\leq(a-c)_{+}+b$ for $a,b,c\geq 0$ . Thus we have

[TABLE]

Taking the limit $\varepsilon\to 0$ yields the claim.

Proof (Proof of Theorem 1.3)

Note that in the real-valued case, the estimate $\mathfrak{h}_{i}^{+}\lvert f\rvert\leq\mathfrak{h}_{i}f$ holds. For brevity, let $s=3/2$ . Using this in combination with Proposition 5 and Lemma 2 yields

[TABLE]

We can apply Proposition 5 again on the right hand side, which gives

[TABLE]

A combination of Lemmas 1 and 2 shows that $\lvert\mathfrak{h}^{+}W_{j}\rvert\leq W_{j+1}$ , and so by an iteration we obtain

[TABLE]

In the case of independent random variables we replace the first step using Theorem 3.1. Here, $2\sigma^{2}=2\kappa$ and $s=0$ .

Proof (Proof of Proposition 1)

The proof shares some similarities with the proof of Lemma 2. Since $X$ satisfies a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ , we have for any $p\geq 2$

[TABLE]

Moreover, for any $i\in[n]$ and $x\in\mathcal{Y}$ , if a maximizer $\widetilde{f}$ of $\sup_{f\in\mathcal{F}}\lvert\sum_{j=1}^{n}f(x_{j})\rvert$ exists, we obtain

[TABLE]

If a maximizer $\widetilde{f}$ does not exist, these estimates remain valid by an approximation argument as in the proof of Lemma 2. Consequently, we have $\lVert(g-\operatorname{\mathbb{E}}g)_{+}\rVert_{p}\leq(2\sigma^{2}(p-3/2)n\sup_{f\in\mathcal{F}}c(f)^{2})^{1/2}.$ The claim now follows from Proposition 4.

4 Suprema of chaos, U-statistics and polynomials: Proofs

Proof (Proof of Theorem 2.1)

Let us first consider the case that $X$ satisfies a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ . Recall that we have by (20)

[TABLE]

We shall make use of the pointwise inequality $\lvert\mathfrak{h}^{+}f\rvert\leq(b-a)W_{1}.$ To see this, let $(\widetilde{t},\widetilde{v}^{*})$ be the tuple satisfying $\sup_{t\in\mathcal{T}}\sup_{v^{*}\in\mathcal{B}_{1}^{*}}v^{*}(\sum_{I\in\mathcal{I}_{n,d}}X_{I}t_{I})=\widetilde{v}^{*}(\sum_{I\in\mathcal{I}_{n,d}}X_{I}\widetilde{t}_{I})$ . We have

[TABLE]

proving the first part. Consequently,

[TABLE]

As in BBLM05 , this can now be iterated, i. e. we have for any $k\in\{1,\ldots,d-1\}$ $\lvert\mathfrak{h}^{+}W_{k}\rvert\leq(b-a)W_{k+1}$ . Here we may argue as above, where the only difference is to choose $(\widetilde{t},\widetilde{v}^{*})$ and $\widetilde{\alpha}^{(1)},\ldots,\widetilde{\alpha}^{(k)}$ which maximize $W_{k}$ . This finally leads to

[TABLE]

using that $W_{d}$ is constant. This proves (11). The same arguments are also valid without a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ property, if one considers $\lVert(f-\operatorname{\mathbb{E}}f)_{+}\rVert_{p}$ and applies Theorem 3.1 instead.

Lastly, to prove (12), let us first consider why we cannot argue as before. Note that the argument heavily relies on the positive part of the difference operator $\mathfrak{h}^{+}$ , which allows us to choose the maximizers $t_{1},\ldots,t_{n}$ independent of $i\in[n]$ . This is no longer possible for the concentration inequality. Here, Theorem 3.1 yields

[TABLE]

Thus this argument fails if we try to use these inequalities. However, we can rewrite $\mathfrak{h}_{i}f(x)=\sup_{x_{i}^{\prime},x_{i}^{\prime\prime}}(f(x_{i^{c}},x_{i}^{\prime})-f(x_{i^{c}},x_{i}^{\prime\prime}))_{+}=\sup_{x_{i}^{\prime}}\mathfrak{h}_{i}^{+}f(x_{i^{c}},x_{i}^{\prime})$ , where the $\sup$ is to be understood with respect to the support of $X_{i}^{\prime}$ . As a consequence, we have for each fixed $i\in[n]$ (again choosing $\widetilde{t}$ by maximizing the first summand in the brackets)

[TABLE]

This implies

[TABLE]

The proof is now completed using the same arguments as in the first part, with $W_{k}$ replaced by $\widetilde{W}_{k}$. The same argument is valid for $X$ satisfying a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$.
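The rewriting of $\mathfrak{h}_{i}f$ as a supremum of $\mathfrak{h}_{i}^{+}f$ used in this proof can be checked on a finite example. The sketch assumes that $\mathfrak{h}_{i}^{+}f(x)=\sup_{x_{i}^{\prime}}(f(x)-f(x_{i^{c}},x_{i}^{\prime}))_{+}$ without any additional normalizing factor; the normalization in the paper's definition may differ, so the code is an illustration of the identity only.

```python
import itertools

def substitute(x, i, value):
    """Return x with its i-th coordinate replaced by value."""
    return x[:i] + (value,) + x[i + 1:]

def h_plus(f, x, i, support):
    """Assumed form: sup over x_i' of (f(x) - f(x_{i^c}, x_i'))_+."""
    return max(max(f(x) - f(substitute(x, i, a)), 0) for a in support)

def h(f, x, i, support):
    """sup over x_i', x_i'' of (f(x_{i^c}, x_i') - f(x_{i^c}, x_i''))_+."""
    return max(max(f(substitute(x, i, a)) - f(substitute(x, i, b)), 0)
               for a in support for b in support)

# Hypothetical function on the cube {-1,+1}^3.
def f(x):
    return x[0] * x[1] + x[2]

support = (-1, 1)
for x in itertools.product(support, repeat=3):
    for i in range(3):
        sup_of_h_plus = max(h_plus(f, substitute(x, i, a), i, support)
                            for a in support)
        print(x, i, h(f, x, i, support), sup_of_h_plus)
```

In every printed row the last two numbers agree, and `h_plus` never exceeds `h`.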

Proof (Proof of Proposition 2)

The proposition can be proven using a similar technique as before, since the Hilbert–Schmidt norms of higher order differences act as Fourier projections. We choose to take an alternate route as follows. The proof of (OD14, Theorem 9.21) shows that for any $f$ with degree at most $d$ and any $p\geq 2$

[TABLE]

First off, by Chebyshev’s inequality we have for any $p\geq 1$

[TABLE]

We want to apply this to a $t$ -dependent parameter $p$ given by the function

[TABLE]

If $\eta_{f}(t)\geq 2$ , (21) yields $e\lVert f(X)-\operatorname{\mathbb{E}}f(X)\rVert_{\eta_{f}(t)}\leq t$ , which combined with the trivial estimate $\operatorname{\mathbb{P}}(\cdot)\leq 1$ gives

[TABLE]

as claimed.
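The Chebyshev step in the proof above rests on the standard estimate $\operatorname{\mathbb{P}}(\lvert Y\rvert\geq e\lVert Y\rVert_{p})\leq e^{-p}$, which follows from Markov's inequality applied to $\lvert Y\rvert^{p}$. The sketch below verifies it for a small discrete distribution; the distribution and function names are hypothetical.

```python
import math

def lp_norm(values, probs, p):
    """(E|Y|^p)^(1/p) for a discrete random variable Y."""
    return sum(q * abs(v) ** p for v, q in zip(values, probs)) ** (1.0 / p)

def tail_probability(values, probs, threshold):
    """P(|Y| >= threshold)."""
    return sum(q for v, q in zip(values, probs) if abs(v) >= threshold)

# Hypothetical distribution, centered by subtracting its mean.
values = [-2.0, 0.0, 1.0, 5.0]
probs = [0.2, 0.5, 0.25, 0.05]
mean = sum(v * q for v, q in zip(values, probs))
centered = [v - mean for v in values]

for p in (1, 2, 3, 5):
    threshold = math.e * lp_norm(centered, probs, p)
    print(p, tail_probability(centered, probs, threshold), math.exp(-p))
```

Choosing $p$ as a function of $t$ then turns a growth bound on $\lVert f-\operatorname{\mathbb{E}}f\rVert_{p}$ into an exponential tail bound, which is the mechanism behind both Proposition 2 and Proposition 4.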

Proof (Proof of Proposition 3)

We apply Theorems 1.1 and 1.2 in the respective cases. To this end, we make use of the general bound $\lVert\mathfrak{h}^{(k)}f\rVert_{\mathrm{op},1}\leq\lVert\mathfrak{h}^{(k)}f\rVert_{\mathrm{HS},\infty}$ for $k\in[d]$ . For any distinct $j_{1},\ldots,j_{k}$ write $\lVert\cdot\rVert=\lVert\cdot\rVert_{j_{1},\ldots,j_{k},\infty}$ , so that

[TABLE]

Now it is easy to see that $S_{i_{1},\ldots,i_{d}}(h,X)=0$ unless $\{j_{1},\ldots,j_{k}\}\subset\{i_{1},\ldots,i_{d}\}$ (for example, this follows if one writes the sum inside the norm as $\prod_{i=1}^{k}(\mathrm{Id}-T_{j_{i}})f$ ), and in these cases one can upper bound the supremum by $2^{k}B$ , from which we infer

[TABLE]

Consequently, this leads to

[TABLE]

Thus, an application of Theorem 1.1 or 1.2 respectively yields for any $t\geq 0$ and for $C$ as given therein

[TABLE]

For the second part, choose $t=Bn^{d-1/2}\widetilde{t}$ for $\widetilde{t}>0$ to obtain

[TABLE]

A short calculation shows that the minimum is attained for $k=1$ in the range $t\leq n^{1/2}$ and for $k=d$ otherwise, i. e.

[TABLE]

Proof (Proof of Theorem 2.2)

We give a sketch of the proof only and refer to (AKPS18, Proof of Theorem 2.2) for details. Recall that by (19) we have the inequality

[TABLE]

Using the arguments and notations from (AKPS18, Proof of Theorem 2.2) leads to

[TABLE]

where $M$ is an absolute constant and $G_{i}$ is a sequence of independent standard Gaussian random variables, independent of $X$ . Furthermore, a result by Latała La06 yields

[TABLE]

The rest now follows as in the previous proofs.

Proof (Proof of Corollary 2)

In SS18 the authors have proven that $\frac{1}{2}\Phi_{\lvert\operatorname{\bm{\beta}}\rvert}^{\prime}(1)<1$ implies a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$ for $\mu_{\operatorname{\bm{\beta}}}$ with a constant depending on the parameter $\operatorname{\bm{\beta}}$ only. Thus, it remains to bound the norms in (17). Note that due to the structure of the exponential random graph model, the expectations $\operatorname{\mathbb{E}}X_{G}$ and $\operatorname{\mathbb{E}}X_{H}$ are equal whenever $G$ and $H$ are isomorphic. Thus, we define $C_{S_{2}}\coloneqq\operatorname{\mathbb{E}}X_{S_{2}}$ (where $S_{2}$ is a $2$-star) and $C_{E}\coloneqq\operatorname{\mathbb{E}}X_{e}$.

The Euclidean norms can be easily bounded:

[TABLE]

and it remains to estimate the three remaining norms. However, in (AW15, Section 5.1), the authors give estimates for such norms in the Erdős–Rényi case, and it is easy to adapt these to any model with the property that $\operatorname{\mathbb{E}}X_{G}$ depends only on the isomorphism class of $G$ (in the complete graph). In particular, due to the structure of the exponential random graph models, this is true in this setting as well. This gives

[TABLE]

Inserting these estimates into (17) finishes the proof.

5 Logarithmic Sobolev inequalities and difference operators

To conclude this paper, we discuss the LSI property (2) for different choices of difference operators $\Gamma$ . Here, we always assume that the probability measure $\mu$ is defined on a product of Polish spaces $\mathcal{Y}=\otimes_{i=1}^{n}\mathcal{X}_{i}$ with product Borel $\sigma$ -algebra $\mathcal{A}=\mathcal{B}(\otimes_{i=1}^{n}\mathcal{X}_{i})$ .

In this situation, we can make use of the disintegration theorem on Polish spaces (see (DM78, Chapter III) and (AGS08, Theorem 5.3.1)): If $\mu$ is a measure on $\mathcal{Y}$, then for each $i\in\{1,\ldots,n\}$ we can decompose $\mu$ using the marginal measure $\mu_{i^{c}}$ (as a measure on $\otimes_{j\neq i}\mathcal{X}_{j}$) and a conditional measure on $\mathcal{X}_{i}$, which we denote by $\mu(\cdot\mid x_{i^{c}})$. More precisely, for any $A\in\mathcal{A}$ we have $\mu(A)=\int_{\otimes_{j\neq i}\mathcal{X}_{j}}\int_{\mathcal{X}_{i}}\mathbbm{1}_{A}(x_{i^{c}},x_{i})d\mu(x_{i}\mid x_{i^{c}})d\mu_{i^{c}}(x_{i^{c}})$.
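On a finite product space the disintegration reduces to elementary conditional probabilities. The following sketch splits a probability mass function into the marginal on the coordinates other than $i$ and the conditional laws of coordinate $i$, and then recovers $\mu(A)$ from the two pieces. It assumes every point in the dictionary has positive mass; all names are illustrative.

```python
from collections import defaultdict

def disintegrate(pmf, i):
    """Return (marginal on x_{i^c}, conditional laws of x_i given x_{i^c}).
    Points are tuples; assumes all listed probabilities are positive."""
    marginal = defaultdict(float)
    for x, p in pmf.items():
        marginal[x[:i] + x[i + 1:]] += p
    conditional = {}
    for x, p in pmf.items():
        rest = x[:i] + x[i + 1:]
        conditional.setdefault(rest, {})[x[i]] = p / marginal[rest]
    return dict(marginal), conditional

def measure_via_disintegration(A, marginal, conditional, i):
    """Iterated sum: outer over x_{i^c}, inner over x_i, of the indicator of A."""
    total = 0.0
    for rest, m in marginal.items():
        inner = sum(q for xi, q in conditional[rest].items()
                    if rest[:i] + (xi,) + rest[i:] in A)
        total += m * inner
    return total

# Hypothetical measure on {0,1}^2 and event A.
pmf = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
A = {(0, 1), (1, 1)}
marginal, conditional = disintegrate(pmf, 0)
mu_A = measure_via_disintegration(A, marginal, conditional, 0)
```

The value `mu_A` agrees with the direct sum of `pmf` over `A`, which is the finite-space form of the displayed identity.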

For finite spaces, $\mu(\cdot\mid x_{i^{c}})$ is just the ordinary conditional measure as used in the definition of the difference operator $\mathfrak{d}$. Note that the definition of $\mathfrak{d}$ can in principle be rewritten for products of arbitrary Polish spaces. However, our first result shows that the $\mathfrak{d}$-LSI property in fact requires the underlying space to be finite. More precisely, we say that $\mu$ has finite support if there is no sequence of sets $A_{n}\in\mathcal{A}$ with $\mu(A_{n})>0$ for all $n$ and $\mu(A_{n})\to 0$.

Proposition 6

Let $\mathcal{Y}=\otimes_{i=1}^{n}\mathcal{X}_{i}$ be a product of Polish spaces, and let $\mu$ be a probability measure on $\mathcal{Y}$ . If $\mu$ satisfies a $\mathfrak{d}$ -LSI, then $\mu$ has finite support. Moreover, if $\mu$ is a product probability measure, then $\mu$ satisfies a $\mathfrak{d}$ -LSI iff $\mu$ has finite support.

Proof

First assume $\mu$ does not have finite support, i. e. there is a sequence $A_{n}\in\mathcal{A}$ with $\mu(A_{n})>0$ for all $n$ and $\mu(A_{n})\to 0$. Choosing $f_{n}\coloneqq\mathbbm{1}_{A_{n}}\in L^{\infty}(\mu)$ and assuming a $\mathfrak{d}$-LSI$(\sigma^{2})$ holds, we obtain

[TABLE]

This easily leads to a contradiction.

On the other hand, let $\mu$ be a product probability measure with finite support. By tensorization, it suffices to consider $n=1$, and we may moreover assume $\mathcal{Y}$ to have finitely many elements only. Then, by (BT06, Remark 6.6), $\mu$ satisfies a $\mathfrak{d}$-LSI$(\sigma^{2})$ with $\sigma^{2}\leq C\log(1/\min_{y:\mu(y)>0}\mu(y))$, which finishes the proof.
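The mechanism behind Proposition 6 can be illustrated numerically in the simplest case. The following is a minimal sketch, assuming $n=1$, the two-point measure $\mu_{p}=p\delta_{1}+(1-p)\delta_{0}$, and that $\int\lvert\mathfrak{d}f\rvert^{2}d\mu$ agrees with the variance of $f$ up to a fixed constant factor (the exact normalization of $\mathfrak{d}$ is not restated here). All function names are illustrative only.

```python
import math

def entropy_of_square(values, probs):
    """Ent_mu(f^2) = E[f^2 log f^2] - E[f^2] log E[f^2], with 0 log 0 = 0."""
    mean_sq = sum(p * v * v for v, p in zip(values, probs))
    e_log = sum(p * v * v * math.log(v * v)
                for v, p in zip(values, probs) if v != 0)
    return e_log - mean_sq * math.log(mean_sq)

def variance(values, probs):
    """Variance of f under the given weights."""
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))

def lsi_ratio(p):
    """Ent(f^2) / Var(f) for f = indicator of {1} under mu_p."""
    values, probs = [1.0, 0.0], [p, 1.0 - p]
    return entropy_of_square(values, probs) / variance(values, probs)

# The ratio equals log(1/p) / (1 - p) and is unbounded as p -> 0.
ratios = [lsi_ratio(10.0 ** (-k)) for k in range(1, 8)]
for k, r in zip(range(1, 8), ratios):
    print(f"p = 1e-{k}: Ent(f^2) / Var(f) = {r:.3f}")
```

Under these assumptions the ratio grows like $\log(1/p)$, in line with the bound $\sigma^{2}\leq C\log(1/\min_{y:\mu(y)>0}\mu(y))$: indicators of sets with arbitrarily small positive mass rule out a finite $\mathfrak{d}$-LSI constant.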

In fact, Proposition 6 can be adapted to the difference operator $\mathfrak{h}^{+}$ as well. To see this, note that (23) can easily be rewritten for the difference operator $\mathfrak{h}^{+}$ (with only minor changes) and $\int\lvert\mathfrak{d}f\rvert^{2}d\mu\leq\int\lvert\mathfrak{h}^{+}f\rvert^{2}d\mu$. In particular, the $\mathfrak{d}$- and $\mathfrak{h}^{+}$-LSI properties are not essentially different.

The situation drastically changes if we consider $\mathfrak{h}$ -LSIs instead. Here, a sufficient condition for the $\mathfrak{h}\mathrm{-LSI}$ property to hold is that the measure $\mu$ satisfies an approximate tensorization (AT) property. As a consequence, for product probability measures, satisfying an $\mathfrak{h}$ -LSI is in fact a universal property.

Theorem 5.1

Let $\mathcal{Y}=\otimes_{i=1}^{n}\mathcal{X}_{i}$ be a product of Polish spaces, and let $\mu$ be a probability measure on $\mathcal{Y}$ . If $\mu$ satisfies an approximate tensorization property

[TABLE]

then $\mu$ also satisfies an $\mathfrak{h}\mathrm{-LSI}(C)$ . In particular, any product probability measure satisfies an $\mathfrak{h}\mathrm{-LSI}(1)$ .

To the best of our knowledge, Theorem 5.1 is new. For product measures, it might be compared to the Efron–Stein inequality (see e. g. ES81; St86) which establishes the tensorization property for the variance, and can be regarded as a universal Poincaré inequality with respect to $\mathfrak{d}$ (see e. g. BGS18 for such an interpretation). However, note that Theorem 5.1 (more precisely, the $\mathfrak{h}\mathrm{-LSI}(1)$ for product measures) does not imply the Efron–Stein inequality, as the difference operator is $\mathfrak{h}$ instead of $\mathfrak{d}$. Unfortunately, as Proposition 6 demonstrates, there is no “entropy version” of the Efron–Stein inequality of the form $\operatorname{Ent}_{\mu}(f^{2})\leq C\operatorname{\mathbb{E}}_{\mu}\lvert\mathfrak{d}f\rvert^{2}$ (for any product probability measure $\mu$ and some universal constant $C$).

Since, by Theorem 5.1, any set of independent random variables $X_{1},\ldots,X_{n}$ satisfies an $\mathfrak{h}$-LSI$(1)$, it might be tempting to regard Theorem 1.1 as an $\mathfrak{h}$-LSI analogue of Theorem 1.2. However, it seems that it is not possible to use the entropy method based on $\mathfrak{h}$-LSIs, so that this interpretation is not fully accurate. More precisely, Theorem 5.1 cannot be used to estimate the growth of $L^{p}$ norms as in the setting of a $\mathfrak{d}\mathrm{-LSI}(\sigma^{2})$. Indeed, it is impossible to prove the required moment inequalities

[TABLE]

under an $\mathfrak{h}\mathrm{-LSI}(\sigma^{2})$ . For example, the measure $\mu_{p}=p\delta_{1}+(1-p)\delta_{0}$ satisfies $\mathfrak{h}\mathrm{-LSI}(\sigma^{2}_{p})$ with $\sigma_{p}^{2}\sim p(1-p)\log(1/p)$ (for $p\to 0$ ), so that (25) would imply for $f(x)=x$ an upper bound on the Orlicz norm associated to $\Psi_{2}(x)=e^{x^{2}}-1$

[TABLE]

However, a simple calculation shows that $\operatorname{\mathbb{E}}\exp\big{(}\frac{(f-\operatorname{\mathbb{E}}f)^{2}}{16e^{2}\sigma_{p}^{2}}\big{)}\to\infty$ as $p\to 0$ .
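This divergence can be checked numerically. The sketch below assumes that $\sigma_{p}^{2}$ equals $p(1-p)\log(1/p)$ exactly (the text gives this only up to asymptotic equivalence) and works on the logarithmic scale to avoid floating-point overflow. The name `log_exp_moment` is illustrative.

```python
import math

ORLICZ_CONST = 16.0 * math.e ** 2  # the constant 16 e^2 appearing in the text

def log_exp_moment(p):
    """log E exp((f - E f)^2 / (16 e^2 sigma_p^2)) for f(x) = x under mu_p.

    Assumption: sigma_p^2 is taken to be exactly p (1 - p) log(1/p).
    """
    sigma_sq = p * (1.0 - p) * math.log(1.0 / p)
    scale = ORLICZ_CONST * sigma_sq
    # f - E f equals 1 - p with probability p and -p with probability 1 - p.
    terms = [
        math.log(p) + (1.0 - p) ** 2 / scale,
        math.log1p(-p) + p ** 2 / scale,
    ]
    # Log-sum-exp of the two contributions.
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

log_moments = {k: log_exp_moment(10.0 ** (-k)) for k in range(3, 10)}
for k, value in log_moments.items():
    print(f"p = 1e-{k}: log of the exponential moment = {value:.4g}")
```

The dominant contribution is $p\exp\big((1-p)^{2}/(16e^{2}\sigma_{p}^{2})\big)$, whose exponent grows like $1/(p\log(1/p))$ and therefore outweighs the prefactor $p$.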

The approximate tensorization property in Theorem 5.1 is interesting in its own right, but it is not yet well-studied. For finite spaces Ma15 gives sufficient conditions for a measure $\mu$ to satisfy an approximate tensorization property. Similar results have been derived in CMT15 , which can be applied in discrete and continuous settings. For example, if one considers a measure of the form

[TABLE]

for some countable spaces $\Omega_{i}$, $x_{i}\in\Omega_{i}$, measures $\mu_{0,i}$ on $\Omega_{i}$ and bounded functions $w_{ij}$, under certain technical conditions $\mu$ satisfies an approximate tensorization property. This does not require any functional inequality for $\mu_{0,i}$. Very recently, in (AKPS18, Proposition 5.4) it has been shown that the $\mathrm{AT}(C)$ property implies dimension-free concentration inequalities for convex functions.

Note that the $\mathrm{AT}(C)$ property requires a certain weak dependence assumption in general. For example, the push-forward of a random permutation $\pi$ of $[n]$ to $\operatorname{\mathbb{N}}^{n}$ cannot satisfy an approximate tensorization property. It is an interesting question to find necessary and sufficient conditions for the approximate tensorization property to hold.
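The permutation example can be made concrete for small $n$. The sketch below assumes that $\mathrm{AT}(C)$ takes the usual form $\operatorname{Ent}_{\mu}(f^{2})\leq C\sum_{i}\int\operatorname{Ent}_{\mu(\cdot\mid x_{i^{c}})}(f^{2})d\mu_{i^{c}}$, and it uses a uniform random permutation of $[4]$ with a hypothetical test function whose square takes the values $1$ and $2$. Since the remaining coordinate of a permutation is determined by the other $n-1$, every conditional measure is a point mass and every conditional entropy vanishes.

```python
import itertools
import math
from collections import defaultdict

def entropy(values):
    """Ent(g) = E[g log g] - E[g] log E[g] under the uniform law on `values`."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    e_log = sum(g * math.log(g) for g in values if g > 0) / len(values)
    return e_log - mean * math.log(mean)

def f_squared(x):
    """Hypothetical test function: f^2 = 2 if the first coordinate is 1, else 1."""
    return 2.0 if x[0] == 1 else 1.0

n = 4
perms = list(itertools.permutations(range(1, n + 1)))

# Left-hand side: entropy of f^2 under the uniform law on permutations.
lhs = entropy([f_squared(x) for x in perms])

# Right-hand side without the constant C: averaged conditional entropies.
rhs = 0.0
for i in range(n):
    fibres = defaultdict(list)
    for x in perms:
        fibres[x[:i] + x[i + 1:]].append(x)  # group by the other coordinates
    for members in fibres.values():
        weight = len(members) / len(perms)
        rhs += weight * entropy([f_squared(x) for x in members])

print(f"Ent(f^2) = {lhs:.4f}, sum of conditional entropies = {rhs:.4f}")
```

Here the left-hand side is strictly positive while the right-hand side is zero, so under the assumed form of $\mathrm{AT}(C)$ no finite constant $C$ can work.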

Proof (Proof of Theorem 5.1)

Let $X=(X_{1},\ldots,X_{n})$ be a $\mathcal{Y}$ -valued random vector with law $\mu$ . First we consider the case $n=1$ . By homogeneity of both sides, we may assume $\int f^{2}(X)d\operatorname{\mathbb{P}}=1$ . Since $f$ is bounded, we have $0\leq a\leq\lvert f(X)\rvert\leq b<\infty$ $\operatorname{\mathbb{P}}$ -a.s., where $b$ is the essential supremum of $\lvert f(X)\rvert$ and $a$ the essential infimum. Due to the constraints on the integral this leads to $a^{2}\leq 1\leq b^{2}$ . (Actually the cases $b=1$ or $a=1$ are trivial, since then $f^{2}(X)=1$ $\operatorname{\mathbb{P}}$ -a.s., but we will not make this distinction.) Let $F(u)\coloneqq\operatorname{\mathbb{P}}(f^{2}(X)\geq u)$ . In particular

[TABLE]

Using the partial integration formula (see e. g. (HS75, Theorem 21.67 and Remark 21.68)) in connection with (Bu07, Theorem 7.7.1) yields

[TABLE]

The first integral can be calculated explicitly

[TABLE]

and moreover we have due to $\log(u)\leq\log(b^{2})$ on $[a^{2},b^{2}]$

[TABLE]

Plugging in these two estimates yields

[TABLE]

Next, if we show that

[TABLE]

we can further estimate (as $\lvert\mathfrak{h}f\rvert^{2}$ is a deterministic quantity in the case $n=1$ )

[TABLE]

To prove (26), define

[TABLE]

Now it is easy to see that $g(a,1)=a^{2}\log a^{2}+(1-a^{2})-2(1-a)^{2}\leq 0,$ since $\partial_{a}g(a,1)\geq 0$ for $a\in[0,1]$ and $g(1,1)=0$ . Moreover

[TABLE]

so that $g$ is decreasing on every strip $\{a_{0}\}\times[1,\infty)$, and thus $g(a,b)\leq 0$ for all $(a,b)\in G$. This finishes the proof for $n=1$.
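The boundary estimate $g(a,1)\leq 0$ used above can be sanity-checked on a grid. This is only a numerical sketch of the stated formula for $g(a,1)$ and of its derivative $\partial_{a}g(a,1)=4(a\log a+1-a)$, which is obtained by differentiating that formula. It does not replace the analytic argument, and the function names are illustrative.

```python
import math

def g_at_b_equal_one(a):
    """g(a, 1) = a^2 log a^2 + (1 - a^2) - 2 (1 - a)^2, with 0 log 0 = 0."""
    log_term = a * a * math.log(a * a) if a > 0 else 0.0
    return log_term + (1.0 - a * a) - 2.0 * (1.0 - a) ** 2

def dg_da_at_b_equal_one(a):
    """Derivative in a: 4 (a log a + 1 - a), nonnegative since log a >= 1 - 1/a."""
    a_log_a = a * math.log(a) if a > 0 else 0.0
    return 4.0 * (a_log_a + 1.0 - a)

grid = [k / 1000 for k in range(1001)]  # a in [0, 1]
max_g = max(g_at_b_equal_one(a) for a in grid)
min_dg = min(dg_da_at_b_equal_one(a) for a in grid)

print(f"max of g(a, 1) on the grid: {max_g:.3e}")
print(f"min of its derivative on the grid: {min_dg:.3e}")
```

On the grid the maximum of $g(\cdot,1)$ is attained at $a=1$, where it vanishes, and the derivative stays nonnegative, consistent with the monotonicity argument.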

For arbitrary $n$, the proof is now easily completed. Assume that $f\in L^{\infty}(\mu)$, so that for $\mu_{i^{c}}$-almost every $x_{i^{c}}$ we have $f(x_{i^{c}},\cdot)\in L^{\infty}(\mu(\cdot\mid x_{i^{c}}))$. For these $x_{i^{c}}$, by the $n=1$ case we therefore obtain

[TABLE]

Plugging this into the assumption leads to

[TABLE]

As for the second part, it is a classical fact that independent random variables satisfy the tensorization property (i. e. $\mathrm{AT}(1)$), see for example (Led01, Proposition 5.6), (BBLM05, Theorem 4.10) or (vH16, Theorem 3.14). In the case of independent random variables, the assumption that $\mathcal{Y}$ is a product of Polish spaces can be dropped by simply defining $\mu(\cdot\mid x_{i^{c}})\coloneqq\mu_{i}=\operatorname{\mathbb{P}}\circ X_{i}$.
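The tensorization property $\mathrm{AT}(1)$ for product measures can be illustrated on small finite spaces. The sketch below assumes the usual form $\operatorname{Ent}_{\mu}(f^{2})\leq\sum_{i}\int\operatorname{Ent}_{\mu_{i}}(f^{2}(x_{i^{c}},\cdot))d\mu_{i^{c}}$ and tests it on randomly generated product measures and positive functions. The spaces, the sampling ranges and the helper names are illustrative choices.

```python
import itertools
import math
import random

def entropy(probs, values):
    """Ent(g) = E[g log g] - E[g] log E[g] for g > 0 under the given weights."""
    mean = sum(p * g for p, g in zip(probs, values))
    e_log = sum(p * g * math.log(g) for p, g in zip(probs, values))
    return e_log - mean * math.log(mean)

def random_distribution(size, rng):
    """A random probability vector with strictly positive entries."""
    raw = [rng.random() + 0.05 for _ in range(size)]
    total = sum(raw)
    return [r / total for r in raw]

def tensorization_gap(sizes, rng):
    """Return sum_i E[Ent_i(f^2)] - Ent_mu(f^2) for a random product measure."""
    marginals = [random_distribution(s, rng) for s in sizes]
    points = list(itertools.product(*(range(s) for s in sizes)))
    f_sq = {x: rng.uniform(0.1, 5.0) for x in points}
    mu = {x: math.prod(m[xi] for m, xi in zip(marginals, x)) for x in points}

    full = entropy([mu[x] for x in points], [f_sq[x] for x in points])
    conditional = 0.0
    for i, size in enumerate(sizes):
        rest = [range(s) for j, s in enumerate(sizes) if j != i]
        for others in itertools.product(*rest):
            # All points sharing the coordinates `others` outside position i.
            fibre = [others[:i] + (v,) + others[i:] for v in range(size)]
            weight = sum(mu[x] for x in fibre)
            # For a product measure the conditional law is the i-th marginal.
            conditional += weight * entropy(marginals[i], [f_sq[x] for x in fibre])
    return conditional - full

rng = random.Random(0)
gaps = [tensorization_gap((2, 3, 2), rng) for _ in range(200)]
print(f"smallest gap over 200 random trials: {min(gaps):.4g}")
```

A nonnegative gap in every trial is what $\mathrm{AT}(1)$ predicts. Such a check can only fail to find a counterexample, so it complements the cited proofs and does not replace them.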

Bibliography

1. Adamczak, R.: Moment inequalities for $U$-statistics. Ann. Probab. 34(6), 2288–2314 (2006). DOI 10.1214/009117906000000476
2. Adamczak, R.: A note on the Hanson-Wright inequality for random vectors with dependencies. Electron. Commun. Probab. 20, no. 72, 13 (2015). DOI 10.1214/ECP.v20-3829
3. Adamczak, R., Kotowski, M., Polaczyk, B., Strzelecki, M.: A note on concentration for polynomials in the Ising model. arXiv preprint (2018)
4. Adamczak, R., Latała, R., Meller, R.: Hanson–Wright inequality in Banach spaces. arXiv preprint (2018)
5. Adamczak, R., Wolff, P.: Concentration inequalities for non-Lipschitz functions with bounded derivatives of higher order. Probab. Theory Related Fields 162(3-4), 531–586 (2015). DOI 10.1007/s00440-014-0579-3
6. Aida, S., Stroock, D.W.: Moment estimates derived from Poincaré and logarithmic Sobolev inequalities. Math. Res. Lett. 1(1), 75–86 (1994). DOI 10.4310/MRL.1994.v1.n1.a9
7. Ambrosio, L., Gigli, N., Savaré, G.: Gradient flows in metric spaces and in the space of probability measures, second edn. Lectures in Mathematics ETH Zürich. Birkhäuser Verlag, Basel (2008)
8. Arcones, M.A., Giné, E.: On decoupling, series expansions, and tail behavior of chaos processes. J. Theoret. Probab. 6(1), 101–122 (1993). DOI 10.1007/BF01046771
