Relaxing the Assumptions of Knockoffs by Conditioning

Dongming Huang; Lucas Janson

arXiv:1903.02806·stat.ME·June 16, 2020

TL;DR

This paper extends the model-X knockoffs method by relaxing the assumption of knowing the exact covariate distribution, instead allowing for a parametric model with many parameters, while maintaining false discovery rate control.

Contribution

It shows that knockoffs guarantees hold when the covariate distribution is known only up to a parametric model, using conditioning on sufficient statistics.

Findings

1. Maintains FDR control under weaker assumptions.
2. Effective in Gaussian models with conditioning on sufficient statistics.
3. Simulations demonstrate robustness of the new approach.

Abstract

The recent paper Candès et al. (2018) introduced model-X knockoffs, a method for variable selection that provably and non-asymptotically controls the false discovery rate with no restrictions or assumptions on the dimensionality of the data or the conditional distribution of the response given the covariates. The one requirement for the procedure is that the covariate samples are drawn independently and identically from a precisely-known (but arbitrary) distribution. The present paper shows that the exact same guarantees can be made without knowing the covariate distribution fully, but instead knowing it only up to a parametric model with as many as $\Omega(n^{*}p)$ parameters, where $p$ is the dimension and $n^{*}$ is the number of covariate samples (which may exceed the usual sample size $n$ of labeled samples when unlabeled samples are also available). The key is to treat the…

Figures (39)

a of the paper, where the response is drawn from $Y_{i}\mid X_{i}\sim N(X_{i}^{\top}\bm{\beta}/\sqrt{n},1)$. The result is shown in Figure 12.

a but with varying sparsities and magnitudes. Specifically, the sparsity level $k$ varies between $30$, $60$, and $90$, and the nonzero entries are randomly sampled from Unif$(1,2)$. The message from these experiments is the same as those in the main paper, that is, the power of conditional knockoffs is almost the same as that of unconditional knockoffs even though it does not know the exact distribution of $\bm{X}$.

Tables (1)

Table 1: Maximum complexity of models allowed by existing methods (see Section 1.3) and our proposal (see the list in Section 2.2 and also Section 2.3 for the explanation for $\Omega(n^{*}p)$) for controlled variable selection. Note that without assuming a model, $F_{Y|X}$ and $F_{X}$ are of similar complexity (exponentially large in $p$).

	Model for $F_{Y|X}$	Model for $F_{X}$
Fixed-X	$o(n)$ parameters [3, 4]	arbitrary
Model-X (Candès et al., 2018)	arbitrary	$0$ parameters
Model-X (this paper)	arbitrary	$\Omega(n^{*}p)$ parameters

[3] In the exceptional case of Gaussian linear regression, $n$ parameters are allowed.
[4] Except for Gaussian linear regression, fixed-X inferential guarantees are only asymptotic.

Equations (459)

Y\perp\!\!\!\perp X_{j}\mid X_{\text{-}j},

\text{FDR}:=\mathbb{E}\left[\frac{|\hat{S}\,\cap\,\mathcal{H}_{0}|}{|\hat{S}|}\right],

\tilde{X}\perp\!\!\!\perp Y\mid X,\text{ and }(X,\tilde{X})_{\text{swap}(A)}\stackrel{\mathcal{D}}{=}(X,\tilde{X}),

(Z_{1},\dots,Z_{p},\tilde{Z}_{1},\dots,\tilde{Z}_{p})=z([\bm{X},\tilde{\bm{X}}],\bm{y}),

z([\bm{X},\tilde{\bm{X}}]_{\text{swap}(A)},\bm{y})=z([\bm{X},\tilde{\bm{X}}],\bm{y})_{\text{swap}(A)}.

w_{j}([\bm{X},\tilde{\bm{X}}]_{\text{swap}(A)},\bm{y})=\begin{cases}w_{j}([\bm{X},\tilde{\bm{X}}],\bm{y}),&\text{if }j\notin A,\\ -w_{j}([\bm{X},\tilde{\bm{X}}],\bm{y}),&\text{if }j\in A.\end{cases}

T_{0}=\min\left\{t>0:\frac{\#\{j:W_{j}\leq-t\}}{\#\{j:W_{j}\geq t\}}\leq q\right\},\quad T_{+}=\min\left\{t>0:\frac{1+\#\{j:W_{j}\leq-t\}}{\#\{j:W_{j}\geq t\}}\leq q\right\}.

([\bm{X},\tilde{\bm{X}}]_{\text{swap}(A)},\bm{y})\stackrel{\mathcal{D}}{=}([\bm{X},\tilde{\bm{X}}],\bm{y}),

\tilde{\bm{X}}\perp\!\!\!\perp\bm{y}\mid\bm{X},\text{ and }[\bm{X},\tilde{\bm{X}}]_{\text{swap}(A)}\stackrel{\mathcal{D}}{=}[\bm{X},\tilde{\bm{X}}],

\mathbb{E}\left[\frac{|\hat{S}\,\cap\,\mathcal{H}_{0}|}{\max\left(|\hat{S}|,1\right)}\right]\leq q\text{ for }T_{+};\quad\mathbb{E}\left[\frac{|\hat{S}\,\cap\,\mathcal{H}_{0}|}{|\hat{S}|+1/q}\right]\leq q\text{ for }T_{0}.

\tilde{\bm{X}}\perp\!\!\!\perp\bm{y}\mid\bm{X},\text{ and }[\bm{X},\tilde{\bm{X}}]_{\text{swap}(A)}\stackrel{\mathcal{D}}{=}[\bm{X},\tilde{\bm{X}}]\,\Big|\,T(\bm{X}),

F_{X}\in\left\{N(\bm{\mu},\bm{\Sigma}):\bm{\mu}\in\mathbb{R}^{p},\ \bm{\Sigma}\in\mathbb{R}^{p\times p},\ \bm{\Sigma}\succ 0\right\},

F_{X}\in\left\{N(\bm{\mu},\bm{\Sigma}):\bm{\mu}\in\mathbb{R}^{p},\ \bm{\Sigma}\in\mathbb{R}^{p\times p},\ \bm{\Sigma}\succ 0,\ (\bm{\Sigma}^{-1})_{j,k}=0\text{ for all }(j,k)\notin E\right\}

F_{X}\in\left\{\text{distribution on }\prod_{j=1}^{p}[K_{j}]:X_{j}\perp\!\!\!\perp X_{[p]\setminus N_{E}(j)}\mid X_{N_{E}(j)\setminus\{j\}}\text{ for all }(j,k)\notin E\right\}

\tilde{\bm{X}}^{*}\perp\!\!\!\perp\bm{y}\mid\bm{X}^{*},\text{ and }[\bm{X}^{*},\tilde{\bm{X}}^{*}]_{\text{swap}(A)}\stackrel{\mathcal{D}}{=}[\bm{X}^{*},\tilde{\bm{X}}^{*}]\,\Big|\,T(\bm{X}^{*}),

\bm{x}_{i}\stackrel{\text{i.i.d.}}{\sim}N(\bm{\mu},\bm{\Sigma})

\tilde{\bm{X}}=\bm{1}_{n}\hat{\bm{\mu}}^{\top}+(\bm{X}-\bm{1}_{n}\hat{\bm{\mu}}^{\top})(\bm{I}_{p}-\hat{\bm{\Sigma}}^{-1}\mathrm{diag}\{\bm{s}\})+\bm{U}\bm{L}.

\left\{N(\bm{\mu},\bm{\Sigma}):\bm{\mu}\in\mathbb{R}^{p},\ (\bm{\Sigma}^{-1})_{j,k}=0\text{ for all }j\neq k\text{ and }(j,k)\notin E,\ \bm{\Sigma}\succ 0\right\}

2|V_{k}|+|I_{V_{k}}\cap B|<n,

F_{X}\in\left\{\text{distribution on }\prod_{j=1}^{p}[K_{j}]\text{ satisfying the local Markov property w.r.t. }G\right\}.

\mathbb{P}\left(X_{B^{c}}\mid X_{B}\right)=\prod_{j\in B^{c}}\mathbb{P}\left(X_{j}\mid X_{B}\right)=\prod_{j\in B^{c}}\mathbb{P}\left(X_{j}\mid X_{I_{j}}\right),

\prod_{k_{j}\in[K_{j}],\,\bm{k}_{I_{j}}\in[\bm{K}_{I_{j}}]}\theta_{j}(k_{j},\bm{k}_{I_{j}})^{\mathbf{1}_{\{X_{j}=k_{j},\,X_{I_{j}}=\bm{k}_{I_{j}}\}}},

\prod_{i=1}^{n}\psi_{B}(X_{i,B})\prod_{j\in B^{c}}\ \prod_{k_{j}\in[K_{j}],\,\bm{k}_{I_{j}}\in[\bm{K}_{I_{j}}]}\theta_{j}(k_{j},\bm{k}_{I_{j}})^{N_{j}(k_{j},\bm{k}_{I_{j}})},

\mathbb{P}(X)

\mathbb{P}\left(X_{j}=0\mid X_{j-1}=1\right)=Q_{10}^{(j)},\quad\mathbb{P}\left(X_{j}=1\mid X_{j-1}=0\right)=Q_{01}^{(j)}

Q_{10}^{(j)}=\frac{U_{1}^{(j)}}{0.4+U_{1}^{(j)}+U_{2}^{(j)}},\quad Q_{01}^{(j)}=\frac{U_{3}^{(j)}}{0.4+U_{3}^{(j)}+U_{4}^{(j)}},

\mathbb{P}\left(X=\bm{x}\right)\propto\exp\left(\sum_{(s,t)\in E}\theta_{s,t}x_{s}x_{t}+\sum_{s\in V}h_{s}x_{s}\right),\qquad\bm{x}\in\{-1,+1\}^{V},

\phi([\bm{X}^{*},\tilde{\bm{X}}^{*}]_{\text{swap}(A)})\stackrel{\mathcal{D}}{=}\phi([\bm{X}^{*},\tilde{\bm{X}}^{*}])\,\Big|\,T(\bm{X}^{*}),

[\bm{X},\tilde{\bm{X}}]_{\text{swap}(A)}\stackrel{\mathcal{D}}{=}[\bm{X},\tilde{\bm{X}}],

M =

Code & Models

Repositories

stathuang/cknockoff (official)

Full text

Relaxing the Assumptions of Knockoffs by Conditioning

Dongming Huang

Lucas Janson

Abstract

The recent paper Candès et al. (2018) introduced model-X knockoffs, a method for variable selection that provably and non-asymptotically controls the false discovery rate with no restrictions or assumptions on the dimensionality of the data or the conditional distribution of the response given the covariates. The one requirement for the procedure is that the covariate samples are drawn independently and identically from a precisely-known (but arbitrary) distribution. The present paper shows that the exact same guarantees can be made without knowing the covariate distribution fully, but instead knowing it only up to a parametric model with as many as $\Omega(n^{*}p)$ parameters, where $p$ is the dimension and $n^{*}$ is the number of covariate samples (which may exceed the usual sample size $n$ of labeled samples when unlabeled samples are also available). The key is to treat the covariates as if they are drawn conditionally on their observed value for a sufficient statistic of the model. Although this idea is simple, even in Gaussian models conditioning on a sufficient statistic leads to a distribution supported on a set of zero Lebesgue measure, requiring techniques from topological measure theory to establish valid algorithms. We demonstrate how to do this for three models of interest, with simulations showing the new approach remains powerful under the weaker assumptions.

Keywords. High-dimensional inference, knockoffs, model-X, sufficient statistic, false discovery rate (FDR), topological measure, graphical model

1 Introduction

1.1 Problem statement

In this paper we consider random variables $(Y,X_{1},\dots,X_{p})$ where $Y$ is a response or outcome variable, each $X_{j}$ is a potential explanatory variable (also known as a covariate or feature) and $p$ is the dimensionality, or number of covariates. For instance, $Y$ could be the binary indicator of whether a patient has a disease or not, and $X_{j}$ could be the number of minor alleles at a specific location (indexed by $j$ ) on the genome, also known as a single nucleotide polymorphism (SNP). A common question of interest is which of the $X_{j}$ are important for determining $Y$ , with importance defined in terms of conditional independence. That is, $X_{j}$ is considered unimportant (or null) if

$$Y\perp\!\!\!\perp X_{j}\mid X_{\text{-}j},$$

where $X_{\text{-}j}=\{X_{1},\dots,X_{p}\}\setminus\{X_{j}\}$; stated another way, $X_{j}$ is unimportant exactly when $Y$’s conditional distribution does not depend on $X_{j}$. Denote by $\mathcal{H}_{0}$ the set of all $j$ such that $X_{j}$ is unimportant. As discussed in Candès et al. (2018), under very mild conditions the complement of the set of unimportant variables, i.e., the important (or non-null) variables, constitutes the Markov blanket $S$ of $Y$, namely, the unique smallest set $S$ such that $Y\perp\!\!\!\perp X_{\text{-}S}\mid X_{S}$. Note that when $Y\,|\,X_{1},\dots,X_{p}$ follows a generalized linear model (GLM) with no redundant covariates, the set of important variables exactly equals the set of variables with nonzero coefficients, as usual (Candès et al., 2018).

In our search for the Markov blanket we usually cannot hope for perfect recovery, so we instead attempt to maximize the number of important variables discovered while probabilistically controlling the number of false discoveries. In this paper, as with most others in the knockoffs literature (footnote 1: Janson and Su (2016) show how the last step of knockoffs can easily be modified to control other error rates such as the $k$-familywise error rate), we consider the false discovery rate (FDR) (Benjamini and Hochberg, 1995), defined for a (random) selected subset of variables $\hat{S}$ as

$$\text{FDR}:=\mathbb{E}\left[\frac{|\hat{S}\,\cap\,\mathcal{H}_{0}|}{|\hat{S}|}\right],$$

i.e., the expected fraction of discoveries that are not in the Markov blanket (false discoveries), where we use the convention that $0/0=0$ . Controlling the FDR at, say, $10\%$ is powerful as compared to controlling more classical error rates like the familywise error rate, while still being interpretable, allowing a statistician to report a conclusion such as “here is a set of covariates $\hat{S}$ , 90% of which I expect to be important.”
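To make the definition concrete, the following minimal Python sketch computes the false discovery proportion $|\hat{S}\cap\mathcal{H}_{0}|/|\hat{S}|$ for a single realization of $\hat{S}$; the FDR is the expectation of this quantity over the data. The function name and the toy index sets are illustrative assumptions, not taken from the paper.

```python
def false_discovery_proportion(selected, nulls):
    """Fraction of selected indices that are null, with the convention 0/0 = 0."""
    selected = set(selected)
    if not selected:
        return 0.0  # no discoveries: 0/0 is defined as 0
    return len(selected & set(nulls)) / len(selected)


# Hypothetical example: ten selected variables, one of which (index 9) is null.
selected = range(10)
nulls = {9, 10, 11}
fdp = false_discovery_proportion(selected, nulls)
```

In this toy case one of ten discoveries is false, which matches the reading "90% of which I expect to be important" when the FDR is controlled at $10\%$.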

1.2 Our contribution

In our discussion of approaches to this problem, we will draw on a fundamental decomposition of the joint distribution $F_{Y,X}$ of $(Y,X_{1},\dots,X_{p})$ into the product of the conditional distribution $F_{Y|X}$ of $Y\,|\,X_{1},\dots,X_{p}$ and the joint distribution $F_{X}$ of $X_{1},\dots,X_{p}$. The canonical approach to inference, which we refer to as the ‘fixed-X’ approach, assumes $F_{Y|X}$ is a member of a parametric family of conditional distributions (e.g., a GLM), while placing weak or no assumptions on $F_{X}$. In fact, the fixed-X approach usually treats the observed values of $X_{i,1},\dots,X_{i,p}$ for $i=1,\dots,n$ as fixed; that is, it performs inference conditionally on the observed values of $X_{1},\dots,X_{p}$ in the data, which also allows the covariate rows to be drawn from different distributions or even be deterministic (fixed). The approach proposed in Candès et al. (2018), referred to therein as the ‘model-X’ approach, assumes the observations $(Y_{i},X_{i,1},\dots,X_{i,p})\stackrel{\text{i.i.d.}}{\sim}F_{Y,X}$ and places no restrictions on $F_{X}$ but assumes it is known exactly, while assuming nothing about $F_{Y|X}$. So, to summarize slightly imprecisely, the canonical, fixed-X approach to inference places all assumptions on $F_{Y|X}$ and none on $F_{X}$, while the model-X approach does the opposite by placing all assumptions on $F_{X}$ and none on $F_{Y|X}$.

Note that both $F_{Y|X}$ and $F_{X}$ are exponentially complex in $p$: in the simple case where each element of $(Y,X_{1},\dots,X_{p})$ is categorical with $k$ categories, i.e., $(Y,X_{1},\dots,X_{p})\in\{1,\dots,k\}^{p+1}$, it is easily seen that a fully general model for $F_{Y|X}$ has $(k-1)k^{p}$ free parameters while $F_{X}$ has only slightly fewer with $k^{p}-1$. So both fixed-X and model-X approaches astronomically reduce an exponentially large (in $p$) space of distributions in order to make inference feasible, highlighting the importance of robustness, assumption-checking, and domain knowledge for justifying the resulting inference; see Janson (2017, Chapter 1) for a detailed discussion of the role of fixed-X and model-X assumptions (footnote 2: therein referred to as ‘model-based’ and ‘model-free’, respectively) in high-dimensional inference. With that said, one apparent advantage of the fixed-X approach is that it does not require exact knowledge of $F_{Y|X}$, while the model-X approach of Candès et al. (2018) does require $F_{X}$ be known exactly.
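As a quick arithmetic check of these counts, the sketch below evaluates both formulas for the hypothetical values $k=3$ (e.g., minor allele counts 0, 1, 2) and $p=10$. The function names are illustrative and the values are chosen only to show the scale.

```python
def n_params_conditional(k, p):
    """Free parameters of a fully general F_{Y|X}.

    Each of the k**p covariate configurations gets its own distribution over
    the k values of Y, and each such distribution has k - 1 free parameters.
    """
    return (k - 1) * k**p


def n_params_joint(k, p):
    """Free parameters of a fully general F_X: k**p cell probabilities that sum to one."""
    return k**p - 1


k, p = 3, 10  # hypothetical values
conditional = n_params_conditional(k, p)  # 2 * 3**10
joint = n_params_joint(k, p)              # 3**10 - 1
```

Even at this small $p$, both unrestricted models need tens of thousands of parameters, which is why some structural assumption on one of the two factors is unavoidable.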

The present paper removes this apparent advantage by showing that model-X knockoffs can still provide powerful and exact, finite-sample inference even when the covariate distribution is only known up to a parameterized family of distributions (also known as a model), as opposed to known exactly. In fact, in Section 3 we will show examples in which the number of parameters we allow for $F_{X}$ ’s model is $\Omega(n^{*}p)$ , where $n^{*}$ is the total number of samples of $X$ (including unlabeled samples), which is always at least as large as the number of labeled samples $n$ , and can be much larger in some applications. This is much greater than the number of parameters allowed in the model for $F_{Y|X}$ in fixed-X inference (see Section 1.3). Table 1 provides a summarized comparison of the model flexibility allowed in the fixed-X and model-X approaches.

Of course the above discussion and table refer only to the mathematical complexity of models allowed by the fixed-X and model-X approaches. An analyst’s decision between them should depend on how well domain knowledge and/or auxiliary data support their (very different) assumptions. But in light of Table 1, it seems the conditional model-X approach is easiest to justify unless substantially more is known about $F_{Y|X}$ than $F_{X}$ .

1.3 Related work

By far the most common fixed-X approaches to inference rely on GLMs with $p$ parameters, reducing model complexity from exponential to linear in $p$. When $p$ is smaller than the number of observations $n$, inference for GLMs other than Gaussian linear models relies on large-sample approximation by assuming at least $p/n\rightarrow 0$ (Huber, 1973; Portnoy, 1985). Note that the commonly studied problem of inference for a single parameter can generally be translated to FDR control using the Benjamini–Hochberg (Benjamini and Hochberg, 1995) or Benjamini–Yekutieli (Benjamini and Yekutieli, 2001) procedures (see, e.g., Javanmard and Javadi (2018)), so that it makes sense to compare such inference with our paper that is focused on multiple testing. In high dimensions, i.e., when $p>n$, even reducing the complexity of $F_{Y|X}$ to $p$ parameters with a GLM is insufficient for fixed-X inference, as GLMs become unidentifiable in this regime due to the design matrix columns being linearly dependent. Early solutions for fixed-X inference in high-dimensional GLMs relied on $\beta$-min conditions that lower-bound the magnitude of nonzero coefficients to obtain asymptotically-valid p-values for individual variables (see, e.g., Chatterjee and Lahiri (2013)). More recent work removes the $\beta$-min condition in favor of strong sparsity assumptions on the coefficient vector, usually $o(\sqrt{n}/\log(p))$ nonzeros, with notable examples including the debiased Lasso (see, e.g., Zhang and Zhang (2014); Javanmard and Montanari (2014); van de Geer et al. (2014)) and the extended score statistic (see, e.g., Belloni et al. (2014, 2015); Chernozhukov et al. (2015); Ning and Liu (2017)), both of which provide asymptotically-valid p-values for GLMs with some additional assumptions on the ‘compatibility’ of the design matrix. In recent work that seems to straddle the fixed-X and model-X paradigms, Zhu and Bradic (2018) and Zhu et al. (2018) compute asymptotically-valid p-values for the Gaussian linear model without any extra restrictions like sparsity or $\beta$-min on $F_{Y|X}$, but with added assumptions on $F_{X}$ about the sparsity of conditional linear dependence among covariates.

Another branch of recent research called post-selection inference can be viewed as a different approach to high-dimensional inference: it aims to test random hypotheses selected by a high-dimensional regression and provide valid p-values by conditioning on the selection event (see, e.g., Fithian et al. (2014); Lee et al. (2016) for foundational contributions, and Candès et al. (2018, Appendix A) for more about the difference between post-selection inference and our approach).

The method of knockoffs was first introduced by Barber and Candès (2015) for low-dimensional homoscedastic linear regression with fixed design. The model-X knockoffs framework proposed by Candès et al. (2018) approached this idea from a different perspective, providing valid finite-sample inference with no assumptions on $F_{Y|X}$ but assuming full knowledge of $F_{X}$. Exact knockoff generation methods have been found for $F_{X}$ following a multivariate Gaussian (Candès et al., 2018), a Markov chain or hidden Markov model (Sesia et al., 2018), a graphical model (Bates et al., 2020), and certain latent variable models (Gimenez et al., 2018). In the case that $F_{X}$ is only known approximately, the robustness of model-X knockoffs is studied by Barber et al. (2018). When $F_{X}$ is completely unknown, some recent works have proposed methods to generate approximate knockoffs (Jordon et al., 2019; Romano et al., 2019; Liu and Zheng, 2018) which have shown promising empirical results, particularly in low-dimensional problems, but come with no theoretical guarantees. In contrast, the current paper proposes to construct valid knockoffs that provide exact finite-sample error control.

This paper is based on the idea of performing inference conditional on a sufficient statistic for $F_{X}$’s model so as to make that inference parameter-free. In low-dimensional inference, likely the simplest example of such an idea is a permutation test for independence, which can be thought of as a randomization test performed conditional on the order statistics of an observed i.i.d. vector of scalar $X$ with unknown distribution (the order statistics are sufficient for the family of all one-dimensional distributions). Although permutation tests can only test marginal independence, not conditional independence as addressed in the present paper, Rosenbaum (1984) constructs a conditional permutation test that does test conditional independence assuming a logistic regression model for $X_{j}\mid X_{\text{-}j}$, and allows the parameters of the logistic regression model to be unknown by conditioning on that model’s sufficient statistic. However, that sufficient statistic is composed of inner products between the vector of observed $X_{j}$’s and each of the vectors of observed values of the other covariates $X_{\text{-}j}$, precluding inference except in the case of covariates with a very small set of discrete values, and almost entirely precluding inference in a high-dimensional setting (footnote 5: see the paragraph preceding Rosenbaum (1984, Theorem 1) for a description of the test’s limitations). A different conditional permutation test was recently proposed by Berrett et al. (2018) to test conditional independence in the model-X framework, but while their conditioning improves robustness, they still require the same assumptions as the original conditional randomization test (Candès et al., 2018), namely, that $X_{j}\mid X_{\text{-}j}$ is known exactly. To our knowledge, the present paper is the first to use the idea of conditioning on sufficient statistics for high-dimensional inference, enabling powerful and exact FDR-controlled variable selection under arguably weaker assumptions than any existing work.

1.4 Outline

The rest of the paper is structured as follows: Section 2 describes the main result and the proposed method of conditional knockoffs to generalize model-X knockoffs to the case when $F_{X}$ is known only up to a distributional family, as opposed to exactly. Section 3 applies conditional knockoffs to three different models for $F_{X}$ , and provides explicit algorithms for constructing valid knockoffs. Simulations are also presented, showing that conditional knockoffs often loses almost no power in exchange for its increased generality over model-X knockoffs with exactly-known $F_{X}$ . Finally, Section 4 provides some synthesis of the ideas in this paper and directions for future work.

2 Main Idea and General Principles

Before going into more detail, we introduce some notation. Suppose we are given i.i.d. row vectors $(Y_{i},X_{i,1},\dots,X_{i,p})\in\mathbb{R}^{p+1}$ for $i=1,\dots,n$ . We then stack these vectors into a design matrix $\bm{X}\in\mathbb{R}^{n\times p}$ whose $i$ th row is denoted by $\bm{x}_{i}^{\top}=(X_{i,1},\dots,X_{i,p})\in\mathbb{R}^{p}$ , and a column vector $\bm{y}\in\mathbb{R}^{n}$ whose $i$ th entry is $Y_{i}$ . We are about to define model-X knockoffs $(\tilde{X}_{i,1},\dots,\tilde{X}_{i,p})$ , and $\tilde{\bm{X}}\in\mathbb{R}^{n\times p}$ will analogously denote these row vectors stacked to form a knockoff design matrix. A square bracket around matrices, such as $[\bm{X},\tilde{\bm{X}}]$ , denotes the horizontal concatenation of these matrices. We use $[p]$ for $\{1,2,\dots,p\}$ , and $i:j$ for $\{i,i+1,\dots,j\}$ for any $i\leq j$ ; for a set $A\subseteq[p]$ , let $\bm{X}_{A}$ denote the matrix with columns given by the columns of $\bm{X}$ whose indices are in $A$ , and for singleton sets we streamline notation by writing $\bm{X}_{j}$ instead of $\bm{X}_{\{j\}}$ . For sets $A_{1},\dots,A_{m}$ , denote by $\prod_{j=1}^{m}A_{j}$ their Cartesian product. For two disjoint sets $A$ and $B$ , we denote their union by $A\uplus B$ . We will denote by $\mathbb{N}$ the set of strictly positive integers.

2.1 Model-X Knockoffs

We begin with a short review of model-X knockoffs (Candès et al., 2018). The authors define model-X knockoffs for a random vector $X\in\mathbb{R}^{p}$ of covariates as being a random vector $\tilde{X}\in\mathbb{R}^{p}$ such that for any set $A\subseteq[p]$

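$$\tilde{X}\perp\!\!\!\perp Y\mid X,\text{ and }(X,\tilde{X})_{\text{swap}(A)}\stackrel{\mathcal{D}}{=}(X,\tilde{X}),$$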

where the swap( $A$ ) subscript on a $2p$ -dimensional vector (or matrix with $2p$ columns) denotes that vector (matrix) with the $j$ th and $(j+p)$ th entries (columns) swapped, for all $j\in A$ . To use knockoffs for variable selection, suppose some statistics $Z_{j}$ and $\tilde{Z}_{j}$ are used to measure the importance of $X_{j}$ and $\tilde{X}_{j}$ , respectively, in the conditional distribution $Y\mid X_{1},\dots,X_{p},\tilde{X}_{1},\dots,\tilde{X}_{p}$ , with

$$(Z_{1},\dots,Z_{p},\tilde{Z}_{1},\dots,\tilde{Z}_{p})=z([\bm{X},\tilde{\bm{X}}],\bm{y})$$

for some function $z$ such that swapping $\bm{X}_{j}$ and $\tilde{\bm{X}}_{j}$ swaps the components $Z_{j}$ and $\tilde{Z}_{j}$ , i.e., for any $A\subseteq[p]$ ,

$$z([\bm{X},\tilde{\bm{X}}]_{\text{swap}(A)},\bm{y})=z([\bm{X},\tilde{\bm{X}}],\bm{y})_{\text{swap}(A)}.$$

For example, $z([\bm{X},\tilde{\bm{X}}],\bm{y})$ could perform a cross-validated Lasso regression of $\bm{y}$ on $[\bm{X},\tilde{\bm{X}}]$ and return the absolute values of the $2p$ -dimensional fitted coefficient vector. More generally the $Z_{j}$ can be almost any measure of variable importance one can think of, including measures derived from arbitrarily-complex machine learning methods or from Bayesian inference, and this flexibility allows model-X knockoffs to be powerful even when $F_{Y|X}$ is quite complex.

The pairs $(Z_{j},\tilde{Z}_{j})$ of variable importance measures are then plugged into scalar-valued antisymmetric functions $f_{j}$ to produce $W_{j}=f_{j}(Z_{j},\tilde{Z}_{j})$ , which measures the relative importance of $X_{j}$ to $\tilde{X}_{j}$ . Viewed as a function of all the data, $W_{j}=w_{j}([\bm{X},\tilde{\bm{X}}],\bm{y})$ can be shown to satisfy the flip-sign property, which dictates that for any $A\subseteq[p]$ ,

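$$w_{j}([\bm{X},\tilde{\bm{X}}]_{\text{swap}(A)},\bm{y})=\begin{cases}w_{j}([\bm{X},\tilde{\bm{X}}],\bm{y}),&j\notin A,\\-w_{j}([\bm{X},\tilde{\bm{X}}],\bm{y}),&j\in A.\end{cases}$$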
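To make the swap and flip-sign properties concrete, the following Python sketch checks them numerically. It uses a deliberately simple importance measure, the absolute inner product $|\bm{X}_{j}^{\top}\bm{y}|$, rather than the cross-validated Lasso of the example above, together with $W_{j}=Z_{j}-\tilde{Z}_{j}$. The function names are hypothetical, and the matrix standing in for $[\bm{X},\tilde{\bm{X}}]$ is random noise rather than a valid knockoff pair, since only the algebraic property is being illustrated.

```python
import numpy as np

def swap_columns(XXk, A):
    """Return [X, Xk]_swap(A): columns j and j+p exchanged for every j in A."""
    p = XXk.shape[1] // 2
    out = XXk.copy()
    for j in A:
        out[:, [j, j + p]] = XXk[:, [j + p, j]]
    return out

def importance_stats(XXk, y):
    """Z-statistics for all 2p columns (illustrative choice: |column' y|)."""
    return np.abs(XXk.T @ y)

def w_stats(XXk, y):
    """Antisymmetric combination W_j = Z_j - Z~_j."""
    p = XXk.shape[1] // 2
    Z = importance_stats(XXk, y)
    return Z[:p] - Z[p:]

rng = np.random.default_rng(0)
n, p = 30, 4
XXk = rng.standard_normal((n, 2 * p))  # stand-in for [X, Xk], not valid knockoffs
y = rng.standard_normal(n)
A = [0, 2]                             # 0-based indices of the swapped pairs

W = w_stats(XXk, y)
W_swapped = w_stats(swap_columns(XXk, A), y)

# Expected under the flip-sign property: W_j changes sign exactly for j in A.
signs = np.ones(p)
signs[A] = -1.0
flip_sign_holds = np.allclose(W_swapped, signs * W)
```

Any $z$ that treats the $2p$ columns symmetrically, combined with an antisymmetric $f_{j}$, passes the same check.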

Taking $Z_{j}$ and $\tilde{Z}_{j}$ as the absolute values of Lasso coefficients as in the above example, one might choose $W_{j}=Z_{j}-\tilde{Z}_{j}$, referred to in Candès et al. (2018) as the Lasso coefficient-difference (LCD) statistic. Finally, given a target FDR level $q$, the knockoff filter selects the variables $\hat{S}=\{j:W_{j}\geq T\}$, where $T$ is either the knockoff threshold $T_{0}$ or the knockoff+ threshold $T_{+}$:

[TABLE]

Candès et al. (2018, Theorem 3.4) prove that $\hat{S}$ with $T_{+}$ exactly (non-asymptotically) controls the FDR at level $q$, and that $\hat{S}$ with $T_{0}$ exactly controls a modified FDR, $\mathbb{E}\left[\frac{|\hat{S}\,\cap\,\mathcal{H}_{0}|}{|\hat{S}|+1/q}\right]$, at level $q$. The key to the proof of exact control is the aforementioned flip-sign property of the $W_{j}$, and that property follows from the following crucial property of model-X knockoffs: for any subset $A\subseteq\mathcal{H}_{0}$,

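$$([\bm{X},\tilde{\bm{X}}]_{\text{swap}(A)},\bm{y})\,\stackrel{\mathcal{D}}{=}\,([\bm{X},\tilde{\bm{X}}],\bm{y}),$$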

which is proved in Candès et al. (2018, Lemma 3.2) to hold for knockoffs satisfying Equation (2.1).
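The selection step can be sketched in a few lines of Python. The sketch assumes the standard threshold definitions from the knockoffs literature, namely $T_{0}=\min\{t>0:\#\{j:W_{j}\leq-t\}/\#\{j:W_{j}\geq t\}\leq q\}$, with $T_{+}$ obtained by adding 1 to the numerator. The function name, the guard against an empty denominator, and the toy vector `W` are illustrative assumptions.

```python
import numpy as np

def knockoff_threshold(W, q, plus=True):
    """Smallest t among the nonzero |W_j| whose estimated FDP is at most q.

    plus=True adds 1 to the numerator (knockoff+ threshold T_+);
    plus=False gives the plain knockoff threshold T_0.
    Returns np.inf when no candidate t qualifies, so nothing is selected.
    """
    W = np.asarray(W, dtype=float)
    offset = 1.0 if plus else 0.0
    for t in np.sort(np.abs(W[W != 0])):
        numerator = offset + np.sum(W <= -t)
        denominator = max(np.sum(W >= t), 1)  # guard against division by zero
        if numerator / denominator <= q:
            return t
    return np.inf

def knockoff_select(W, q, plus=True):
    """Indices j with W_j >= T."""
    T = knockoff_threshold(W, q, plus=plus)
    return np.flatnonzero(np.asarray(W, dtype=float) >= T)

# Toy statistics, purely illustrative.
W = [3.0, 2.5, -0.5, 2.0, 1.5, -1.0, 4.0, 0.2]
T_plus = knockoff_threshold(W, q=0.25, plus=True)
T_zero = knockoff_threshold(W, q=0.25, plus=False)
selected = knockoff_select(W, q=0.25, plus=True)
```

On this toy input $T_{+}=1.5$ is larger than $T_{0}=1.0$, reflecting that the knockoff+ threshold is the more conservative of the two.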

The proofs of exact control required just one assumption, that one could construct knockoffs satisfying Equation (2.1). To satisfy that assumption, Candès et al. (2018) assume throughout that $F_{X}$ is known exactly. We will relax this assumption, but first slightly generalize the definition of valid knockoffs:

Definition 2.1 (Model-X knockoff matrix).

The random matrix $\tilde{\bm{X}}\in\mathbb{R}^{n\times p}$ is a model-X knockoff matrix for the random matrix $\bm{X}\in\mathbb{R}^{n\times p}$ if for any subset $A\subseteq[p]$ ,

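$$\tilde{\bm{X}}\perp\!\!\!\perp\bm{y}\mid\bm{X}\quad\text{and}\quad[\bm{X},\tilde{\bm{X}}]_{\text{swap}(A)}\,\stackrel{\mathcal{D}}{=}\,[\bm{X},\tilde{\bm{X}}].\tag{2.2}$$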

Note that Equation (2.2) is more general than Equation (2.1), and indeed (2.1) implies (2.2) as long as the rows of $[\bm{X},\,\tilde{\bm{X}}]$ are independent. However, the proof of the crucial Lemma 3.2 of Candès et al. (2018) and, ultimately, FDR control in the form of their Theorem 3.4 used only Equation (2.2). Therefore Definition 2.1 is the ‘correct’ definition, since the ability to generate knockoffs satisfying Definition 2.1 is all that is needed for the theoretical guarantees of knockoffs in Candès et al. (2018) to hold, and it is well-defined for any matrix $\bm{X}$, even when the rows are not independent. We will use this general definition because although we also assume samples are drawn i.i.d. from a distribution, those samples will no longer be independent when we condition on a sufficient statistic for the model for $F_{X}$. Hereafter, model-X knockoffs and knockoffs will always refer to model-X knockoff matrices as defined by Definition 2.1 unless otherwise specified.

For completeness, we restate the FDR control theorem in Candès et al. (2018).

Theorem 2.1.

Suppose $\tilde{\bm{X}}$ is a knockoff matrix for $\bm{X}$ and the statistics $W_{j}$ satisfy the flip-sign property. For any $q\in[0,1]$, if $\hat{S}$ is selected by the knockoff method with threshold $T$ being either $T_{+}$ or $T_{0}$, then

[TABLE]

It is worth mentioning that if $\tilde{\bm{X}}_{j}$ is identical to $\bm{X}_{j}$ , then $W_{j}=0$ and $j$ cannot be selected by the knockoff filter. Formally, we call such a column in the knockoff matrix trivial.

2.2 Conditional Knockoffs

The main idea of this paper is that if $F_{X}$ is known only up to a parametric model, and that parametric model has sufficient statistic (for $n$ i.i.d. observations drawn from $F_{X}$ ) given by $T(\bm{X})$ , then by definition of sufficiency the distribution of $\bm{X}\,|\,T(\bm{X})$ does not depend on the model parameters and is thus known exactly a priori. To leverage this for knockoffs, consider the following definition.

Definition 2.2 (Conditional model-X knockoff matrix).

The random matrix $\tilde{\bm{X}}\in\mathbb{R}^{n\times p}$ is a conditional model-X knockoff matrix for the random matrix $\bm{X}\in\mathbb{R}^{n\times p}$ if there is a statistic $T(\bm{X})$ such that for any subset $A\subseteq[p]$ ,

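$$\tilde{\bm{X}}\perp\!\!\!\perp\bm{y}\mid\bm{X}\quad\text{and}\quad[\bm{X},\tilde{\bm{X}}]_{\text{swap}(A)}\,\stackrel{\mathcal{D}}{=}\,[\bm{X},\tilde{\bm{X}}]\;\Big|\;T(\bm{X}).\tag{2.3}$$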

By the law of total probability, (2.3) implies (2.2); thus conditional model-X knockoffs are also model-X knockoffs:

Proposition 2.2.

If $\tilde{\bm{X}}$ is a conditional model-X knockoff matrix for $\bm{X}$ , then it is also a model-X knockoff matrix.

Proposition 2.2 says that all the guarantees of model-X knockoffs (i.e., Theorem 2.1), such as exact FDR control and the flexibility in measuring variable importance, immediately hold more generally when $\tilde{\bm{X}}$ is a conditional model-X knockoff matrix. Definition 2.2 is especially useful when the distribution of $\bm{X}$ is known to be in a model $G_{\bm{\Theta}}=\{g_{\bm{\theta}}:\bm{\theta}\in\bm{\Theta}\}$ with parameter space $\bm{\Theta}$, and $T(\bm{X})$ is a sufficient statistic for $G_{\bm{\Theta}}$, because then the distribution of $\bm{X}\mid T(\bm{X})$ is known exactly even though the unconditional distribution of $\bm{X}$ is not. Exact knowledge of the distribution of $\bm{X}\mid T(\bm{X})$ in principle allows us to construct knockoffs, similar to how exact knowledge of the unconditional distribution of $\bm{X}$ has enabled all previous knockoff construction algorithms. As a simple example, when $G_{\bm{\Theta}}$ is the set of all $p$-dimensional distributions with mutually-independent entries, the set of order statistics for each column of $\bm{X}$ constitutes a sufficient statistic $T(\bm{X})$, and a conditional knockoff matrix $\tilde{\bm{X}}$ can be generated by randomly and independently permuting each column of $\bm{X}$. Unfortunately, for more interesting models that allow for dependence among the covariates, even for canonical $G_{\bm{\Theta}}$ like multivariate Gaussian, the distribution of $\bm{X}\mid T(\bm{X})$ is often much more complex than those for which knockoff constructions already exist. Using novel methodological and theoretical tools, in Section 3 we provide efficient and exact algorithms for constructing nontrivial conditional knockoffs when $F_{X}$ comes from each of the following three models:

1. Low-dimensional Gaussian:

[TABLE]

when $n>2p$. In this case, the number of model parameters is $p+\frac{p(p+1)}{2}=\Omega(p^{2})$, and also $\Omega(np)$ in the most challenging case when $p=\Omega(n)$.

2. Gaussian graphical model:

[TABLE]

for some known sparsity pattern $E$. For example, $\bm{\Sigma}^{-1}$ could be banded with bandwidth as large as $n/8-1$ (here we assume $n/8\leq p$), allowing a number of parameters as large as $p+\left(\frac{np}{8}-\frac{n(n-8)}{128}\right)=\Omega(np)$. Note that $p$ is not explicitly constrained, so this model allows both low- and high-dimensional data sets.

3. Discrete graphical model:

[TABLE]

for some known positive integers $K_{1},\dots,K_{p}$ and known sparsity pattern $E$ , where $N_{E}(j)$ is the closed neighborhood of $j$ . For example, $X$ could be a $K$ -state (non-stationary) Markov chain whose $K-1+(p-1)K(K-1)$ parameters are the probability mass function of $X_{1}$ and the transition matrices $\mathbb{P}\left(\left.X_{j}\ \right|X_{j-1}\right)$ for each $j\in\{2,\dots,p\}$ , where $K$ can be as large as $\sqrt{\frac{n-2}{2}}$ , allowing a number of parameters as large as $\sqrt{\frac{n-2}{2}}-1+(p-1)\left(\sqrt{\frac{n-2}{2}}\right)\left(\sqrt{\frac{n-2}{2}}-1\right)=\Omega(np)$ . Again, $p$ is not explicitly constrained, so this model allows both low- and high-dimensional data sets.
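The simple independent-entries example given just before the list of three models (order statistics as the sufficient statistic, column-wise permutation as the knockoff sampler) can be sketched as follows. The function name is hypothetical, and the construction is only valid under the assumption that the entries of each covariate vector are mutually independent.

```python
import numpy as np

def permutation_knockoffs(X, rng):
    """Conditional knockoffs when the columns of X are mutually independent.

    Each column is permuted independently, so the per-column order
    statistics (the sufficient statistic T(X)) are left unchanged.
    """
    n, p = X.shape
    Xk = np.empty_like(X)
    for j in range(p):
        Xk[:, j] = X[rng.permutation(n), j]
    return Xk

rng = np.random.default_rng(0)
X = rng.exponential(size=(8, 3))  # any independent-entries distribution works
Xk = permutation_knockoffs(X, rng)

# The sorted columns of X and Xk coincide, i.e. T(Xk) = T(X).
same_order_stats = np.array_equal(np.sort(X, axis=0), np.sort(Xk, axis=0))
```

No parameter of the column distributions is ever estimated here, which is the sense in which conditioning removes the need to know $F_{X}$.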

Remark 1.

It is worth mentioning that conditioning may shrink the set of nonnull hypotheses. For instance, if $\mathcal{H}_{0}=\emptyset$ and $T(\bm{X})$ is chosen to be $\bm{X}$ , then all variables are automatically null conditional on $T(\bm{X})$ , and thus conditional knockoffs cannot select any nonnull variables. For a detailed discussion, see Appendix C.

Remark 2.

Any algorithm that generates conditional knockoffs given one sufficient statistic $T(\bm{X})$ (i.e., satisfying Equation (2.3) for $T(\bm{X})$ ) by definition is also a valid algorithm for generating conditional knockoffs given any sufficient statistic $S(\bm{X})$ that is a function of $T(\bm{X})$ . This means that any valid conditional knockoff algorithm satisfies Equation (2.3) for the minimal sufficient statistic, since by definition a minimal sufficient statistic is a function of any other sufficient statistic. So we could say that the minimal sufficient statistic is in some sense the optimal one to condition on, in that the choice to condition on the minimal sufficient statistic allows for the most general set of conditional knockoff algorithms of any sufficient statistic one could choose to condition on for a given model.

2.3 Integrating Unlabeled Data

In addition to the $n$ labeled pairs $\left\{(Y_{i},\bm{x}_{i})\right\}_{i=1}^{n}$ , we might also have unlabeled data $\{\bm{x}_{i}^{(u)}\}_{i=1}^{n^{(u)}}$ , i.e., covariate samples without corresponding responses/labels. This extra data can be integrated seamlessly into the construction of conditional knockoffs: stack the labeled covariate matrix $\bm{X}$ on top of the unlabeled covariate matrix $\bm{X}^{(u)}$ to get $\bm{X}^{*}\in\mathbb{R}^{n^{*}\times p}$ , where $n^{*}=n+n^{(u)}$ , then construct conditional knockoffs $\tilde{\bm{X}}^{*}$ for $\bm{X}^{*}$ , and finally take $\tilde{\bm{X}}$ to be the first $n$ rows of $\tilde{\bm{X}}^{*}$ .

Proposition 2.3.

Suppose the rows of $\bm{X}^{*}$ are i.i.d. covariate vectors and $\bm{X}$ is the matrix composed of the first $n$ rows of $\bm{X}^{*}$ . Let $\bm{y}$ be the response vector for $\bm{X}$ . If for some statistic $T(\bm{X}^{*})$ and any set $A\subseteq[p]$ ,

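$$\tilde{\bm{X}}^{*}\perp\!\!\!\perp\bm{y}\mid\bm{X}^{*}\quad\text{and}\quad[\bm{X}^{*},\tilde{\bm{X}}^{*}]_{\text{swap}(A)}\,\stackrel{\mathcal{D}}{=}\,[\bm{X}^{*},\tilde{\bm{X}}^{*}]\;\Big|\;T(\bm{X}^{*}),$$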

then the matrix $\tilde{\bm{X}}$ composed of the first $n$ rows of $\tilde{\bm{X}}^{*}$ is a model-X knockoff matrix for $\bm{X}$.

Note that by taking $T(\bm{X}^{*})$ to be constant, the same result holds unconditionally: if $\tilde{\bm{X}}^{*}\perp\!\!\!\perp\bm{y}\,|\,\bm{X}^{*}$ and $[\bm{X}^{*},\,\tilde{\bm{X}}^{*}]_{\text{swap}(A)}\,\stackrel{\mathcal{D}}{=}\,[\bm{X}^{*},\,\tilde{\bm{X}}^{*}]$ for any $A\subseteq[p]$, then $\tilde{\bm{X}}$ is a valid knockoff matrix for $\bm{X}$. Thus constructing knockoffs for $\bm{X}^{*}$, conditional or otherwise, produces valid knockoffs for $\bm{X}$ automatically. Of course, if $F_{X}$ is known and the rows of $\bm{X}^{*}$ are i.i.d., it is natural to construct each row of $\tilde{\bm{X}}^{*}$ independently, in which case the presence of $\bm{X}^{(u)}$ changes nothing about the construction of the relevant knockoffs $\tilde{\bm{X}}$. But as seen in Section 2.2, when $F_{X}$ is not known exactly the flexibility with which we can model it depends on the sample size, with the number of parameters allowed to be as large as $\Omega(np)$ in all the models in this paper. What Proposition 2.3 shows is that $n$ can be replaced with $n^{*}$, which can dramatically increase the modeling flexibility allowed by conditional knockoffs, especially in high dimensions. For example, our conditional knockoffs construction in Section 3.1 for arbitrary multivariate Gaussian distributions naively requires $n>2p$, but we now see it actually just requires $n^{*}>2p$, which is much easier to satisfy when $n^{(u)}$ is large, as it often is in, for instance, genomics or economics applications. Even when $n$ alone is large enough to construct nontrivial knockoffs for a desired model, constructing conditional knockoffs with unlabeled data as described in this section will tend to increase power.
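The stack, construct, truncate recipe of this section is mechanical enough to sketch in Python. The wrapper below accepts any conditional knockoff sampler as a callable. The names `knockoffs_with_unlabeled` and `column_permutation_sampler` are hypothetical, and the permutation sampler used in the demonstration is only a stand-in that is valid under an independent-entries model.

```python
import numpy as np

def knockoffs_with_unlabeled(X, X_unlabeled, sampler, rng):
    """Build knockoffs for the labeled rows X using extra unlabeled rows.

    sampler(X_star, rng) must return a conditional knockoff matrix for the
    stacked matrix X_star; only its first n rows are kept.
    """
    n = X.shape[0]
    X_star = np.vstack([X, X_unlabeled])  # n* = n + n^(u) rows
    Xk_star = sampler(X_star, rng)
    return Xk_star[:n]

def column_permutation_sampler(X_star, rng):
    """Stand-in sampler: valid only if covariate entries are independent."""
    n_star, p = X_star.shape
    return np.column_stack([X_star[rng.permutation(n_star), j] for j in range(p)])

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))    # labeled covariates
X_u = rng.standard_normal((7, 3))  # unlabeled covariates
Xk = knockoffs_with_unlabeled(X, X_u, column_permutation_sampler, rng)
```

The knockoff rows for the labeled samples may draw on values from the unlabeled rows, which is how the extra samples enter the construction.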

3 Conditional Knockoffs for Three Models of Interest

In this section, we provide efficient algorithms to generate exact conditional model-X knockoffs under three different models for $F_{X}$ , as well as numerical simulations comparing the variable selection power of the knockoffs thus constructed with those constructed by existing algorithms that require $F_{X}$ be known exactly.

All proofs are deferred to Appendix A. Any sampling described in the algorithms is conducted independently of all previous sampling in the same algorithm, unless stated otherwise. All simulations use a Gaussian linear model for the response: $Y_{i}\mid\bm{x}_{i}\sim\mathcal{N}(\frac{1}{\sqrt{n}}\bm{x}_{i}^{\top}\bm{\beta},1)$ where $\bm{\beta}$ has 60 non-zero entries with random signs and equal amplitudes. Note the sparsity and magnitude equalities are simply chosen for convenience—we present additional simulations varying these choices in Appendix D.2.

We remind the reader that, although we use linear regression as an illustrative example in the simulations, our methods apply to more general regressions, and all the same simulations are also rerun with a nonlinear model (logistic regression) with similar results, presented in Appendix D.1. We use the LCD knockoff statistic with tuning parameter chosen by 10-fold cross-validation and the knockoff+ threshold with target FDR $q=20\%$; see Section 2.1 for details. Only power curves (power $=\mathbb{E}\left[\frac{|S\cap\hat{S}|}{|S|}\right]$) are shown because the FDR is always controlled (both theoretically and empirically). The procedure we compare to, unconditional knockoffs, refers to model-X knockoffs where $F_{X}$ is taken to be known exactly (knockoff statistics and thresholds are chosen identically).
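A minimal sketch of the response model used in the simulations, together with the empirical counterpart of the power definition, is given below. The function names are hypothetical, the covariates in the demonstration are plain i.i.d. Gaussians, and the amplitude 3.5 is an arbitrary illustrative value rather than one taken from the experiments.

```python
import numpy as np

def simulate_response(X, amplitude, rng, k=60):
    """Draw y_i ~ N(x_i' beta / sqrt(n), 1), beta having k nonzeros of equal size."""
    n, p = X.shape
    beta = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)
    beta[support] = amplitude * rng.choice([-1.0, 1.0], size=k)  # random signs
    y = X @ beta / np.sqrt(n) + rng.standard_normal(n)
    return y, beta

def power_fraction(selected, beta):
    """|S intersect S_hat| / |S| for one run, S being the support of beta."""
    true_set = set(np.flatnonzero(beta).tolist())
    return len(true_set & set(int(j) for j in selected)) / len(true_set)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 200))
y, beta = simulate_response(X, amplitude=3.5, rng=rng)

# Selecting half of the true support gives a power fraction of 0.5.
half_support = np.flatnonzero(beta)[:30]
half_power = power_fraction(half_support, beta)
```

Averaging `power_fraction` over independent replications estimates the expectation in the power definition.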

3.1 Low-Dimensional Multivariate Gaussian Model

Despite the focus in variable selection on high-dimensional problems, we start with a low-dimensional example as it represents an interesting and instructive case. Suppose that

$$\bm{x}_{i}\stackrel{\text{i.i.d.}}{\sim}\mathcal{N}(\bm{\mu},\bm{\Sigma}),\quad i=1,\dots,n,\tag{3.1}$$

for some unknown $\bm{\mu}$ and positive definite $\bm{\Sigma}$ . Let $\hat{\bm{\mu}}:=\bm{X}^{\top}\bm{1}_{n}/n$ denote the vector of column means of $\bm{X}$ , and let $\hat{\bm{\Sigma}}:=(\bm{X}-\bm{1}_{n}\hat{\bm{\mu}}^{\top})^{\top}(\bm{X}-\bm{1}_{n}\hat{\bm{\mu}}^{\top})/n$ be the empirical covariance matrix of $\bm{X}$ . Then $T(\bm{X})=(\hat{\bm{\mu}},\hat{\bm{\Sigma}})$ constitutes a (minimal, complete) sufficient statistic for the model (3.1) for $\bm{X}$ .

3.1.1 Generating Conditional Knockoffs

When $n>2p$ , we can construct knockoffs for $\bm{X}$ conditional on $\hat{\bm{\mu}}$ and $\hat{\bm{\Sigma}}$ via Algorithm 1.

In Algorithm 1, $n>2p$ is needed because in Line 3 the $n\times(2p+1)$ matrix $\left[\bm{1}_{n},\,\bm{X},\,\bm{W}\right]$ must have at least as many rows as columns to be a valid input to the Gram–Schmidt orthonormalization algorithm. The astute reader may notice a strong similarity between Equation (3.2) and the fixed-X knockoff construction in Barber and Candès (2015, Equation (1.4)). Indeed nearly the same tools can be used to find a suitable $\bm{s}$; in Appendix B.1 we slightly adapt three methods from Barber and Candès (2015) and Candès et al. (2018) for computing suitable $\bm{s}$. The computational complexity of Algorithm 1 depends on the method used to find $\bm{s}$, with the fastest option requiring $O\left(np^{2}\right)$ time.

The differences between Equation (3.2) and the fixed-X knockoff construction are the additional accounting for the mean by adding/subtracting $\hat{\bm{\mu}}$ , the lack of requiring that $\bm{X}$ have normalized columns, the “ $\prec$ ” relationships (as opposed to “ $\preceq$ ”), and most importantly the requirement that $\bm{U}$ be random. Indeed, as can be seen in the proof of Theorem 3.1, the precise uniform distribution of $\bm{U}$ is crucial. And it bears repeating that unlike fixed-X knockoffs, Algorithm 1 produces valid model-X knockoffs and hence permits importance statistics without the “sufficiency property” and applies to any $F_{Y|X}$ , not just homoscedastic linear regression.
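The listing of Algorithm 1 and Equation (3.2) are not reproduced in this section, so the following Python sketch is a reconstruction under stated assumptions rather than a transcription. By analogy with the fixed-X construction, it assumes the update $\tilde{\bm{X}}=\bm{1}_{n}\hat{\bm{\mu}}^{\top}+(\bm{X}-\bm{1}_{n}\hat{\bm{\mu}}^{\top})(\bm{I}_{p}-\hat{\bm{\Sigma}}^{-1}\mathrm{diag}\{\bm{s}\})+\bm{U}\bm{L}$ with $\bm{L}^{\top}\bm{L}=n(2\,\mathrm{diag}\{\bm{s}\}-\mathrm{diag}\{\bm{s}\}\hat{\bm{\Sigma}}^{-1}\mathrm{diag}\{\bm{s}\})$, where $\bm{U}$ holds the last $p$ columns of the orthonormalization of $[\bm{1}_{n},\bm{X},\bm{W}]$. The choice $s_{j}=\lambda_{\min}(\hat{\bm{\Sigma}})$ is a simple feasible one satisfying $\bm{0}\prec\mathrm{diag}\{\bm{s}\}\prec 2\hat{\bm{\Sigma}}$ and is not one of the methods of Appendix B.1. The function name is hypothetical.

```python
import numpy as np

def conditional_gaussian_knockoffs(X, rng):
    """Sketch of a knockoff draw conditional on (mu_hat, Sigma_hat); needs n > 2p."""
    n, p = X.shape
    if n <= 2 * p:
        raise ValueError("this construction needs n > 2p")
    mu_hat = X.mean(axis=0)
    Xc = X - mu_hat
    Sigma_hat = Xc.T @ Xc / n

    # Assumed simple choice of s: every s_j equals the smallest eigenvalue of
    # Sigma_hat, so that 0 < diag(s) < 2 * Sigma_hat in the positive-definite order.
    s = np.full(p, np.linalg.eigvalsh(Sigma_hat)[0])
    D = np.diag(s)
    Sinv_D = np.linalg.solve(Sigma_hat, D)

    # L with L^T L = n * (2 diag(s) - diag(s) Sigma_hat^{-1} diag(s)).
    C = n * (2 * D - D @ Sinv_D)
    L = np.linalg.cholesky((C + C.T) / 2).T

    # Orthonormalize [1_n, X, W]; flipping signs so that R has a positive
    # diagonal makes the QR factor coincide with the Gram-Schmidt output.
    W = rng.standard_normal((n, p))
    Q, R = np.linalg.qr(np.column_stack([np.ones(n), X, W]))
    Q = Q * np.sign(np.diag(R))
    U = Q[:, p + 1:]  # orthonormal columns, orthogonal to 1_n and to X

    return mu_hat + Xc @ (np.eye(p) - Sinv_D) + U @ L

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 6)) @ rng.standard_normal((6, 6))
Xk = conditional_gaussian_knockoffs(X, rng)
```

Under these assumptions $\tilde{\bm{X}}$ has the same column means and empirical covariance as $\bm{X}$, and the cross-covariance between $\bm{X}$ and $\tilde{\bm{X}}$ equals $\hat{\bm{\Sigma}}-\mathrm{diag}\{\bm{s}\}$. These are necessary conditions for conditional exchangeability, not a proof of it.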

Theorem 3.1.

Algorithm 1 generates valid knockoffs for model (3.1).

The challenge in proving Theorem 3.1 is that the conditional distribution of $[\bm{X},\tilde{\bm{X}}]\mid T(\bm{X})$ is supported on an uncountable subset of zero Lebesgue measure, and its distribution is only defined through the distribution of $\bm{X}\mid T(\bm{X})$ and the conditional distribution of $\tilde{\bm{X}}\mid\bm{X}$. Although $\bm{X}\mid T(\bm{X})$ and $\tilde{\bm{X}}\mid\bm{X}$ are both conditionally uniform on their respective supports, and the latter’s normalizing constant does not depend on $\bm{X}$, these facts alone are not sufficient to conclude that $[\bm{X},\tilde{\bm{X}}]\mid T(\bm{X})$ is uniform on its support (see Appendix A.2.1 for a simple counterexample), which is what we need to prove. Although these distributions on zero-Lebesgue-measure manifolds can be characterized using geometric measure theory (as in, e.g., Diaconis et al. (2013)), we bypass this approach by directly using the concept of invariant measures from topological measure theory; see Appendix A.2.2.

A useful consequence of Theorem 3.1 is the double robustness property that if knockoffs are constructed by Algorithm 1 and knockoff statistics are used which obey the sufficiency property of Barber and Candès (2015) (that is, the knockoff statistics only depend on $\bm{y}$ and $[\bm{X},\,\tilde{\bm{X}}]$ through $[\bm{1}_{n},\bm{X},\,\tilde{\bm{X}}]^{\top}\bm{y}$ and $[\bm{1}_{n},\bm{X},\,\tilde{\bm{X}}]^{\top}[\bm{1}_{n},\bm{X},\,\tilde{\bm{X}}]$), then the resulting variable selection controls the FDR exactly as long as at least one of the following holds:

• $\bm{x}_{i}\stackrel{\text{i.i.d.}}{\sim}\mathcal{N}(\bm{\mu},\bm{\Sigma})$ for some $\bm{\mu}$ and $\bm{\Sigma}$, both unknown (regardless of $F_{Y|X}$), or

• $y_{i}\,|\,\bm{x}_{i}\stackrel{\text{i.i.d.}}{\sim}\mathcal{N}(\bm{x}_{i}^{\top}\bm{\beta},\sigma^{2})$ for some $\bm{\beta}$ and $\sigma^{2}$, both unknown (regardless of $F_{X}$).

In Appendix B.1 we extend Algorithm 1 to the case when the mean is known (Algorithm 7) or a subset of columns of $\bm{X}$ are additionally conditioned on (Algorithm 8). Both extensions may be of independent interest, but will also be used as subroutines when generating knockoffs for Gaussian graphical models in Section 3.2.

3.1.2 Numerical Examples

We present two simulations comparing the power of conditional knockoffs to the analogous unconditional construction that uses the exactly-known $F_{X}$ . We remind the reader that the simulation setting is at the beginning of Section 3.

The vector $\bm{s}$ in Algorithm 1 is computed using the SDP method of Equation (B.3), and the analogous vector for the unconditional construction is chosen by the analogous SDP method (Candès et al., 2018). Although in both examples $n^{*}>2p$, the number of unknown parameters in the Gaussian model for $F_{X}$ is $p+\frac{p(p+1)}{2}>500,000$, vastly larger than any of the sample sizes.

Figure 1a fixes $p=1000$ and plots the difference in power between unconditional and conditional knockoffs as $n>2p$ increases for a few different signal amplitudes. The power of the conditional and unconditional constructions is quite close except when $n=2.5p$ is just above its threshold of $2p$ , and even then the power of the conditional construction is respectable.

Figure 1b shows how unlabeled samples improve the power of conditional knockoffs. The model is the same as the first example but the labeled sample size is fixed at $n=300$ and we vary the number of unlabeled samples. Again, the power of the conditional and unconditional constructions is extremely close except when $n^{*}=2.3p$ is just above its threshold, and again even in that setting the power of the conditional construction is respectable. Note that unlabeled samples here have enabled the low-dimensional Gaussian construction to apply in a high-dimensional setting with $n<p$ , since $n^{*}>2p$ .

3.2 Gaussian Graphical Model

Ignoring unlabeled data, the method of the previous subsection is constrained to low-dimensional (or perhaps more accurately, medium-dimensional, since it allows $p=\Omega(n)$ ) settings and cannot be immediately extended to high dimensions. In many applications however, particularly in high dimensions, the covariates are modeled as multivariate Gaussian with sparse precision matrix $\bm{\Sigma}^{-1}$ , and when the sparsity pattern is known a priori, we can condition on much less. For instance, time series models such as autoregressive models assume a banded precision matrix with known bandwidth, and the model used in this subsection would also allow for nonstationarity. Spatial models often assume a (known) neighborhood structure such that the only nonzero precision matrix entries are index pairs corresponding to spatial neighbors.

Precisely, suppose $\bm{X}$ ’s rows $\bm{x}_{i}^{\top}$ are i.i.d. draws from a distribution known to be in the model

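$$\left\{\mathcal{N}(\bm{\mu},\bm{\Sigma}):\bm{\mu}\in\mathbb{R}^{p},\ \bm{\Sigma}\succ\bm{0},\ (\bm{\Sigma}^{-1})_{j,k}=0\text{ for all }j\neq k\text{ with }(j,k)\notin E\right\},\tag{3.3}$$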

where $E\subseteq[p]\times[p]$ is some symmetric set of integer pairs (i.e., $(j,k)\in E\Rightarrow(k,j)\in E$ ) with no self-loops. Then the undirected graph $G\;:=\;([p],E)$ defines a Gaussian graphical model with vertex set $[p]$ and edge set $E$ . For any $j\in[p]$ , define $I_{j}=\{k:(j,k)\in E\}$ for the vertices that are adjacent to $j$ . We will use the terms ‘vertex’ ( $j\in[p]$ ) and ‘variable’ ( $X_{j}$ ) interchangeably. $\hat{\bm{\mu}}$ and $\hat{\bm{\Sigma}}_{E}$ together constitute a sufficient statistic, where $\hat{\bm{\Sigma}}_{E}:=\left\{\hat{\bm{\Sigma}}_{j,k}:j=k\text{ or }(j,k)\in E\right\}$ . We will show in this section how to generate conditional knockoffs, and we will characterize the sparsity patterns $E$ for which we can generate knockoffs with $\tilde{\bm{X}}_{j}\neq\bm{X}_{j}$ for all $j\in[p]$ .

Remark 3.

More generally, sparsity in the precision matrix, but with unknown sparsity pattern, is a common assumption in Gaussian graphical models which are used to model many types of data in high dimensions such as gene expressions. Although the construction in this section no longer holds exactly when the sparsity pattern is unknown, approximate knockoffs could still be constructed by first using a method for estimating the sparsity pattern (Bühlmann and van de Geer,, 2011, Chapter 13) and then treating it as known. Note that we only require the edge set $E$ to contain all non-zero entries of $\bm{\Sigma}^{-1}$ , which is no harder than the exact identification of the non-zero entries.

3.2.1 Generating Conditional Knockoffs by Blocking

First consider the ideal case when the graph $G$ separates into disjoint connected components whose respective vertex sets are $V_{1},\dots,V_{\ell}$ . Then $X$ can be divided into independent subvectors, $X_{V_{1}},\dots,X_{V_{\ell}}$ , and if each $|V_{k}|<n/2$ , we can construct low-dimensional conditional knockoffs separately and independently for each $\bm{X}_{V_{k}}$ as in Section 3.1. Moving to the general case when $G$ is connected, we can do something intuitively similar by conditioning on a subset of variables in addition to $\hat{\bm{\mu}}$ and $\hat{\bm{\Sigma}}_{E}$ . If there is a subset of vertices $B$ such that the subgraph $G_{B}$ induced by deleting $B$ separates into small disjoint connected components, then we should be able to construct conditional knockoffs as above for $\bm{X}_{B^{c}}$ by conditioning on $\bm{X}_{B}$ . We think of the variables in $B$ as being blocked to separate the graph into small disjoint parts, hence we refer to this $B$ as a blocking set.

The following definition formalizes when we can apply the above procedure, and Algorithm 2 states that procedure precisely.

Definition 3.1.

A graph $G$ is $n$ -separated by a set $B\subset[p]$ if the subgraph $G_{B}$ induced by deleting all vertices in $B$ has connected components whose respective vertex sets we denote by $V_{1},\dots,V_{\ell}$ such that for all $k\in[\ell]$ ,

$$2|V_{k}|+|I_{V_{k}}\,\cap\,B|<n,$$

where $I_{V_{k}}\;:=\;\bigcup\limits_{j\in V_{k}}I_{j}$ is the neighborhood of $V_{k}$ in $G$ .

Note that when the $V_{k}$ separated $X$ into independent subvectors, we only needed $2|V_{k}|<n$ ; now that they only represent conditionally independent subvectors, we must also account for $V_{k}$ ’s neighbors in $B$ that we condition on, resulting in the requirement that $2|V_{k}|+|I_{V_{k}}\,\cap\,B|<n$ .
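The requirement $2|V_{k}|+|I_{V_{k}}\,\cap\,B|<n$ can be checked directly for a given graph and candidate blocking set. The Python sketch below does so with a breadth-first search over the unblocked vertices. The function names and the 0-based vertex labels are illustrative assumptions.

```python
from collections import deque

def components_after_blocking(p, edges, B):
    """Connected components of the graph on {0,...,p-1} after deleting B."""
    adj = {j: set() for j in range(p)}
    for j, k in edges:
        adj[j].add(k)
        adj[k].add(j)
    free = set(range(p)) - set(B)
    seen, comps = set(), []
    for start in sorted(free):
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            comp.add(v)
            for u in adj[v]:
                if u in free and u not in seen:
                    seen.add(u)
                    queue.append(u)
        comps.append(comp)
    return comps, adj

def is_n_separated(p, edges, B, n):
    """True if every component V_k satisfies 2|V_k| + |I_{V_k} & B| < n."""
    comps, adj = components_after_blocking(p, edges, B)
    blocked = set(B)
    for V in comps:
        neighborhood = set().union(*(adj[j] for j in V))
        if 2 * len(V) + len(neighborhood & blocked) >= n:
            return False
    return True

# Path graph 0-1-2-3-4-5 with vertices 2 and 4 blocked.
path_edges = [(j, j + 1) for j in range(5)]
ok_with_6 = is_n_separated(6, path_edges, {2, 4}, n=6)
ok_with_5 = is_n_separated(6, path_edges, {2, 4}, n=5)
```

In this example the component $\{0,1\}$ has one blocked neighbor, so it needs $2\cdot 2+1<n$. The check therefore passes for $n=6$ and fails for $n=5$.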

Algorithm 2 constructs knockoffs for the model (3.3) by first conditioning on $\bm{X}_{B}$ and then running a slight modification of Algorithm 1 (Algorithm 8 in Appendix B.1.3) on the variables/columns $V_{k}$ corresponding to the induced subgraphs. The computational complexity of Algorithm 2 is $O\left(n\sum_{k=1}^{\ell}\left(\left|I_{V_{k}}\,\cap\,B\right|^{2}\left|V_{k}\right|+|V_{k}|^{2}\right)\right)$ , which is upper-bounded by $O\left(\ell nn^{\prime 2}+np\max_{k\in[\ell]}|I_{V_{k}}\,\cap\,B|^{2}\right)$ (both complexities assume the most efficient construction of $\bm{s}$ is used as a primitive in Algorithm 8).

Theorem 3.2.

Algorithm 2 generates valid knockoffs for model (3.3).

Algorithm 2 raises two key issues: how to find a suitable blocking set $B$ , and how to address the fact that $\tilde{\bm{X}}_{B}=\bm{X}_{B}$ are trivial knockoffs, so using conditional knockoffs from Algorithm 2 will have no power to select any of the variables in $B$ .

Algorithm 3 provides a simple greedy way to find a suitable $B$ or, given an initial blocking set $B$ , can also be used to shrink $B$ (see Proposition B.3). The algorithm visits every vertex in $G$ once in the order $\pi$ and decides whether each vertex it visits is blocked or free (not blocked). Meanwhile, it constructs a graph $\bar{G}$ from $G$ , which gets expanded every time a vertex $j$ is determined to be free: all pairs of $j$ ’s neighbors in $\bar{G}$ get connected (if not already) and a new vertex $\tilde{j}$ that has the same neighborhood as $j$ in $\bar{G}$ is added to the graph. A vertex is blocked if, when it is visited, its degree in $\bar{G}$ is greater than $n^{\prime}-3$ .

Proposition 3.3.

If $B$ is the blocking set determined by Algorithm 3 with input $(\pi,n^{\prime})$ , then $G$ is $n$ -separated by $B$ for any $n\geq n^{\prime}$ .
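Since the listing of Algorithm 3 is not reproduced here, the following Python sketch implements only the verbal description above: visit vertices in the order $\pi$, block a vertex when its current degree in $\bar{G}$ exceeds $n^{\prime}-3$, and otherwise connect all pairs of its neighbors and add a copy vertex with the same neighborhood. The function name, the 0-based labels, and the tuple used to label copy vertices are illustrative assumptions, and this is not the more efficient implementation of Appendix B.2.

```python
def greedy_blocking_set(p, edges, order, n_prime):
    """Greedy search for a blocking set B, following the description of Algorithm 3."""
    # adj plays the role of the expanding graph G-bar.
    adj = {j: set() for j in range(p)}
    for j, k in edges:
        adj[j].add(k)
        adj[k].add(j)
    blocked = set()
    for j in order:
        nbrs = set(adj[j])
        if len(nbrs) > n_prime - 3:
            blocked.add(j)  # degree too large: j is blocked
            continue
        # j is free: connect all pairs of its neighbors ...
        for u in nbrs:
            adj[u] |= nbrs - {u}
        # ... and add a copy vertex with the same neighborhood as j.
        copy = ("copy", j)
        adj[copy] = set(nbrs)
        for u in nbrs:
            adj[u].add(copy)
    return blocked

# Path graph 0-1-2-3-4-5, visited in its natural order with n' = 6.
path_edges = [(j, j + 1) for j in range(5)]
B = greedy_blocking_set(6, path_edges, order=range(6), n_prime=6)
```

On this path graph the sketch blocks vertices 2 and 4, leaving components $\{0,1\}$, $\{3\}$, and $\{5\}$, each of which satisfies $2|V_{k}|+|I_{V_{k}}\,\cap\,B|<6$, in line with Proposition 3.3.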

Algorithm 3 is meant to be intuitive but a more efficient implementation is given in Appendix B.2. Algorithm 3 can also be made even greedier by choosing the next $j$ at each step as the unvisited vertex in $[p]$ with the smallest degree in $\bar{G}$ (breaking ties at random), instead of following the ordering $\pi$ . The algorithm also takes an input $n^{\prime}$ , which one may prefer to choose smaller than $n$ for computational or statistical efficiency, as we investigate in Section 3.2.2 (smaller $n^{\prime}$ will mean smaller $V_{k}$ to generate knockoffs for in Line 2 of Algorithm 2). The flexibility in both $\pi$ and $n^{\prime}$ is mainly motivated by the second aforementioned issue of trivial knockoffs $\tilde{\bm{X}}_{B}=\bm{X}_{B}$ , addressed next.

An intuitive solution to prevent the trivial knockoffs $\tilde{\bm{X}}_{B}$ in Algorithm 2 is to split the rows of $\bm{X}$ in half and run Algorithm 2 on each half with disjoint blocking sets $B_{1}$ and $B_{2}$ such that $G$ is $n/2$-separated by both blocking sets. Then the knockoffs for variables in $B_{1}$ will be trivial for half the rows of $\tilde{\bm{X}}$ and those for variables in $B_{2}$ will be trivial for the other half of the rows of $\tilde{\bm{X}}$, but since $B_{1}$ and $B_{2}$ are disjoint, no variables will have entirely trivial knockoffs. Even though some knockoff variables are trivial for half their rows, we find the power loss for these variables to be surprisingly small; see the simulations in Section 3.2.2.

This data-splitting idea is generalized in Algorithm 4 to splitting the rows of $\bm{X}$ into $m$ folds and running Algorithm 2 on each fold with a different input $B$ .

In Algorithm 4, since $\bigcup\limits_{i=1}^{m}B_{i}^{c}=[p]$ , for each $j\in[p]$ there is at least one $i$ such that $j\notin B_{i}$ , and thus $\tilde{\bm{X}}_{j}\neq\bm{X}_{j}$ . Before characterizing when it is possible to find such $B_{i}$ , we formalize the requirements of Algorithm 4 into a definition.

Definition 3.2.

$G=([p],E)$ is $(m,n)$-coverable if there exist subsets $B_{1},\dots,B_{m}$ of $[p]$ and integers $n_{1},\dots,n_{m}$ such that $\bigcup\limits_{i=1}^{m}B_{i}^{c}=[p]$, $G$ is $n_{i}$-separated by $B_{i}$ for all $i=1,\dots,m$, and $\sum\limits_{i=1}^{m}n_{i}\leq n$.

The following common graph structures are $(m,n)$ -coverable:

• If the largest connected component of $G$ is not larger than $(n-1)/2$, $G$ is $(1,n)$-coverable.

• If $G$ is a Markov chain of order $r$ (making the model a time-inhomogeneous AR($r$) model), i.e., $E=\{(i,j):1\leq|i-j|\leq r\}$, and $n\geq 2+8r$, then $G$ is $(2,n)$-coverable.

• If $G$ is $m$-colorable (also known as $m$-partite), i.e., the vertices can be divided into $m$ disjoint sets such that the vertices in each subset are not adjacent, and $n\geq m(3+\max_{j}|I_{j}|)$, then $G$ is $(m,n)$-coverable. For example,

  – A tree ($m=2$) in which the maximal number of children of any vertex is no more than $(n-8)/2$,

  – A circle with $p$ even ($m=2$) and $n\geq 10$, or with $p$ odd ($m=3$) and $n\geq 15$,

  – A finite subset of the $d$-dimensional lattice $\mathbb{Z}^{d}$ where vertices separated by distance 1 are adjacent ($m=2$) and $n\geq 6+4d$.

For simple graphs such as those listed above, finding appropriate blocking sets $B_{i}$ can be done by inspection; see Appendix B.2.3. More generally, determining $(m,n)$-coverability for an arbitrary graph or, given an $(m,n)$-coverable graph, determining blocking sets $B_{i}$ that are optimal in some sense (e.g., minimizing $\Big|\bigcup\limits_{i\leq m}B_{i}\Big|$) is beyond the scope of this work. However, in Algorithm 11 in Appendix B.2, we provide a randomized greedy search for suitable $B_{i}$ that can be applied in practice when the graph structure is too complex to find such $B_{i}$ by inspection.

3.2.2 Numerical Examples

We present two simulations comparing the power of Algorithm 4 with its unconditional counterpart, one a time-varying AR$(1)$ model and the other a time-varying AR$(10)$. Line 2 of Algorithm 2 uses Algorithm 1 with the vector $\bm{s}$ computed using the SDP method of Equation (B.3), and the unconditional construction also uses the SDP method (Candès et al., 2018). Algorithm 4 was run with $m=2$ and $B_{1}$ and $B_{2}$ chosen by fixing $n^{\prime}$ (specified in the following paragraphs) and running Algorithm 3 twice with two different $\pi$’s. The first run used the original variable ordering for $\pi$, and the second run used ordered $B_{1}$ followed by the ordered remaining variables. (This is a nonrandomized version of Algorithm 11, which works well for AR models because of their graph structure.)

We remind the reader that the simulation setting is at the beginning of Section 3.

In Figure 2a, the $\bm{x}_{i}\in\mathbb{R}^{2000}$ are i.i.d. AR$(1)$ with autocorrelation coefficient 0.3 (although the autocorrelation coefficient does not vary with time, this is not assumed by Algorithm 4). We chose $n^{\prime}=40$, resulting in $210$ variables that are each blocked in half the samples. The number of unknown parameters is $3p-1=5,999$ while the sample sizes simulated are much smaller, $n\leq 350$, yet the power of conditional knockoffs is nearly indistinguishable from that of unconditional knockoffs, which use the exactly-known distribution of $X$.

In Figure 2b, the $\bm{x}_{i}\in\mathbb{R}^{2000}$ are time-varying AR(10); specifically, $\bm{x}_{i}\stackrel{\text{i.i.d.}}{\sim}\mathcal{N}(\bm{0},\bm{\Sigma})$ where $\bm{\Sigma}$ is the renormalization of $\bm{\Sigma}^{0}$ to have 1’s on the diagonal, and $\left(\bm{\Sigma}^{0}\right)^{-1}_{j,k}=\mathbf{1}\{j=k\}-0.05\cdot\mathbf{1}\{1\leq|j-k|\leq 10\}.$ We chose $n^{\prime}=50$, resulting in $1,660$ variables that are each blocked in half the samples. The number of unknown parameters is $2p+10p-10\times 11/2=23,945$ while the sample sizes are again much smaller, $n\leq 500$, and the power difference between conditional and unconditional knockoffs remains very slight.
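The two parameter counts quoted for Figures 2a and 2b can be reproduced with a short calculation. The sketch below assumes (this is our reading of the counts, not a statement from the text) that the parameters are the $p$ means, the $p$ diagonal entries of the precision matrix, and the $p-k$ entries on each of its off-diagonals $k=1,\dots,b$ for bandwidth $b$; the function name is hypothetical.

```python
def banded_gaussian_param_count(p, bandwidth):
    """Count free parameters under the assumed parameterization:
    p means, p diagonal precision entries, and (p - k) entries on
    each off-diagonal k = 1..bandwidth of the symmetric precision matrix."""
    off_diagonal = sum(p - k for k in range(1, bandwidth + 1))
    return 2 * p + off_diagonal


ar1_count = banded_gaussian_param_count(2000, 1)     # 3p - 1
ar10_count = banded_gaussian_param_count(2000, 10)   # 2p + 10p - 10*11/2
```

Under this reading, both counts exceed the simulated sample sizes by more than an order of magnitude.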

Note that the simulation in Figure 2a blocked on just roughly 10% of its variables (i.e., $|B_{1}\cup B_{2}|/p\approx 10\%$ ), and since the signals are uniformly distributed, one might worry that in specific applications where the blocked variables and signals happened to align, the power loss might be much worse. But Figure 2b’s simulation blocked on over 80% of its variables and still suffered very little power loss compared to unconditional knockoffs, suggesting that even the blocking of signal variables has only a small effect on power thanks to the data splitting in Algorithm 4.

Finally, we examine the sensitivity of the power of conditional knockoffs to the choice of $n^{\prime}$ in Algorithm 3 for choosing the $B_{i}$. In the case of AR($1$) with $n=300$ and $p=2000$, Figure 3a shows the averaged density of original-knockoff correlations $\tilde{\rho}_{j}=\bm{X}_{j}^{\top}\tilde{\bm{X}}_{j}/(\|\bm{X}_{j}\|\|\tilde{\bm{X}}_{j}\|)$ for three different choices of $n^{\prime}$, and Figure 3b shows the corresponding power curves. (3200 independent simulations were averaged and the kernel density estimate used a Gaussian kernel with a bandwidth of 0.01.) Recall that smaller $n^{\prime}$ means blocking on more variables but generating better knockoffs for the non-blocked variables in each step $i$ of Algorithm 4. Figure 3a shows quite different correlation profiles for different $n^{\prime}$, with $n^{\prime}=40$ seeming to provide the density with mass most concentrated to the left. Indeed Figure 3b shows $n^{\prime}=40$ is most powerful, but only by a small margin: the power is quite insensitive to the choice of $n^{\prime}$. In applications, the choice of $n^{\prime}$ may rely on an approximate version of Figure 3a obtained by simulating $\bm{X}$ from an estimated model.
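The original-knockoff correlation $\tilde{\rho}_{j}$ is the (uncentered) cosine similarity between a column and its knockoff. A minimal sketch of how it could be computed is given below; the function name and the toy matrices are illustrative assumptions, and the kernel density estimate of Figure 3a is not reproduced.

```python
import math


def knockoff_correlations(X, X_tilde):
    """Return rho_j = X_j^T Xt_j / (||X_j|| ||Xt_j||) for each column j.
    X and X_tilde are lists of rows with identical shapes."""
    n, p = len(X), len(X[0])
    rhos = []
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        col_t = [X_tilde[i][j] for i in range(n)]
        dot = sum(a * b for a, b in zip(col, col_t))
        norm = math.sqrt(sum(a * a for a in col))
        norm_t = math.sqrt(sum(b * b for b in col_t))
        rhos.append(dot / (norm * norm_t))
    return rhos


X = [[1.0, 2.0], [2.0, -1.0], [3.0, 0.5]]
X_tilde = [[1.0, -1.0], [2.0, -2.0], [3.0, 1.0]]  # column 0 is a trivial knockoff
rhos = knockoff_correlations(X, X_tilde)
```

A blocked variable, whose knockoff equals the original column, has $\tilde{\rho}_{j}=1$, which is why mass concentrated to the left in Figure 3a is desirable.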

In Appendix D, we provide additional experiments that compare the performance of conditional knockoffs that are generated using different sufficient statistics (Appendix D.3) and examine the scenario where a superset of the edge set $E$ is unknown and is instead estimated using the data (Appendix D.4).

3.3 Discrete Graphical Model

We now turn to applying conditional knockoffs to discrete models for $X$. Such models are used, for example, for survey responses, general binary covariates, and single nucleotide polymorphisms (mutation counts at loci along the genome) in genomics. Many discrete models assume some form of local dependence, for instance in time or space. We will show how to construct conditional knockoffs when that local dependence is modeled by (undirected) graphical models (see, e.g., Edwards (2000, Chapter 2)), for example, Ising models, Potts models, and Markov chains.

A random vector $X$ is Markov with respect to a graph $G=([p],E)$ if for any two disjoint subsets $A,A^{\prime}\subset[p]$ and a cut set $B\subset[p]$ such that every path from $A$ to $A^{\prime}$ passes through $B$, it holds that $X_{A}\perp\!\!\!\perp X_{A^{\prime}}\mid X_{B}$. Denote by $I_{j}$ the vertices adjacent to $j$ in $G$ (excluding $j$ itself). $X$ being Markov implies the local Markov property that $X_{j}\perp\!\!\!\perp X_{(\{j\}\cup I_{j})^{c}}\mid X_{I_{j}}$.

In this section, we assume $X$ is locally Markov with respect to a known graph $G$ and each variable $X_{j}$ takes ${K}_{j}\geq 2$ discrete values (for simplicity label these values $[{K}_{j}]=\{1,\dots,{K}_{j}\}$ ). Although the algorithms in this section can be applied when ${K}_{j}$ is infinite, we assume for simplicity that ${K}_{j}$ is finite. Formally, we assume

[TABLE]

3.3.1 Generating Conditional Knockoffs by Blocking

Our algorithm for generating conditional knockoffs for discrete graphical models again uses the ideas of blocking and data splitting from Section 3.2. However, unlike Section 3.2, which built upon the low-dimensional construction of Section 3.1, there is no known efficient algorithm for constructing conditional knockoffs for general discrete models in low dimensions. Therefore, instead of blocking to isolate small graph components, we now block to isolate individual vertices, and consequently need to be more careful with data splitting to ensure the resulting knockoffs remain powerful.

Suppose $B$ is a cut set such that every path connecting any two different vertices in $B^{c}$ passes through $B$ ; call such a set a global cut set with respect to $G$ . The local Markov property implies the elements of $X_{B^{c}}$ are conditionally independent given $X_{B}$ :

[TABLE]

where we used the fact that for any $j\in B^{c}$, $I_{j}\subseteq B$ and $X_{j}\perp\!\!\!\perp X_{B\setminus I_{j}}\mid X_{I_{j}}$. For any $A\subseteq[p]$ and $k_{1},\dots,k_{p}$, denote by $\bm{k}_{A}$ the vector of $k_{j}$’s for $j\in A$ and by $[\bm{K}_{A}]$ the cartesian product $\prod\limits_{j\in A}[{K}_{j}]$. Then the conditional probability $\mathbb{P}\left(X_{j}\left|\ X_{I_{j}}\right.\right)$ can be written as

[TABLE]

with parameters $\theta_{j}(k_{j},\bm{k}_{I_{j}})\in[0,1]$ for all $k_{j}$, $\bm{k}_{I_{j}}$, with the convention that $0^{0}:=1$. Let $\psi_{B}(X_{B})$ be the probability mass function for $X_{B}$; the joint distribution for $n$ i.i.d. samples from the graphical model is then

[TABLE]

where $N_{j}(k_{j},\bm{k}_{I_{j}})=\sum_{i=1}^{n}\mathbf{1}\{X_{i,j}=k_{j},\bm{X}_{i,I_{j}}=\bm{k}_{I_{j}}\}$. Let $T_{B}(\bm{X})$ be the statistic that includes $\bm{X}_{B}$ and the counts $N_{j}(k_{j},\bm{k}_{I_{j}})$ for all $j\in B^{c}$ and all possible $(k_{j},\bm{k}_{I_{j}})$. Then $T_{B}(\bm{X})$ is a sufficient statistic for model (3.4). Conditional on $T_{B}(\bm{X})$, the random vectors $\{\bm{X}_{j},j\in B^{c}\}$ are independent and each $\bm{X}_{j}$ is uniformly distributed on all $\bm{w}\in[{K}_{j}]^{n}$ such that $\sum_{i=1}^{n}\mathbf{1}\{w_{i}=k_{j},\bm{X}_{i,I_{j}}=\bm{k}_{I_{j}}\}=N_{j}(k_{j},\bm{k}_{I_{j}})$ for any $(k_{j},\bm{k}_{I_{j}})$. Algorithm 5 generates knockoffs conditional on $T_{B}(\bm{X})$ by, for each $j$, uniformly permuting subsets of entries of $\bm{X}_{j}$ to produce $\tilde{\bm{X}}_{j}$. The subsets of entries are defined by blocks of identical rows of $\bm{X}_{I_{j}}$ so that $\sum_{i=1}^{n}\mathbf{1}\{\tilde{\bm{X}}_{i,j}=k_{j},\bm{X}_{i,I_{j}}=\bm{k}_{I_{j}}\}=N_{j}(k_{j},\bm{k}_{I_{j}})$, as required.
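The permutation step described above can be sketched for a single column $j\in B^{c}$ as follows. This is only an illustration under our own assumptions, not a transcription of Algorithm 5: the function names, the list-based data layout, and the use of Python's `random.Random.shuffle` for the uniform permutation are all choices made for the sketch.

```python
import random
from collections import Counter, defaultdict


def permute_within_neighbor_blocks(x_j, X_nbrs, rng):
    """Permute entries of column x_j uniformly within each group of samples
    that share an identical neighbor configuration (row of X_nbrs)."""
    groups = defaultdict(list)
    for i, row in enumerate(X_nbrs):
        groups[tuple(row)].append(i)
    x_tilde = list(x_j)
    for idx in groups.values():
        values = [x_j[i] for i in idx]
        rng.shuffle(values)               # uniform permutation within the block
        for i, v in zip(idx, values):
            x_tilde[i] = v
    return x_tilde


def joint_counts(x_j, X_nbrs):
    """N_j(k_j, k_{I_j}): counts of each (value, neighbor configuration) pair."""
    return Counter((v, tuple(row)) for v, row in zip(x_j, X_nbrs))


rng = random.Random(0)
X_nbrs = [(1, 1), (1, 2), (1, 1), (1, 2), (1, 1), (2, 2)]  # rows of X_{I_j}
x_j = [1, 2, 2, 1, 1, 2]
x_tilde = permute_within_neighbor_blocks(x_j, X_nbrs, rng)
```

Because values only move between samples with the same neighbor configuration, the counts $N_{j}(k_{j},\bm{k}_{I_{j}})$ are preserved exactly. A sample whose neighbor configuration is unique (the last row here) necessarily keeps its original value.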

The computational complexity of Algorithm 5 is $O\left(\sum\limits_{j\in B^{c}}(n+\min(\prod\limits_{\ell\in I_{j}}{K}_{\ell},n|I_{j}|))\right)$ , which is shown in Appendix B.3. If $n>\max_{j\in B^{c}}\prod\limits_{\ell\in I_{j}}{K}_{\ell}$ , as needed to guarantee nontrivial knockoffs for all $j\in B^{c}$ are generated with positive probability, then the complexity can be simplified to $O\left(n(p-|B|)\right)$ . In general, Algorithm 5’s computational complexity is bounded by the simple expression $O(np\bar{d})$ , where $\bar{d}$ is the average degree in $B^{c}$ .

Theorem 3.4.

Algorithm 5 generates valid knockoffs for model (3.4).

As with Algorithm 2, in Algorithm 5 variables in $B$ are blocked and their knockoffs are trivial: $\tilde{\bm{X}}_{B}=\bm{X}_{B}$ . One way to mitigate this drawback is to, after running Algorithm 5, expand the graph to include the generated knockoff variables and then conduct a second knockoff generation with the expanded graph. We elaborate on this idea and present Algorithm 12, a modified version of Algorithm 5, in Appendix B.4.

Another systematic way to address this issue is to take the same approach as Algorithm 4 by splitting the data and running Algorithm 5 (or Algorithm 12) on each split with different $B$ ’s; see Algorithm 6.

If $n_{i}>\max\limits_{j\in B_{i}^{c}}\prod\limits_{\ell\in I_{j}}{K}_{\ell}$ for all $i\leq m$ and all the model parameters $\theta_{j}(k_{j},\bm{k}_{I_{j}})$ are positive, then Algorithm 6 produces nontrivial knockoffs for all $j$ with positive probability. Note that in the continuous case, similar mild conditions guarantee that Algorithm 4 produces nontrivial knockoffs for all $j$ with probability 1. This is unachievable in general in the discrete case no matter how the sufficient statistic is chosen, as there is always a positive probability (for every $j$ ) that the sufficient statistic takes a value such that $\tilde{\bm{X}}_{j}=\bm{X}_{j}$ is uniquely determined given that sufficient statistic (e.g., if $\bm{X}_{i,j}=1$ for all $i$ ).

One way to ensure $B_{1},\dots,B_{m}$ satisfy the requirements of Algorithm 6 is if assigning each $B_{i}^{c}$ a different color produces a proper coloring of $G$ (a coloring of $G$ is proper if no adjacent vertices have the same color). The end of Section 3.2.1 listed some common graph structures with known chromatic numbers (the chromatic number of a graph $G$ is the minimal $m$ such that $G$ is $m$-colorable), which subsume many common models including Ising models and Potts models. Although not specified in Section 3.2.1, a Markov chain of order $m-1$ is $m$-colorable and a planar graph (map) is 4-colorable. Also, for any graph of maximal degree $d$, a $(d+1)$-coloring can be found in $O(dp)$ time by greedy coloring (Lewis, 2016, Chapter 2). In general, both finding the chromatic number and finding a corresponding coloring of a graph $G$ are NP-hard (Garey and Johnson, 1979), but there exist efficient algorithms that in practice are able to color graphs with a near-optimal number of colors (see Malaguti and Toth (2010) for a survey).
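To make the link between colorings and blocking sets concrete, the sketch below runs a basic greedy coloring on a small path graph and takes each $B_{i}$ to be the complement of one color class. The function names are hypothetical, the greedy routine is a textbook version rather than the specific procedure of Lewis (2016), and the check encodes the observation that $B$ is a global cut set exactly when no two vertices of $B^{c}$ are adjacent.

```python
def greedy_coloring(adj):
    """Give each vertex the smallest color unused by its already-colored
    neighbors; this uses at most (max degree + 1) colors."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def is_global_cut_set(adj, B):
    """True iff every neighbor of every vertex outside B lies in B."""
    return all(u in B for v in adj if v not in B for u in adj[v])


# Path graph 0-1-2-3-4-5, i.e., the graph of a first-order Markov chain.
p = 6
adj = {v: [u for u in (v - 1, v + 1) if 0 <= u < p] for v in range(p)}
color = greedy_coloring(adj)
m = max(color.values()) + 1
# B_i is the complement of color class i.
blocks = [{v for v in adj if color[v] != i} for i in range(m)]
```

On the path graph this recovers the odd/even split: each variable is left unblocked in exactly one of the $m=2$ folds.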

3.3.2 Refined Constructions for Markov Chains

For Markov chains, we develop two alternative conditional knockoff constructions that take advantage of the Markovian structure. Although we generally expect these constructions to dominate Algorithm 6 when $G$ is a Markov chain, we found the difference in power to be negligible in every simulation we tried, and so we defer these algorithms to Appendix B.4 and only provide a brief summary here.

Suppose the components of $X$ follow a $K$ -state discrete Markov chain, and let $\pi^{(1)}_{k}=\mbox{$ \mathbb{P}\left(X_{1}=k\right) $}$ and $\pi^{(j)}_{k,k^{\prime}}=\mbox{$ \mathbb{P}\left(\left.X_{j}=k^{\prime}\ \right|X_{j-1}=k\right) $}$ be the model parameters. Then the joint distribution for $n$ i.i.d. samples is,

[TABLE]

where $N^{(j)}_{k,k^{\prime}}=\sum_{i=1}^{n}\mathbf{1}\{X_{i,j-1}=k,X_{i,j}=k^{\prime}\}$. So all the $N^{(j)}_{k,k^{\prime}}$’s together form a sufficient statistic, which we denote by $T(\bm{X})$. As opposed to the statistics $N_{j}(k_{j},\bm{k}_{\{j-1,j+1\}})$’s used in Section 3.3.1, $T(\bm{X})$ is minimal, and thus we expect that generating knockoffs conditional on it will be more powerful than knockoffs generated conditional on a non-minimal statistic. Conditional on $T(\bm{X})$, the columns of $\bm{X}$ still comprise a Markov chain whose distribution can be used to generate knockoffs in two possible ways:

1. The sequential conditional independent pairs (SCIP) algorithm (Candès et al., 2018; Sesia et al., 2018) has computational complexity exponential in $n$, but by splitting the samples into small folds and generating conditional knockoffs separately for each fold, $n$ is artificially reduced and the computation made tractable.

2. Refined blocking modifies Algorithm 5 by first drawing a new contingency table that is exchangeable with the three-way contingency table for $(\bm{X}_{j-1},\bm{X}_{j},\bm{X}_{j+1})$ and then sampling $\tilde{\bm{X}}_{j}$ given the new contingency table.
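The transition counts $N^{(j)}_{k,k^{\prime}}$ that make up $T(\bm{X})$ are simple to tabulate. A minimal sketch follows, using 0-based column indices and a hypothetical function name; it computes the statistic only and does not implement either of the two knockoff constructions.

```python
from collections import Counter


def transition_counts(X):
    """Return {j: Counter[(k, k')]} for 0-based column indices j >= 1, where
    the count is the number of rows with X[i][j-1] == k and X[i][j] == k'."""
    p = len(X[0])
    return {j: Counter((row[j - 1], row[j]) for row in X) for j in range(1, p)}


X = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
T = transition_counts(X)
```

Each table sums to $n$, and the marginal counts of any column can be read off from the row or column sums of an adjacent table.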

3.3.3 Numerical Examples

We present two simulations, comparing the power of Algorithm 6 with its unconditional counterpart for discrete Markov chains (Sesia et al., 2018) and for Ising models (Bates et al., 2020).

We remind the reader that the simulation setting is at the beginning of Section 3.

In Figure 4a, the $\bm{x}_{i}\in\{0,1\}^{1000}$ are i.i.d. from an inhomogeneous binary Markov chain with $p=1000$ . The initial distribution is ${\mbox{$ \mathbb{P}\left(X_{1}=0\right) $}=\mbox{$ \mathbb{P}\left(X_{1}=1\right) $}=.5}$ , and the transition probabilities

[TABLE]

are randomly generated as

[TABLE]

where $U_{i}^{(j)}\stackrel{\text{i.i.d.}}{\sim}\text{Unif}([0,1])$ but held fixed across all replications. We implemented Algorithm 6 with $B_{1}$ as the even variables and $B_{2}$ as the odds, with $n_{1}=n_{2}=n/2$, and used Algorithm 12 (with $Q=2$) in Line 3. The number of unknown parameters in the model is $2p-1=1,999$ and all plotted power curves have $n\leq 350$. Despite the high-dimensionality, conditional knockoffs are nearly as powerful as the unconditional SCIP procedure of Sesia et al. (2018) which requires knowing the exact distribution of $X$.

In Figure 4b, the $\bm{x}_{i}\in\mathbb{R}^{32\times 32}$ are i.i.d. draws from an Ising model (we use the coupling from the past algorithm (Propp and Wilson, 1996) to sample exactly from this distribution) given by:

[TABLE]

where the vertex set $V=[32]\times[32]$ and the edge set $E$ is all the pairs $(s,t)$ such that $\|s-t\|_{1}=1$. We take $\theta_{s,t}=0.2$ and $h_{s}=0$. Model (3.5) has $2\times 32\times 31+32^{2}=3008$ parameters, again far larger than any of the sample sizes simulated, yet conditional knockoffs are still nearly as powerful as their unconditional counterparts. (We use the default subgraph width $w=5$ in Bates et al. (2020) for generating unconditional knockoffs.) The conditional knockoffs are generated by Algorithm 6 with two-fold data-splitting ($m=2$, vertices are colored by the parity of the sum of their coordinates) and no graph-expanding. Although it is possible to use graph-expanding, the power improvement is negligible because the sample size is quite small relative to the size of the neighborhoods in the expanded graph, resulting in the second round of knockoffs being nearly identical to their original counterparts.
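Both the parameter count and the validity of the parity coloring can be checked by enumerating the lattice edges. The sketch below uses 0-based coordinates and hypothetical names; it assumes one parameter $\theta_{s,t}$ per edge and one parameter $h_{s}$ per vertex, which matches the count stated above.

```python
def grid_edges(side):
    """Edges (s, t) of the side x side lattice with ||s - t||_1 = 1."""
    edges = []
    for r in range(side):
        for c in range(side):
            if r + 1 < side:
                edges.append(((r, c), (r + 1, c)))
            if c + 1 < side:
                edges.append(((r, c), (r, c + 1)))
    return edges


def parity(s):
    """Color of vertex s: parity of the sum of its coordinates."""
    return (s[0] + s[1]) % 2


side = 32
edges = grid_edges(side)
n_params = len(edges) + side * side      # one theta per edge, one h per vertex
proper = all(parity(s) != parity(t) for s, t in edges)
```

Every edge joins vertices of opposite parity, so the two color classes give a proper 2-coloring of the lattice.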

4 Discussion

This paper introduced a way to use knockoffs to perform variable selection with exact FDR control under much weaker assumptions than made in Candès et al. (2018), while retaining nearly as high power in simulations. In fact, our method controls the FDR under arguably weaker assumptions than any existing method (see Section 1.2). The key idea is simple: generate knockoffs conditional on a sufficient statistic. However, finding and proving valid algorithms for doing so required surprisingly sophisticated tools. One particularly appealing property of conditional knockoffs is how they directly leverage unlabeled data for improved power. We conclude with a number of open research questions raised by this paper:

Algorithmic: Perhaps the most obvious question is how to construct conditional knockoffs for models not addressed in this paper. Even for the models in this paper, what is the best way to choose the tuning parameters (e.g., $\bm{s}$ in Algorithm 1, or the blocks $B_{i}$ in Algorithms 4 and 6)?

Robustness: Can techniques like those in Barber et al. (2018) be used to quantify the robustness of conditional knockoffs to model misspecification? Empirical evidence for such robustness is provided in Appendix D.2. Also, it is worth pointing out that there are models for which no ‘small’ sufficient statistic exists, i.e., every sufficient statistic $T(\bm{X})$ has the property that $\bm{X}_{j}\mid\bm{X}_{{\text{-}j}},T(\bm{X})$ is a point mass at $\bm{X}_{j}$, which forces the conditional knockoffs $\tilde{\bm{X}}_{j}$ to be trivial. In such models where the proposal of this paper can only produce trivial knockoffs, could postulating a distribution and generating knockoffs conditional on some (not-sufficient) statistic still improve robustness to the parameter values in the model, relative to generating knockoffs for the same distribution but unconditionally? See Berrett et al. (2018) for a positive example for the related conditional randomization test.

Power: In this paper we always used unconditional knockoffs as a power benchmark for conditional knockoffs, as it seems intuitive that conditioning on less should result in higher power. Can this be formalized, and/or can the cost of conditioning in terms of power be quantified? Combining this with the previous paragraph, we expect there to be a power–robustness tradeoff that can be navigated by conditioning on more or less when generating knockoffs.

Conditioning: There are reasons other than robustness that one might wish to generate knockoffs conditional on a statistic. For instance, if a model for $\bm{X}$ needs to be checked by observing a statistic of $\bm{X}$, generating knockoffs conditional on that statistic would guarantee a form of post-selection inference after model selection. Or when data contains variables that confound the variables of interest, it may be desirable to generate knockoffs conditional on those confounders (e.g., by Algorithm 8) in order to control for them. Also, can the conditioning tools and ideas in this paper be used to relax the assumptions of the conditional randomization test, generalizing Rosenbaum (1984)?

Acknowledgments

D. H. would like to thank Yu Zhao for advice on topological measure theory. L. J. would like to thank Emmanuel Candès, Rina Barber, Natesh Pillai, Pierre Jacob, and Joe Blitzstein for helpful discussions regarding this project. The authors also thank the editors and the three referees for their constructive comments and suggestions.

Appendix A Proofs for Main Text

A.1 Integration of Unlabeled Data

Proof of Proposition 2.3.

Denote by $\bm{X}^{(u)}$ the last $n^{(u)}=n^{*}-n$ rows of $\bm{X}^{*}$. Since the rows of $\bm{X}^{*}$ are independent, $\bm{X}^{(u)}\perp\!\!\!\perp(\bm{y},\bm{X})$. Then by the weak union property, $\bm{X}^{(u)}\perp\!\!\!\perp\bm{y}\mid\bm{X}$. In addition, the condition that $\tilde{\bm{X}}^{*}\perp\!\!\!\perp\bm{y}\mid(\bm{X},\bm{X}^{(u)})$ and the fact that $\tilde{\bm{X}}$ is a function of $\tilde{\bm{X}}^{*}$ imply $\tilde{\bm{X}}\perp\!\!\!\perp\bm{y}\mid(\bm{X},\bm{X}^{(u)})$. By the contraction property, these two together show $\tilde{\bm{X}}\perp\!\!\!\perp\bm{y}\mid\bm{X}$.

Let $\phi:\mathbb{R}^{n^{*}\times 2p}\mapsto\mathbb{R}^{n\times 2p}$ be the mapping that keeps the first $n$ rows of a matrix. We have $[\bm{X},\,\tilde{\bm{X}}]=\phi([\bm{X}^{*},\,\tilde{\bm{X}}^{*}])$ and $[\bm{X},\,\tilde{\bm{X}}]_{\text{swap}(A)}=\phi([\bm{X}^{*},\,\tilde{\bm{X}}^{*}]_{\text{swap}(A)})$ for any subset $A\subseteq[p]$ . The given exchangeability condition implies that

[TABLE]

which is simply $[\bm{X},\,\tilde{\bm{X}}]_{\text{swap}(A)}\,{\buildrel\mathcal{D}\over{=}}\,[\bm{X},\,\tilde{\bm{X}}]\,\Big{|}\,T(\bm{X}^{*})$ . It then follows that

[TABLE]

and we conclude that $\tilde{\bm{X}}$ is a model-X knockoff matrix for $\bm{X}$ . ∎

A.2 Low-Dimensional Gaussian Models

Throughout the appendix, bold-faced capital letters such as $\bm{A}$ are used for any matrix (random or not) except when we need to distinguish between a random matrix and the values it may take, in which case we use bold sans serif letters for the values. For example, we will write $\mathbb{P}\left(\bm{A}=\bm{\mathsf{A}}\right)$ to denote the probability that the random matrix $\bm{A}$ takes the (nonrandom) value $\bm{\mathsf{A}}$ .

This section is organized as follows. Section A.2.1 clarifies a difficulty in the joint uniform distribution on a manifold. Section A.2.2 contains the proof of Theorem 3.1, leaving the proofs of the lemmas to Section A.2.3. Section A.2.4 discusses why a seemingly simpler proof for the theorem fails and thus justifies our technical contribution.

A.2.1 Counterexample for Conditional Uniformity

The following statement is false: ‘If a random variable $A$ is uniform on its support and another random variable $B$ is such that $B\mid A$ is conditionally uniform on its support for every $A$ , with normalizing constant that does not depend on $A$ , then $(A,B)$ is uniform on its support.’ Although this statement seems intuitively true and holds for many simple examples (especially when $A$ and $B$ are both univariate), Figure 5 shows a counterexample. In it, although $X$ is uniform on $(0,1)$ and $(Y,Z)\mid X$ is uniform for every $X$ on a line whose length does not depend on $X$ , the joint distribution of $(X,Y,Z)$ is not uniform on its 2-dimensional support.

A.2.2 Proof of Theorem 3.1

The proof of Theorem 3.1 follows three steps: Lemma A.1 states that the conditional distribution of $[\bm{X},\tilde{\bm{X}}]\mid T(\bm{X})$ is invariant on its support to multiplication by elements of the topological group of orthonormal matrices that have $\bm{1}_{n}$ as a fixed point, Lemma A.2 states that the conditional distribution remains invariant (on the same support) after swapping $\bm{X}_{j}$ and $\tilde{\bm{X}}_{j}$ , and Lemma A.3 states that the invariant measure on the support of $[\bm{X},\tilde{\bm{X}}]\mid T(\bm{X})$ is unique. These three steps combined show that the distributions before and after swapping are the same, and hence $\tilde{\bm{X}}$ is a valid conditional knockoff matrix for $\bm{X}$ .

To streamline notation, we redefine $\hat{\bm{\Sigma}}:=(\bm{X}-\bm{1}_{n}\hat{\bm{\mu}}^{\top})^{\top}(\bm{X}-\bm{1}_{n}\hat{\bm{\mu}}^{\top})$ as $n$ times the sample covariance matrix (it was defined as just the sample covariance matrix in the main text), and redefine $\bm{s}$ such that $\bm{0}_{p\times p}\prec\mathrm{diag}\{\bm{s}\}\prec 2\hat{\bm{\Sigma}}$ accordingly. With this new notation, $\bm{L}$ is the Cholesky decomposition such that $\bm{L}^{\top}\bm{L}=2\,\mathrm{diag}\{\bm{s}\}-\mathrm{diag}\{\bm{s}\}\hat{\bm{\Sigma}}^{-1}\mathrm{diag}\{\bm{s}\}$. Let $\bm{C}\in\mathbb{R}^{(n-1)\times n}$ be a matrix with orthonormal rows that are also orthogonal to $\bm{1}_{n}$. Then $\bm{C}^{\top}\bm{C}=\bm{I}_{n}-\bm{1}_{n}\bm{1}_{n}^{\top}/n$ is the centering matrix, $\bm{C}^{\top}\bm{C}\bm{X}=\bm{X}-\bm{1}_{n}\hat{\bm{\mu}}^{\top}$ and $(\bm{C}\bm{X})^{\top}\bm{C}\bm{X}=\hat{\bm{\Sigma}}$; note $\bm{C}$ is just a constant, nonrandom matrix. The statistic being conditioned on in this proof is $T(\bm{X})=(\bm{X}^{\top}\bm{1}_{n}/n,(\bm{C}\bm{X})^{\top}\bm{C}\bm{X})=(\hat{\bm{\mu}},\hat{\bm{\Sigma}})$. For any positive integers $s$ and $t$ such that $s\geq t$, denote by $\mathcal{O}_{s}$ the group of $s\times s$ orthogonal matrices (also known as the orthogonal group) and denote by $\mathcal{F}_{s,t}$ the set of $s\times t$ real matrices whose columns form an orthonormal set in $\mathbb{R}^{s}$ (also known as the Stiefel manifold).
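For readers who want the intermediate steps, the two identities involving $\bm{C}$ can be verified as follows. This short derivation is a sketch added for convenience; it uses only the definitions above together with $\bm{C}\bm{C}^{\top}=\bm{I}_{n-1}$, which makes $\bm{C}^{\top}\bm{C}$ symmetric and idempotent.

```latex
\begin{align*}
\bm{C}^{\top}\bm{C}\bm{X}
  &= \left(\bm{I}_{n}-\bm{1}_{n}\bm{1}_{n}^{\top}/n\right)\bm{X}
   = \bm{X}-\bm{1}_{n}\left(\bm{1}_{n}^{\top}\bm{X}/n\right)
   = \bm{X}-\bm{1}_{n}\hat{\bm{\mu}}^{\top},\\
(\bm{C}\bm{X})^{\top}\bm{C}\bm{X}
  &= \bm{X}^{\top}(\bm{C}^{\top}\bm{C})\bm{X}
   = \bm{X}^{\top}(\bm{C}^{\top}\bm{C})^{\top}(\bm{C}^{\top}\bm{C})\bm{X}
   = (\bm{X}-\bm{1}_{n}\hat{\bm{\mu}}^{\top})^{\top}(\bm{X}-\bm{1}_{n}\hat{\bm{\mu}}^{\top})
   = \hat{\bm{\Sigma}}.
\end{align*}
```

The second equality in the second line is where idempotence, $(\bm{C}^{\top}\bm{C})^{\top}(\bm{C}^{\top}\bm{C})=\bm{C}^{\top}(\bm{C}\bm{C}^{\top})\bm{C}=\bm{C}^{\top}\bm{C}$, is used.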

We will use techniques from topological measure theory to prove Theorem 3.1, specifically on invariant measures (see, e.g., Schneider and Weil (2008, Chapter 13) and Fremlin (2003, Chapter 44)). For readers unfamiliar with the field, the following is a short list of definitions we will use:

•

A group $\mathcal{G}$ is a topological group if it has a topology such that the functions of multiplication and inversion, i.e., $(x,y)\mapsto xy$ and $x\mapsto x^{-1}$, are continuous. (A function between two topological spaces is continuous if the inverse image of any open set is an open set.)

•

An operation of a group $\mathcal{G}$ on a nonempty set $\mathcal{M}$ is a function $\psi:\mathcal{G}\times\mathcal{M}\mapsto\mathcal{M}$ satisfying $\psi(g,\psi(g^{\prime},x))=\psi(gg^{\prime},x)$ and $\psi(e,x)=x$ . The operation $\psi(g,x)$ is also written as $gx$ when there is no risk of confusion. For any subset $\mathcal{B}\subseteq\mathcal{M}$ and $g\in\mathcal{G}$ , denote by $g\mathcal{B}$ the image under the operation with $g$ , i.e., $g\mathcal{B}=\{\psi(g,x):x\in\mathcal{B}\}$ .

•

An operation $\psi$ is transitive if for any $x,y\in\mathcal{M}$ there exists $g\in\mathcal{G}$ such that $\psi(g,x)=y$ .

•

Suppose $\mathcal{M}$ is a topological space and $\mathcal{G}$ is a topological group. The operation $\psi$ is continuous if $\psi$, as a function of two arguments, is continuous.

•

Suppose $\mathcal{M}$ is a locally compact metric space. A Borel measure $\rho$ on $\mathcal{M}$ is called $\mathcal{G}$ -invariant if for any $g\in\mathcal{G}$ and Borel subset $\mathcal{B}\subseteq\mathcal{M}$ , it holds that $\rho(\mathcal{B})=\rho(g\mathcal{B})$ .

We can now define the elements of the proof. Suppose $\bm{S}\in\mathbb{R}^{p\times p}$ is a positive definite matrix and $\bm{m}\in\mathbb{R}^{p}$ . Define a metric space

[TABLE]

equipped with the Euclidean metric in the vectorized space, stacked column-wise. By Equation (3.2), it is straightforward to check that if $(\hat{\bm{\mu}},\hat{\bm{\Sigma}})=(\bm{m},\bm{S})$ then $[\bm{X},\tilde{\bm{X}}]\in\mathcal{M}$ .

Define $\mathcal{G}=\{\bm{G}\in\mathcal{O}_{n}:\bm{G}\bm{1}_{n}=\bm{1}_{n}\}$ . It is easy to check that $\mathcal{G}$ is a group whose identity element is $\bm{I}_{n}$ . $\mathcal{G}$ is also a topological group with the induced metric of $\mathbb{R}^{n\times n}$ because matrix inversion and multiplication are continuous.

Define a mapping $\psi:\mathcal{G}\times\mathcal{M}\mapsto\mathcal{M}$ by $\psi(\bm{G},[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}])=[\bm{G}\bm{\mathsf{X}},\bm{G}\tilde{\bm{\mathsf{X}}}]$ . Note that

[TABLE]

thus for any $[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}]\in\mathcal{M}$ , we have $\psi(\bm{G},[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}])\in\mathcal{M}$ . It is also seen that $\psi(\bm{G}_{1}\bm{G}_{2},[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}])=\psi(\bm{G}_{1},\psi(\bm{G}_{2},[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}]))$ and $\psi(\bm{I}_{n},[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}])=[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}]$ , so $\psi$ is an operation of $\mathcal{G}$ on $\mathcal{M}$ . By the continuity of matrix multiplication, $\psi$ is a continuous operation.
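As a small sanity check on the claim that the operation preserves $(\hat{\bm{\mu}},\hat{\bm{\Sigma}})$, one can take $\bm{G}$ to be a permutation matrix, which is orthogonal and fixes $\bm{1}_{n}$ and therefore belongs to $\mathcal{G}$. The sketch below is illustrative only: permutations are just one convenient subfamily of $\mathcal{G}$, the function name is hypothetical, and exact rational arithmetic is used so that the comparison does not depend on floating-point rounding.

```python
from fractions import Fraction


def sufficient_statistic(X):
    """Return (mu_hat, S) with S = (X - 1 mu^T)^T (X - 1 mu^T), computed exactly."""
    n, p = len(X), len(X[0])
    mu = [sum(Fraction(row[j]) for row in X) / n for j in range(p)]
    S = [[sum((row[j] - mu[j]) * (row[k] - mu[k]) for row in X)
          for k in range(p)] for j in range(p)]
    return mu, S


X = [[1, 4], [2, 0], [5, 3], [0, 7]]
perm = [2, 0, 3, 1]            # G is the permutation matrix with (GX)_i = X_{perm[i]}
GX = [X[i] for i in perm]
```

Reordering the rows changes the data matrix but leaves both the column means and the centered Gram matrix unchanged.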

We can now state the three lemmas which comprise the proof.

Lemma A.1 (Invariance).

The probability measure of $[\bm{X},\tilde{\bm{X}}]$ conditional on $\hat{\bm{\mu}}=\bm{m}$ and $\hat{\bm{\Sigma}}=\bm{S}$ is $\mathcal{G}$ -invariant on $\mathcal{M}$ .

Lemma A.2 (Invariance after swapping).

The probability measure of $[\bm{X},\tilde{\bm{X}}]_{\text{\emph{swap}}(j)}$ conditional on $\hat{\bm{\mu}}=\bm{m}$ and $\hat{\bm{\Sigma}}=\bm{S}$ is $\mathcal{G}$ -invariant on $\mathcal{M}$ .

Lemma A.3 (Uniqueness).

The $\mathcal{G}$ -invariant probability measure on $\mathcal{M}$ is unique.

Combining Lemmas A.1, A.2 and A.3, we conclude that given $\hat{\bm{\mu}}=\bm{m}$ and $\hat{\bm{\Sigma}}=\bm{S}$, swapping $\bm{X}_{j}$ and $\tilde{\bm{X}}_{j}$ leaves the distribution of $[\bm{X},\tilde{\bm{X}}]$ unchanged. Since swapping any single column does not change the distribution, by induction swapping any set of columns does not change it either, which completes the proof.

Remark 4.

Although not shown here, one can define the uniform distribution on $\mathcal{M}$ via the Hausdorff measure and show that it is also $\mathcal{G}$ -invariant. Therefore, by the uniqueness of the invariant measure, $[\bm{X},\tilde{\bm{X}}]$ is distributed uniformly on $\mathcal{M}$ .

A.2.3 Proofs of Lemmas

Before proving the lemmas, we introduce some notation and properties for Gaussian matrices. Let $r$, $s$, and $t$ be any positive integers. For any matrix $\bm{A}\in\mathbb{R}^{s\times t}$, denote by $\operatorname{vec}(\bm{A})$ the vector that concatenates its columns, i.e., $(\bm{A}_{1}^{\top},\dots,\bm{A}_{t}^{\top})^{\top}$. Denote by $\otimes$ the Kronecker product. An $s\times t$ random matrix $\bm{A}$ is a Gaussian random matrix $\bm{A}\sim\mathcal{N}_{s,t}(\bm{M},\bm{\Upsilon}\otimes\bm{\Sigma})$ if $\operatorname{vec}(\bm{A}^{\top})\sim\mathcal{N}(\operatorname{vec}(\bm{M}^{\top}),\bm{\Upsilon}\otimes\bm{\Sigma})$ for some $\bm{M}\in\mathbb{R}^{s\times t}$ and matrices $\bm{\Upsilon}\succeq\bm{0}_{s\times s}$ and $\bm{\Sigma}\succeq\bm{0}_{t\times t}$.

If $\bm{A}\sim\mathcal{N}_{s,t}(\bm{M},\bm{\Upsilon}\otimes\bm{\Sigma})$ , then for any matrix $\bm{\Gamma}\in\mathbb{R}^{r\times s}$ , $\operatorname{vec}((\bm{\Gamma}\bm{A})^{\top})=(\bm{\Gamma}\otimes\bm{I}_{t})\operatorname{vec}(\bm{A}^{\top})$ is still multivariate Gaussian and

[TABLE]

because $(\bm{\Gamma}\otimes\bm{I}_{t})(\bm{\Upsilon}\otimes\bm{\Sigma})(\bm{\Gamma}\otimes\bm{I}_{t})^{\top}=(\bm{\Gamma}\bm{\Upsilon}\bm{\Gamma}^{\top})\otimes(\bm{I}_{t}\bm{\Sigma}\bm{I}_{t})$ by the mixed-product property and transpose of Kronecker product. When the rows of $\bm{A}$ are i.i.d. samples from a multivariate Gaussian, $\bm{\Upsilon}=\bm{I}_{s}$ and $\bm{M}=\bm{1}_{s}\bm{\mu}^{\top}$ for some $\bm{\mu}\in\mathbb{R}^{t}$ . If further, $\bm{\Gamma}\bm{\Gamma}^{\top}=\bm{I}_{r}$ , then

[TABLE]
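As a quick numerical sanity check of the Kronecker-product algebra used above, the following sketch verifies the vectorization identity and the mixed-product step on random matrices. The dimensions, the seed, and the helper `random_psd` are illustrative choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
r, s, t = 2, 4, 3  # illustrative dimensions


def random_psd(k):
    """Return a random k x k positive semidefinite matrix (hypothetical helper)."""
    B = rng.standard_normal((k, k))
    return B @ B.T


Upsilon = random_psd(s)
Sigma = random_psd(t)
Gamma = rng.standard_normal((r, s))
A = rng.standard_normal((s, t))

# vec(A^T) stacks the rows of A, which is a row-major flatten in NumPy.
K = np.kron(Gamma, np.eye(t))
vec_identity_holds = np.allclose((Gamma @ A).reshape(-1), K @ A.reshape(-1))

# Mixed-product property: (Gamma x I)(Upsilon x Sigma)(Gamma x I)^T
# equals (Gamma Upsilon Gamma^T) x Sigma.
lhs = K @ np.kron(Upsilon, Sigma) @ K.T
rhs = np.kron(Gamma @ Upsilon @ Gamma.T, Sigma)
mixed_product_holds = np.allclose(lhs, rhs)

# With orthonormal rows (Gamma Gamma^T = I_r) and Upsilon = I_s,
# the row covariance of Gamma A reduces to I_r.
Q, _ = np.linalg.qr(rng.standard_normal((s, r)))
Gamma_orth = Q.T
row_cov_is_identity = np.allclose(Gamma_orth @ np.eye(s) @ Gamma_orth.T, np.eye(r))
```

The check is deterministic linear algebra, so it does not depend on the Gaussian assumption itself; it only confirms the covariance bookkeeping.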

We write the Gram–Schmidt orthonormalization as a function $\Psi(\cdot)$ . We will make use of the property that for any $\bm{\Gamma}_{0}\in\mathcal{O}_{s}$ and any matrix $\bm{U}_{0}\in\mathbb{R}^{s\times t}$ (for $s\geq t$ ), it holds that

[TABLE]

See, e.g., Eaton (1983, Proposition 7.2).
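The displayed property is not reproduced in this extraction. A standard property of Gram–Schmidt that fits the description is equivariance under orthogonal transformations, $\Psi(\bm{\Gamma}_{0}\bm{U}_{0})=\bm{\Gamma}_{0}\Psi(\bm{U}_{0})$ ; assuming this is the property meant, the sketch below checks it numerically. The function `gram_schmidt` is a plain classical implementation written for illustration.

```python
import numpy as np


def gram_schmidt(U):
    """Classical Gram-Schmidt on the columns of U (assumed linearly independent)."""
    Q = np.zeros_like(U, dtype=float)
    for k in range(U.shape[1]):
        # Remove the components along the previously orthonormalized columns.
        v = U[:, k] - Q[:, :k] @ (Q[:, :k].T @ U[:, k])
        Q[:, k] = v / np.linalg.norm(v)
    return Q


rng = np.random.default_rng(1)
s, t = 6, 3  # illustrative dimensions with s >= t
U0 = rng.standard_normal((s, t))
Gamma0, _ = np.linalg.qr(rng.standard_normal((s, s)))  # an orthogonal s x s matrix

columns_orthonormal = np.allclose(gram_schmidt(U0).T @ gram_schmidt(U0), np.eye(t))
equivariant = np.allclose(gram_schmidt(Gamma0 @ U0), Gamma0 @ gram_schmidt(U0))
```

The equivariance holds because orthogonal maps preserve inner products, so every projection and normalization in the recursion is carried along by $\bm{\Gamma}_{0}$ .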

Proof of Lemma A.1.

Define $\nu(\mathcal{B})\,:=\,\mathbb{P}\left([\bm{X},\tilde{\bm{X}}]\in\mathcal{B}\,\middle|\,\hat{\bm{\mu}}=\bm{m},\hat{\bm{\Sigma}}=\bm{S}\right)$ for any Borel subset $\mathcal{B}\subseteq\mathcal{M}$ . For fixed $\bm{G}\in\mathcal{G}$ , we need to show the group operation given $\bm{G}$ , i.e., $g_{\bm{G}}=\psi(\bm{G},\cdot)$ , leaves $\nu$ unchanged. Define $\bm{X}^{\prime}=\bm{G}\bm{X}$ and $\tilde{\bm{X}}^{\prime}=\bm{G}\tilde{\bm{X}}$ . We will show

[TABLE]

By Equation (A.2) and $\bm{G}\bm{1}_{n}=\bm{1}_{n}$ , we have $T(\bm{G}\bm{\mathsf{X}})=T(\bm{\mathsf{X}})$ for any $\bm{\mathsf{X}}\in\mathbb{R}^{n\times p}$ . Applying the property in Equation (A.3), we have

[TABLE]

where we have used $\bm{G}\bm{1}_{n}=\bm{1}_{n}$ and $\bm{G}\in\mathcal{O}_{n}$ . Thus $\bm{X}^{\prime}\,{\buildrel\mathcal{D}\over{=}}\,\bm{X}$ . By Equation (A.4) and the definition of $[\bm{Q},\bm{U}]$ in Algorithm 1,

[TABLE]

Let $\bm{U}^{\prime}=\bm{G}\bm{U}$ . Since $\bm{W}$ is independent of $\bm{X}$ and $\bm{G}\bm{W}$ has the same distribution as $\bm{W}$ , we have $(\bm{X},\bm{W})\,{\buildrel\mathcal{D}\over{=}}\,(\bm{X}^{\prime},\bm{G}\bm{W})$ . This together with Equation (A.5) implies $(\bm{X},\bm{U})\,{\buildrel\mathcal{D}\over{=}}\,(\bm{X}^{\prime},\bm{U}^{\prime})$ . Hence

[TABLE]

and since $T(\bm{X}^{\prime})=T(\bm{X})$ , we conclude

[TABLE]

Now recall we are conditioning on $T(\bm{X})=(\bm{m},\bm{S})$ , and thus also $\bm{s}$ and $\bm{L}$ . By Equation (3.2) and the definition of $\tilde{\bm{X}}^{\prime}$ ,

[TABLE]

which would be the knockoff generated by Algorithm 1 if $\bm{X}^{\prime}$ were observed. As a consequence,

[TABLE]

This shows that for any Borel subset $\mathcal{B}\subseteq\mathcal{M}$ , $\nu(\mathcal{B})=\;\nu(g_{\bm{G}^{-1}}(\mathcal{B}))$ . We conclude that for any $\bm{G}\in\mathcal{G}$ and any Borel subset $\mathcal{B}\subseteq\mathcal{M}$

[TABLE]

that is, the conditional probability measure of $[\bm{X},\tilde{\bm{X}}]$ given $T(\bm{X})=(\bm{m},\bm{S})$ is $\mathcal{G}$ -invariant. ∎

Proof of Lemma A.2.

Without loss of generality, we take $j=1$ . Define a mapping $\phi:\mathbb{R}^{n\times(2p)}\mapsto\mathbb{R}^{n\times(2p)}$ by $\phi([\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}])=[[\tilde{\bm{\mathsf{X}}}_{1},\,\bm{\mathsf{X}}_{\text{-}1}],[\bm{\mathsf{X}}_{1},\,\tilde{\bm{\mathsf{X}}}_{\text{-}1}]]$ , i.e., replacing $\bm{\mathsf{X}}$ and $\tilde{\bm{\mathsf{X}}}$ with $[\tilde{\bm{\mathsf{X}}}_{1},\,\bm{\mathsf{X}}_{\text{-}1}]$ and $[\bm{\mathsf{X}}_{1},\,\tilde{\bm{\mathsf{X}}}_{\text{-}1}]$ , respectively. It is easy to see that $\phi$ is isometric and $\phi^{-1}=\phi$ . Furthermore, we will prove that $\phi$ is a bijective mapping of $\mathcal{M}$ to itself (Lemma A.4). The conditional distribution of $\phi([\bm{X},\tilde{\bm{X}}])$ is the measure $\nu_{\phi}$ on $\mathcal{M}$ such that $\nu_{\phi}(\mathcal{B})=\nu(\phi^{-1}(\mathcal{B}))$ , for any Borel subset $\mathcal{B}\subseteq\mathcal{M}$ . We will show that $\nu_{\phi}$ is $\mathcal{G}$ -invariant on $\mathcal{M}$ (Lemma A.5).

Lemma A.4.

$\phi$ is a bijective mapping of $\mathcal{M}$ to itself.

Proof.

$\phi$ is easily seen to be injective, and to show surjectivity, we will first show $\phi(\mathcal{M})\subseteq\mathcal{M}$ . Combining this with $\phi^{-1}=\phi$ gives $\mathcal{M}\subseteq\phi^{-1}(\mathcal{M})=\phi(\mathcal{M})$ , and thus $\phi(\mathcal{M})=\mathcal{M}$ so $\phi$ is surjective from $\mathcal{M}$ to $\mathcal{M}$ . We now complete the proof by showing something even stronger than $\phi(\mathcal{M})\subseteq\mathcal{M}$ , namely the equivalence $\phi([\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}])\in\mathcal{M}\iff[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}]\in\mathcal{M}$ .

Translating this equivalence to an equality of indicator functions, we need to show that

[TABLE]

where the righthand side is the same as the lefthand side but with $\bm{\mathsf{X}}$ and $\tilde{\bm{\mathsf{X}}}$ replaced with $[\tilde{\bm{\mathsf{X}}}_{1},\,\bm{\mathsf{X}}_{\text{-}1}]$ and $[\bm{\mathsf{X}}_{1},\,\tilde{\bm{\mathsf{X}}}_{\text{-}1}]$ , respectively. First note that for the first and third indicator functions on the lefthand side,

[TABLE]

and exchanging the first term in each product and compressing the products each back into single indicator functions gives $\mathbf{1}_{\left\{[\tilde{\bm{\mathsf{X}}}_{1},\,\bm{\mathsf{X}}_{\text{-}1}]^{\top}\bm{1}_{n}/n=\bm{m}\right\}}\,\mathbf{1}_{\left\{[\bm{\mathsf{X}}_{1},\,\tilde{\bm{\mathsf{X}}}_{\text{-}1}]^{\top}\bm{1}_{n}/n=\bm{m}\right\}}$ , so it just remains to show that

[TABLE]

Again it is useful to rewrite the three indicator functions as products:

[TABLE]

Now if we exchange the terms in the first product with $k>j=1$ with the same terms in the third product, and exchange the terms in the second product with $k>j=1$ with the terms in the third product with $j>k=1$ , we can compress the products each back into single indicator functions again to get $\mathbf{1}_{\left\{(\bm{C}[\tilde{\bm{\mathsf{X}}}_{1},\,\bm{\mathsf{X}}_{\text{-}1}])^{\top}\bm{C}[\tilde{\bm{\mathsf{X}}}_{1},\,\bm{\mathsf{X}}_{\text{-}1}]=\bm{S}\right\}}\,\mathbf{1}_{\left\{(\bm{C}[\bm{\mathsf{X}}_{1},\,\tilde{\bm{\mathsf{X}}}_{\text{-}1}])^{\top}\bm{C}[\bm{\mathsf{X}}_{1},\,\tilde{\bm{\mathsf{X}}}_{\text{-}1}]=\bm{S}\right\}}\,\mathbf{1}_{\left\{(\bm{C}[\bm{\mathsf{X}}_{1},\,\tilde{\bm{\mathsf{X}}}_{\text{-}1}])^{\top}\bm{C}[\tilde{\bm{\mathsf{X}}}_{1},\,\bm{\mathsf{X}}_{\text{-}1}]=\bm{S}-\text{diag}\{\bm{s}\}\right\}}$ . We conclude that $[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}]\in\mathcal{M}\iff\phi([\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}])\in\mathcal{M}$ . ∎

Lemma A.5.

$\nu_{\phi}$ is $\mathcal{G}$ -invariant on $\mathcal{M}$ .

Proof.

For any $\bm{G}\in\mathcal{G}$ , the group operation $g_{\bm{G}}=\psi(\bm{G},\cdot)$ is exchangeable with $\phi$ because

[TABLE]

Thus for any Borel subset $\mathcal{B}\subseteq\mathcal{M}$ ,

[TABLE]

where the third equality follows from Lemma A.1. Thus we conclude that $\nu_{\phi}$ is $\mathcal{G}$ -invariant. ∎

∎

Proof of Lemma A.3.

Before the proof, we list a few results that will be used.

Fact 1.

For an operation $\psi$ of a group $\mathcal{G}$ on a space $\mathcal{M}$ , if there is some $z\in\mathcal{M}$ such that for any $y\in\mathcal{M}$ there exists $g_{y}\in\mathcal{G}$ such that $\psi(g_{y},z)=y$ , then $\psi$ is transitive. This is because for any $x,y\in\mathcal{M}$ , $\psi(g_{x}^{-1},x)=\psi(g_{x}^{-1},\psi(g_{x},z))=\psi(g_{x}^{-1}g_{x},z)=z$ and $\psi(g_{y}g_{x}^{-1},x)=\psi(g_{y},\psi(g_{x}^{-1},x))=\psi(g_{y},z)=y$ .

Fact 2.

For any compact Hausdorff topological group $\mathcal{G}$ (a topological space is Hausdorff if every two different points can be separated by two disjoint open sets), there exists a finite Borel measure $\nu$ , called a Haar measure, such that for any $g\in\mathcal{G}$ and Borel subset $\mathcal{B}\subseteq\mathcal{G}$ , $\nu(\mathcal{B})=\nu(g\mathcal{B})=\nu(\mathcal{B}g)$ (Fremlin, 2003, 441E, 442I(c)). As an example, the orthogonal group $\mathcal{O}_{n}$ has a Haar measure (Eaton, 1983, Chapter 6.2).

The key theorem we use is the following.

Lemma A.6 (Theorem 13.1.5 in Schneider and Weil (2008)).

Suppose that the compact group $\mathcal{G}$ operates continuously and transitively on the Hausdorff space $\mathcal{M}$ and that $\mathcal{G}$ and $\mathcal{M}$ have countable bases. Let $\nu$ be a Haar measure on $\mathcal{G}$ with $\nu(\mathcal{G})=1$ . Then there exists a unique $\mathcal{G}$ -invariant Borel measure $\rho$ on $\mathcal{M}$ with $\rho(\mathcal{M})=1$ .

Now we are ready to prove Lemma A.3. Note $\mathcal{G}$ and $\mathcal{M}$ are compact subspaces of the vectorized spaces, and Fact 2 ensures the existence of a Haar measure on $\mathcal{G}$ . Since $\psi$ is continuous, as long as $\psi$ is transitive we can apply Lemma A.6 and conclude that the $\mathcal{G}$ -invariant probability measure on $\mathcal{M}$ is unique.

To show $\psi$ is transitive by Fact 1, we first fix a point $[\bm{\mathsf{X}}_{0},\tilde{\bm{\mathsf{X}}}_{0}]$ and then show for any $[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}]\in\mathcal{M}$ , we can find $\bm{G}\in\mathcal{G}$ such that $\psi(\bm{G},[\bm{\mathsf{X}}_{0},\tilde{\bm{\mathsf{X}}}_{0}])=[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}]$ .

Part 1. We begin by representing $\tilde{\bm{\mathsf{X}}}$ using the Stiefel manifold. Define $\mathcal{M}_{1}=\{\bm{\mathsf{X}}\in\mathbb{R}^{n\times p}:\,\bm{\mathsf{X}}^{\top}\bm{1}_{n}/n=\bm{m},\;(\bm{C}\bm{\mathsf{X}})^{\top}\bm{C}\bm{\mathsf{X}}=\bm{S}\}=\{\bm{\mathsf{X}}\in\mathbb{R}^{n\times p}:\,T(\bm{\mathsf{X}})=(\bm{m},\bm{S})\}$ . For any $\bm{\mathsf{X}}\in\mathcal{M}_{1}$ , define

[TABLE]

Let $\bm{Z}_{\bm{\mathsf{X}}}$ be an $n\times(n-1-p)$ matrix whose columns form an orthonormal basis for the orthogonal complement of $\text{span}(\left[\bm{1}_{n},\,\bm{\mathsf{X}}\right])$ . Recall that $\mathcal{F}_{n-1-p,p}$ is the set of $(n-1-p)\times p$ real matrices whose columns form an orthonormal set in $\mathbb{R}^{n-1-p}$ . Define $\varphi_{\bm{\mathsf{X}}}:\mathcal{F}_{n-1-p,p}\mapsto\mathbb{R}^{n\times p}$ by

[TABLE]

The following result tells us that for any $[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}]\in\mathcal{M}$ , there exists a $\bm{\mathsf{V}}\in\mathcal{F}_{n-1-p,p}$ such that $\tilde{\bm{\mathsf{X}}}=\varphi_{\bm{\mathsf{X}}}(\bm{\mathsf{V}})$ , and thus we are implicitly decomposing $\bm{U}$ from Algorithm 1 into $\bm{Z}_{\bm{X}}\bm{V}$ for some random $\bm{V}$ , and we think of $\bm{\mathsf{V}}$ as a realization of this $\bm{V}$ .

Lemma A.7.

$\varphi_{\bm{\mathsf{X}}}$ is a bijective mapping from $\mathcal{F}_{n-1-p,p}$ to $\mathcal{M}_{\bm{\mathsf{X}}}$ .

The proof of Lemma A.7 involves mainly linear algebra and is deferred to the end of this section.

Part 2. We now define $[\bm{\mathsf{X}}_{0},\tilde{\bm{\mathsf{X}}}_{0}]$ . Let the eigenvalue decomposition of $\bm{S}$ be $\bm{G}_{0}\bm{D}^{2}\bm{G}_{0}^{\top}$ , where $\bm{D}$ is a $p\times p$ diagonal matrix with positive non-increasing diagonal entries and $\bm{G}_{0}\in\mathcal{O}_{p}$ . Define an $(n-1)\times p$ matrix $\bm{\mathsf{X}}_{*}$ and an $(n-1)\times(n-1-p)$ matrix $\bm{Z}_{*}$ as

[TABLE]

Then $\bm{Z}_{*}^{\top}\bm{Z}_{*}=\bm{I}_{n-1-p}$ , $\bm{\mathsf{X}}_{*}^{\top}\bm{\mathsf{X}}_{*}=\bm{S}$ and $\bm{\mathsf{X}}_{*}^{\top}\bm{Z}_{*}=\bm{0}$ . Next define

[TABLE]

One can check that $[\bm{\mathsf{X}}_{0},\tilde{\bm{\mathsf{X}}}_{0}]\in\mathcal{M}$ .

Part 3. Now for any $[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}]\in\mathcal{M}$ , we will find a $\bm{G}\in\mathcal{G}$ such that $\psi(\bm{G},[\bm{\mathsf{X}}_{0},\tilde{\bm{\mathsf{X}}}_{0}])=[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}]$ .

Let $\bm{Q}_{\bm{\mathsf{X}}}=\bm{C}\bm{\mathsf{X}}\bm{G}_{0}\bm{D}^{-1}$ , which is an $(n-1)\times p$ matrix. Since $(\bm{C}\bm{\mathsf{X}})^{\top}\bm{C}\bm{\mathsf{X}}=\bm{S}$ , we have $\bm{Q}_{\bm{\mathsf{X}}}^{\top}\bm{Q}_{\bm{\mathsf{X}}}=\bm{I}_{p}$ . Thus $\bm{Q}_{\bm{\mathsf{X}}}\in\mathcal{F}_{n-1,p}$ . By Lemma A.7, there is some $\bm{\mathsf{V}}\in\mathcal{F}_{n-1-p,p}$ such that $\tilde{\bm{\mathsf{X}}}=\bm{1}_{n}\bm{m}^{\top}+(\bm{\mathsf{X}}-\bm{1}_{n}\bm{m}^{\top})(\bm{I}_{p}-\bm{S}^{-1}\mathrm{diag}\{\bm{s}\})+\bm{Z}_{\bm{\mathsf{X}}}\bm{\mathsf{V}}\bm{L}$ . Let $\bm{Q}_{\tilde{\bm{\mathsf{X}}}}$ be $\bm{C}\bm{Z}_{\bm{\mathsf{X}}}\bm{\mathsf{V}}$ . We will show $\bm{Q}_{\tilde{\bm{\mathsf{X}}}}\in\mathcal{F}_{n-1,p}$ and $\bm{Q}_{\tilde{\bm{\mathsf{X}}}}^{\top}\bm{Q}_{\bm{\mathsf{X}}}=\bm{0}$ :

Because $\bm{Z}_{\bm{\mathsf{X}}}^{\top}\bm{1}_{n}=\bm{0}$ , it holds that $\bm{C}^{\top}\bm{C}\bm{Z}_{\bm{\mathsf{X}}}=\bm{Z}_{\bm{\mathsf{X}}}$ . Thus

[TABLE]

In addition, because $\bm{Z}_{\bm{\mathsf{X}}}^{\top}\bm{\mathsf{X}}=\bm{0}$ , it holds that

[TABLE]

Then we can find some $\bm{G}_{*}\in\mathcal{O}_{n-1}$ such that

[TABLE]

Define $\bm{G}=\bm{C}^{\top}\bm{G}_{*}\bm{C}+\bm{1}_{n}\bm{1}_{n}^{\top}/n$ . One can check that $\bm{G}^{\top}\bm{G}=\bm{I}_{n}$ and $\bm{G}\bm{1}_{n}=\bm{1}_{n}$ , and conclude that $\bm{G}\in\mathcal{G}$ . We next show $[\bm{G}\bm{\mathsf{X}}_{0},\bm{G}\tilde{\bm{\mathsf{X}}}_{0}]=[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}]$ .
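The two claims left to the reader here can be checked numerically. The matrix $\bm{C}$ is defined earlier in the paper and is not reproduced in this section; the sketch below assumes it is an $(n-1)\times n$ matrix with orthonormal rows that are orthogonal to $\bm{1}_{n}$ , which is consistent with the identities $\bm{C}\bm{1}_{n}=\bm{0}$ and $\bm{C}^{\top}\bm{C}\bm{Z}_{\bm{\mathsf{X}}}=\bm{Z}_{\bm{\mathsf{X}}}$ used above. The particular $\bm{C}$ and $\bm{G}_{*}$ built here are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 7  # illustrative sample size
ones = np.ones((n, 1))

# Assumed form of C: rows are an orthonormal basis of the complement of 1_n.
Qfull, _ = np.linalg.qr(np.hstack([ones, rng.standard_normal((n, n - 1))]))
C = Qfull[:, 1:].T  # shape (n-1, n)

# Any orthogonal (n-1) x (n-1) matrix plays the role of G_*.
G_star, _ = np.linalg.qr(rng.standard_normal((n - 1, n - 1)))

G = C.T @ G_star @ C + ones @ ones.T / n

C_annihilates_ones = np.allclose(C @ ones, 0.0)
G_is_orthogonal = np.allclose(G.T @ G, np.eye(n))
G_fixes_ones = np.allclose(G @ ones, ones)
```

Orthogonality follows from $\bm{C}\bm{C}^{\top}=\bm{I}_{n-1}$ and $\bm{C}^{\top}\bm{C}=\bm{I}_{n}-\bm{1}_{n}\bm{1}_{n}^{\top}/n$ , which hold for the assumed $\bm{C}$ .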

We first check

[TABLE]

Next, note that

[TABLE]

and hence it holds that

[TABLE]

Hence the operation $\psi$ is transitive, and the proof is complete. ∎

Proof of Lemma A.7.

The proof takes four steps.

Step 1: $\bm{L}$ is invertible.

Let

[TABLE]

By construction of $\bm{s}$ , $2\,\mathrm{diag}\{\bm{s}\}\succ\bm{0}_{p\times p}$ and

[TABLE]

where the lefthand side of the last line is exactly the Schur complement of $2\,\mathrm{diag}\{\bm{s}\}$ in $\bm{S}_{*}$ , and therefore $\bm{S}_{*}\succ\bm{0}_{p\times p}$ . But since $\bm{S}\succ\bm{0}_{p\times p}$ , the fact that $\bm{S}_{*}\succ\bm{0}_{p\times p}$ implies that the Schur complement of $\bm{S}$ in $\bm{S}_{*}$ is also positive definite:

[TABLE]

and therefore $\bm{L}$ is invertible.
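The Schur-complement argument of Step 1 can be illustrated numerically. The displays defining $\bm{S}_{*}$ are not reproduced in this extraction, so the sketch assumes the block form $\bm{S}_{*}=\begin{bmatrix}\bm{S}&\mathrm{diag}\{\bm{s}\}\\ \mathrm{diag}\{\bm{s}\}&2\,\mathrm{diag}\{\bm{s}\}\end{bmatrix}$ and the usual knockoff constraint $\bm{0}\prec\mathrm{diag}\{\bm{s}\}\prec 2\bm{S}$ ; both are assumptions made for illustration, as is the simple choice of $\bm{s}$ .

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4
B = rng.standard_normal((p + 3, p))
S = B.T @ B  # positive definite with probability one

# A simple feasible s: constant entries strictly below 2 * lambda_min(S).
s = np.full(p, 0.9 * 2 * np.linalg.eigvalsh(S)[0])
Ds = np.diag(s)

S_star = np.block([[S, Ds], [Ds, 2 * Ds]])  # assumed block form

# Schur complement of 2 diag{s} in S_star, which simplifies to S - diag{s} / 2.
schur_of_2Ds = S - Ds @ np.linalg.inv(2 * Ds) @ Ds
# Schur complement of S in S_star; L is taken to satisfy L^T L = this matrix.
schur_of_S = 2 * Ds - Ds @ np.linalg.solve(S, Ds)

min_eig_schur_2Ds = np.linalg.eigvalsh(schur_of_2Ds)[0]
min_eig_S_star = np.linalg.eigvalsh(S_star)[0]
min_eig_schur_S = np.linalg.eigvalsh(schur_of_S)[0]

# np.linalg.cholesky returns a lower factor, so transpose to get L^T L.
L = np.linalg.cholesky(schur_of_S).T
L_is_factor = np.allclose(L.T @ L, schur_of_S)
```

Positive definiteness of `schur_of_S` is what makes the Cholesky factor exist with a strictly positive diagonal, hence an invertible $\bm{L}$ .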

Step 2: $\varphi_{\bm{\mathsf{X}}}(\mathcal{F}_{n-1-p,p})\subseteq\mathcal{M}_{\bm{\mathsf{X}}}$ .

Let $\tilde{\bm{\mathsf{X}}}=\varphi_{\bm{\mathsf{X}}}(\bm{\mathsf{V}})$ for some $\bm{\mathsf{V}}\in\mathcal{F}_{n-1-p,p}$ . First we show $\tilde{\bm{\mathsf{X}}}^{\top}\bm{1}_{n}/n=\bm{m}$ :

[TABLE]

Next we show $(\bm{C}\tilde{\bm{\mathsf{X}}})^{\top}\bm{C}\tilde{\bm{\mathsf{X}}}=\bm{S}$ :

[TABLE]

And finally we show $(\bm{C}\tilde{\bm{\mathsf{X}}})^{\top}\bm{C}\bm{\mathsf{X}}=\bm{S}-\text{diag}\{\bm{s}\}$ :

[TABLE]

We conclude that $\tilde{\bm{\mathsf{X}}}\in\mathcal{M}_{\bm{\mathsf{X}}}$ and therefore $\varphi_{\bm{\mathsf{X}}}(\mathcal{F}_{n-1-p,p})\subseteq\mathcal{M}_{\bm{\mathsf{X}}}$ .
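The three identities of Step 2 can be confirmed on simulated data. The sketch below builds $\tilde{\bm{\mathsf{X}}}=\bm{1}_{n}\bm{m}^{\top}+(\bm{\mathsf{X}}-\bm{1}_{n}\bm{m}^{\top})(\bm{I}_{p}-\bm{S}^{-1}\mathrm{diag}\{\bm{s}\})+\bm{Z}_{\bm{\mathsf{X}}}\bm{\mathsf{V}}\bm{L}$ with $\bm{L}^{\top}\bm{L}=2\,\mathrm{diag}\{\bm{s}\}-\mathrm{diag}\{\bm{s}\}\bm{S}^{-1}\mathrm{diag}\{\bm{s}\}$ . It assumes $\bm{C}$ is an $(n-1)\times n$ matrix with orthonormal rows orthogonal to $\bm{1}_{n}$ , and it uses an arbitrary feasible $\bm{s}$ and an arbitrary $\bm{\mathsf{V}}$ ; these choices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 12, 3  # needs n - 1 - p >= p
X = rng.standard_normal((n, p)) + 1.0
ones = np.ones((n, 1))

# Assumed C: orthonormal rows spanning the orthogonal complement of 1_n.
Qfull, _ = np.linalg.qr(np.hstack([ones, rng.standard_normal((n, n - 1))]))
C = Qfull[:, 1:].T

m = X.mean(axis=0)
S = (C @ X).T @ (C @ X)
s = np.full(p, 0.9 * 2 * np.linalg.eigvalsh(S)[0])  # illustrative feasible s
Ds = np.diag(s)
L = np.linalg.cholesky(2 * Ds - Ds @ np.linalg.solve(S, Ds)).T

# Z_X: orthonormal basis of the complement of span([1_n, X]).
Qx, _ = np.linalg.qr(np.hstack([ones, X, rng.standard_normal((n, n - 1 - p))]))
Z = Qx[:, 1 + p:]

# V: an arbitrary element of F_{n-1-p, p} (orthonormal columns).
V, _ = np.linalg.qr(rng.standard_normal((n - 1 - p, p)))

Xc = X - ones @ m[None, :]
X_tilde = ones @ m[None, :] + Xc @ (np.eye(p) - np.linalg.solve(S, Ds)) + Z @ V @ L

mean_matches = np.allclose(X_tilde.T @ ones[:, 0] / n, m)
gram_matches = np.allclose((C @ X_tilde).T @ (C @ X_tilde), S)
cross_matches = np.allclose((C @ X_tilde).T @ (C @ X), S - Ds)
```

Any other orthonormal $\bm{\mathsf{V}}$ gives a different $\tilde{\bm{\mathsf{X}}}$ satisfying the same three constraints, which is the content of the inclusion shown in this step.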

Step 3: $\varphi_{\bm{\mathsf{X}}}$ is injective.

Since $\bm{Z}_{\bm{\mathsf{X}}}^{\top}\left[\bm{1}_{n},\,\bm{\mathsf{X}}\right]=\bm{0}$ and $\bm{L}$ is invertible, $\bm{Z}_{\bm{\mathsf{X}}}^{\top}\varphi_{\bm{\mathsf{X}}}(\bm{\mathsf{V}})\bm{L}^{-1}=\bm{\mathsf{V}}$ . Thus $\varphi_{\bm{\mathsf{X}}}$ is injective.

Step 4: $\varphi_{\bm{\mathsf{X}}}$ is surjective.

Let $\tilde{\bm{\mathsf{X}}}\in\mathcal{M}_{\bm{\mathsf{X}}}$ . By the definition of $\bm{Z}_{\bm{\mathsf{X}}}$ , the columns of $\left[\bm{1}_{n},\,(\bm{\mathsf{X}}-\bm{1}_{n}\bm{m}^{\top}),\bm{Z}_{\bm{\mathsf{X}}}\right]$ form a basis of $\mathbb{R}^{n}$ . Hence we can uniquely define $\bm{\alpha}^{\top}\in\mathbb{R}^{1\times p}$ , $\bm{\Lambda}\in\mathbb{R}^{p\times p}$ and $\bm{\Theta}\in\mathbb{R}^{(n-1-p)\times p}$ such that

[TABLE]

First, $\bm{m}=\tilde{\bm{\mathsf{X}}}^{\top}\bm{1}_{n}/n=\bm{\alpha}$ because $\left[(\bm{\mathsf{X}}-\bm{1}_{n}\bm{m}^{\top}),\bm{Z}_{\bm{\mathsf{X}}}\right]^{\top}\bm{1}_{n}=\bm{0}_{(n-1)\times 1}$ .

Next we show $\bm{\Lambda}=\bm{I}_{p}-\bm{S}^{-1}\mathrm{diag}\{\bm{s}\}$ :

[TABLE]

And finally, we show $\bm{\Theta}=\bm{\mathsf{V}}\bm{L}$ for some $\bm{\mathsf{V}}\in\mathcal{F}_{n-1-p,p}$ . Using Equation (A.8),

[TABLE]

where again the second equality uses $\bm{C}\bm{1}_{n}=\bm{0}$ and $\bm{C}^{\top}\bm{C}\bm{Z}_{\bm{\mathsf{X}}}=\bm{Z}_{\bm{\mathsf{X}}}$ , the third equality uses $\bm{\Lambda}=\bm{I}_{p}-\bm{S}^{-1}\mathrm{diag}\{\bm{s}\}$ and $\bm{Z}_{\bm{\mathsf{X}}}^{\top}\bm{Z}_{\bm{\mathsf{X}}}=\bm{I}_{n-1-p}$ , and the last equality follows from the invertibility of $\bm{L}$ . Define $\bm{\mathsf{V}}\,:=\,\bm{\Theta}\bm{L}^{-1}$ , then the last equality implies $\bm{\mathsf{V}}\in\mathcal{F}_{n-1-p,p}$ . We conclude that $\tilde{\bm{\mathsf{X}}}=\varphi_{\bm{\mathsf{X}}}(\bm{\mathsf{V}})$ . ∎

A.2.4 An Intuitive Proof That Does Not Quite Work

The astute reader may think there is a more straightforward way than the previous subsection to prove Theorem 3.1 using the fact that all the randomness in the conditional knockoffs construction of Algorithm 1 comes from $[\bm{U},\tilde{\bm{U}}]$ which follows the Haar measure on $\mathcal{F}_{n,2p}$ , and this Haar measure has many known properties including swap-invariance. We show here why we were not able to follow this route, and resorted instead to a more technical proof using topological measure theory.

For simplicity, consider the special case where the mean vector is known to be zero, i.e., the rows of $\bm{X}$ are i.i.d. $N(\bm{0},\bm{\Sigma})$ , or equivalently $\bm{X}\sim\mathcal{N}_{n,p}(\bm{0},\bm{I}_{n}\otimes\bm{\Sigma})$ . Let $\bm{X}=\bm{U}\bm{D}\bm{V}^{\top}$ be the singular value decomposition of $\bm{X}$ where $\bm{U}\in\mathbb{R}^{n\times p},\bm{D}\in\mathbb{R}^{p\times p},\bm{V}\in\mathbb{R}^{p\times p}$ . It is not hard to see that $\bm{U}$ is uniformly distributed on $\mathcal{F}_{n,p}$ and is independent of $\bm{D}\bm{V}^{\top}$ . This claim implicitly uses the existence of a Haar measure on $\mathcal{F}_{n,p}$ , but this is well-known (we denote this measure by $\operatorname{Unif}\left(\mathcal{F}_{n,p}\right)$ ). Conditioning on $\bm{X}^{\top}\bm{X}=\bm{V}\bm{D}^{2}\bm{V}^{\top}=\hat{\bm{\Sigma}}$ ,

[TABLE]

Thus in principle, it would be sufficient to construct $\tilde{\bm{X}}$ such that

[TABLE]

which simply requires generating the left singular vectors of $\operatorname{Unif}\left(\mathcal{F}_{n,2p}\right)$ conditioned on $\bm{U}$ being the first $p$ columns. This can be easily achieved by stacking $\bm{W}$ on the right of $\bm{X}$ and calculating the left singular vectors of $[\bm{X},\bm{W}]$ , which is exactly what is done in Algorithm 1.

To prove the validity of this construction, we just need to check that the right hand side of Equation (A.9) is swap-invariant. Indeed, $\operatorname{Unif}\left(\mathcal{F}_{n,2p}\right)$ is easily shown to be swap-invariant, and the matrix multiplying it appears to be swap-invariant as well. However, the matrix square root complicates things. Denote

[TABLE]

To make the argument more precise, suppose that we want to show that swapping $\bm{X}_{1}$ with $\tilde{\bm{X}}_{1}$ does not change the joint distribution of $[\bm{X},\tilde{\bm{X}}]$ . Let $\bm{P}\in\mathbb{R}^{2p\times 2p}$ be the permutation matrix that swaps columns $1$ and $1+p$ of a matrix when multiplied on the right. By Equation (A.9), what we need to show is

[TABLE]

The left hand side equals $\left(\operatorname{Unif}\left(\mathcal{F}_{n,2p}\right)\,\bm{P}\right)\bm{P}\bm{G}\bm{P}$ . By known properties of the Haar measure, we have that $\operatorname{Unif}\left(\mathcal{F}_{n,2p}\right)\,\bm{P}\,{\buildrel\mathcal{D}\over{=}}\,\operatorname{Unif}\left(\mathcal{F}_{n,2p}\right)$ , and hence Equation (A.10) is equivalent to

[TABLE]

The only way we can see how one might prove this more simply than the proof in our paper is to show that $\bm{P}\bm{G}\bm{P}=\bm{G}$ , i.e., that $\bm{G}$ is swap-invariant.

$\bm{G}$ visually appears to be swap-invariant, and indeed is the square root of a swap-invariant matrix, but the fact that a matrix is swap-invariant does not directly imply that its square root is swap-invariant. The square root of a matrix in general is not unique, so we may hope that there exists (and we can identify) a swap-invariant square root in this case, but in the representation of Equation (A.9), we can actually only use the square root that has $\bm{D}\bm{V}^{\top}$ on its upper left block and has $\bm{0}$ on its bottom left block, in order to match $\bm{X}=\bm{U}\bm{D}\bm{V}^{\top}$ on the left hand side. Therefore, we can actually say with certainty that

[TABLE]

where $\bm{L}^{\top}\bm{L}=2\,\mathrm{diag}\{\bm{s}\}-\mathrm{diag}\{\bm{s}\}\hat{\bm{\Sigma}}^{-1}\mathrm{diag}\{\bm{s}\}$ is a Cholesky decomposition. However, this matrix cannot be swap-invariant. This is why we were unable to prove swap-exchangeability of $[\bm{X},\tilde{\bm{X}}]$ directly from swap-invariance of $\operatorname{Unif}\left(\mathcal{F}_{n,2p}\right)$ , and were instead forced to prove it directly using topological measure theory. Note that our proof uses similar machinery to the first-principles proof of the known result that $\operatorname{Unif}\left(\mathcal{F}_{n,2p}\right)$ is swap-invariant.
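The obstruction can be seen numerically. The display giving $\bm{G}$ is not reproduced in this extraction; the sketch below assumes the block form $\bm{G}=\begin{bmatrix}\bm{D}\bm{V}^{\top}&\bm{D}\bm{V}^{\top}(\bm{I}_{p}-\hat{\bm{\Sigma}}^{-1}\mathrm{diag}\{\bm{s}\})\\ \bm{0}&\bm{L}\end{bmatrix}$ , which matches the description in the text ( $\bm{D}\bm{V}^{\top}$ in the upper left block, $\bm{0}$ in the bottom left block, $\bm{L}$ from the Cholesky decomposition). The choice of $\bm{s}$ is an arbitrary feasible one.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 10, 3
X = rng.standard_normal((n, p))  # zero-mean case

U, d, Vt = np.linalg.svd(X, full_matrices=False)
DVt = np.diag(d) @ Vt
Sigma_hat = X.T @ X

s = np.full(p, 0.9 * 2 * np.linalg.eigvalsh(Sigma_hat)[0])  # illustrative s
Ds = np.diag(s)
L = np.linalg.cholesky(2 * Ds - Ds @ np.linalg.solve(Sigma_hat, Ds)).T

# Assumed block form of the square root G.
G = np.block([
    [DVt, DVt @ (np.eye(p) - np.linalg.solve(Sigma_hat, Ds))],
    [np.zeros((p, p)), L],
])
gram = G.T @ G  # equals [[Sigma_hat, Sigma_hat - Ds], [Sigma_hat - Ds, Sigma_hat]]

# Permutation swapping columns 1 and 1 + p (indices 0 and p).
P = np.eye(2 * p)
P[:, [0, p]] = P[:, [p, 0]]

gram_has_expected_blocks = np.allclose(
    gram, np.block([[Sigma_hat, Sigma_hat - Ds], [Sigma_hat - Ds, Sigma_hat]])
)
gram_is_swap_invariant = np.allclose(P @ gram @ P, gram)
G_is_swap_invariant = np.allclose(P @ G @ P, G)
```

In this sketch the Gram matrix is unchanged by the swap while $\bm{G}$ itself is not: the swap moves a generically nonzero entry into the bottom left block, which is required to be $\bm{0}$ .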

A.3 Gaussian Graphical Models

Proof of Theorem 3.2.

By classical results for the multivariate Gaussian distribution, we have

[TABLE]

where $\bm{\Xi}=\bm{\Sigma}_{B^{c},B}(\bm{\Sigma}_{B,B})^{-1}$ , $\bm{\mu}^{\ast}=\bm{\mu}_{B^{c}}-\bm{\Xi}\bm{\mu}_{B}$ and $(\bm{\Sigma}^{\ast})^{-1}=(\bm{\Sigma}^{-1})_{B^{c},B^{c}}$ . By the condition that $G$ is $n$ -separated by $B$ , $(\bm{\Sigma}^{\ast})^{-1}$ is block diagonal with blocks defined by the $V_{k}$ ’s. Thus $X_{V_{1}},\dots,X_{V_{\ell}}$ are conditionally independent given $X_{B}$ .

To show $[\bm{X},\tilde{\bm{X}}]$ is invariant to swapping $A$ for any $A\subseteq[p]$ , by conditional independence of the $X_{V_{k}}$ ’s, it suffices to show that for any $k\in[\ell]$ and $A_{k}\,:=A\,\cap V_{k}$ ,

[TABLE]

Before proving Equation (A.13), we set up some notation. Let $\bm{\Omega}=\bm{\Sigma}^{-1}$ , then by block matrix inversion (see, e.g., Kollo and von Rosen (2006, Proposition 1.3.3)),

[TABLE]

and

[TABLE]

Thus $\bm{\Xi}$ can be written as $-\bm{\Omega}_{B^{c},B^{c}}^{-1}\bm{\Omega}_{B^{c},B}$ .
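The two block-inversion facts used here, $(\bm{\Sigma}^{\ast})^{-1}=(\bm{\Sigma}^{-1})_{B^{c},B^{c}}$ and $\bm{\Xi}=-\bm{\Omega}_{B^{c},B^{c}}^{-1}\bm{\Omega}_{B^{c},B}$ , can be checked on a random covariance matrix. In the sketch below the index sets and the covariance are arbitrary illustrative choices, and $\bm{\Sigma}^{\ast}$ is taken to be the usual conditional covariance $\bm{\Sigma}_{B^{c},B^{c}}-\bm{\Xi}\bm{\Sigma}_{B,B^{c}}$ .

```python
import numpy as np

rng = np.random.default_rng(6)
p = 6
A = rng.standard_normal((p + 2, p))
Sigma = A.T @ A  # positive definite with probability one
Omega = np.linalg.inv(Sigma)

B = np.array([0, 1])          # illustrative conditioning set
Bc = np.array([2, 3, 4, 5])   # its complement

Xi = Sigma[np.ix_(Bc, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)])
Sigma_star = Sigma[np.ix_(Bc, Bc)] - Xi @ Sigma[np.ix_(B, Bc)]

Omega_cc = Omega[np.ix_(Bc, Bc)]
Omega_cb = Omega[np.ix_(Bc, B)]

precision_block_matches = np.allclose(np.linalg.inv(Sigma_star), Omega_cc)
xi_from_precision = np.allclose(Xi, -np.linalg.solve(Omega_cc, Omega_cb))
```

The second identity is what lets zeros in $\bm{\Omega}_{V_{k},B\setminus B_{k}}$ translate into zeros in $\bm{\Xi}_{V_{k},B\setminus B_{k}}$ once $\bm{\Omega}_{B^{c},B^{c}}$ is block diagonal.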

Now fix $k\in[\ell]$ . Let $B_{k}=I_{V_{k}}\,\cap\,B$ . Since $V_{k}$ and $B\setminus B_{k}$ are not adjacent, $\bm{\Omega}_{V_{k},B\setminus B_{k}}$ , and thus $\bm{\Xi}_{V_{k},B\setminus B_{k}}$ , equals $\bm{0}$ . Equation (A.12) implies

[TABLE]

This also implies that

[TABLE]

Since the rows of $\bm{X}_{V_{k}\uplus B_{k}}$ are i.i.d. Gaussian, the validity of Algorithm 8 (see Theorem B.2 in Appendix B.1.3) says that $\tilde{\bm{X}}_{V_{k}}$ generated in Line 2 of Algorithm 2 satisfies

[TABLE]

This together with Equation (A.14) shows Equation (A.13). This completes the proof. ∎

Proof of Proposition 3.3.

This proof concerns Algorithm 10 in Appendix B.2, which is shown there to be equivalent to Algorithm 3. Without loss of generality, we assume $\pi=(1,\dots,p)$ . Denote by $N_{j}^{(h)}$ the set $N_{j}$ in Algorithm 10 after the $h$ th step. The updating steps of the algorithm ensure $j\notin N_{j}^{(h)}$ for any $j$ and $h$ . Note that $N_{j}$ does not change after the $(j-1)$ th step, i.e., $N_{j}^{(j-1)}=N_{j}^{(j)}=\dots=N_{j}^{(p)}$ .

It suffices to show the following inequality for each connected component $W$ , whose vertex set is denoted by $V$ , of the subgraph induced by deleting $B$ :

[TABLE]

Part 1. First note that by definition of $V$ , every element of $I_{V}$ is either in $V$ or $B$ . Now define $F=[p]\setminus(V\uplus(I_{V}\cap\,B))$ . We will show that no $k\in F$ ever appears in $N_{j}$ for any $j\in V$ .

Initially, for any $j\in V$ , $N_{j}^{(0)}=I_{j}$ does not intersect $F$ . Suppose $h$ is the smallest integer such that there exists some $j\in V$ such that $N_{j}^{(h)}$ contains some $k\in F$ . By the construction of the algorithm, $h\notin B$ , $j>h$ and $j\in N_{h}^{(h-1)}$ (otherwise $N_{j}^{(h)}$ would not have been altered in the $h$ th step), $k\in N_{h}^{(h-1)}$ (otherwise $k$ could not have entered $N_{j}^{(h)}$ at the $h$ th step), and $h\in N_{j}^{(h-1)}$ (by symmetry of $N_{j}^{(i)}$ and $N_{h}^{(i)}$ for $i<\min(h,j)$ ).

Since $h\in N_{j}^{(h-1)}$ , the definition of $h$ guarantees $h\notin F$ (otherwise $h-1$ would be smaller and satisfy the condition defining $h$ ), and thus $h$ is in either $V$ or $I_{V}\cap B$ . But since $h\notin B$ , we must have $h\in V$ . Now we have shown $k\in N_{h}^{(h-1)}$ , i.e., $F$ intersects $N_{h}$ before the $h$ th step, and $h\in V$ , but this contradicts the definition of $h$ . We conclude that for any $j\in V$ and any $h\in[p]$ , $F\cap\,N_{j}^{(h)}=\emptyset$ and thus $N_{j}^{(p)}\subseteq(I_{V}\cap\,B)\uplus\,(V\setminus\{j\})$ .

Part 2. We now characterize $N_{j}^{(p)}$ . For any $j\in V$ , define

[TABLE]

We will show $L_{j}\subseteq N_{j}^{(p)}$ by induction. This is true for the smallest $j\in V$ because $L_{j}=I_{j}\subseteq N_{j}^{(p)}$ . Now assume $L_{j}\subseteq N_{j}^{(p)}$ for any $j<j_{0}$ (both in $V$ ); we will show $L_{j_{0}}\subseteq N_{j_{0}}^{(p)}$ . For any $v\in L_{j_{0}}$ , if $v\in I_{j_{0}}$ it is trivial that $v\in N_{j_{0}}^{(p)}$ . If $v\in L_{j_{0}}\setminus I_{j_{0}}$ , there is a path $(j_{0},j_{1},\dots,j_{m},v)$ in $G$ where $\{j_{i}\}_{i=1}^{m}\subseteq V$ are all smaller than $j_{0}$ . Let $j_{i^{*}}$ be the largest among $\{j_{i}\}_{i=1}^{m}$ . With the two paths $(j_{0},j_{1},\dots,j_{i^{*}})$ and $(j_{i^{*}},\dots,j_{m},v)$ , we have $j_{0},v\in L_{j_{i^{*}}}\subseteq N_{j_{i^{*}}}^{(p)}$ by the inductive hypothesis. Since $j_{0}\in N_{j_{i^{*}}}^{(p)}$ and $j_{0}>j_{i^{*}}$ , in the $j_{i^{*}}$ th step on Line 5, $N_{j_{0}}$ absorbs $N_{j_{i^{*}}}\setminus\{j_{0}\}$ , and it follows that $v\in N_{j_{0}}^{(j_{i^{*}})}$ and thus $v\in N_{j_{0}}^{(p)}$ . We finally conclude that $L_{j_{0}}\subseteq N_{j_{0}}^{(p)}$ , and by induction, $L_{j}\subseteq N_{j}^{(p)}$ for all $j\in V$ .

Part 3. Let $j^{*}$ be the largest number in $V$ . Since $W$ is connected and $j^{*}$ is the largest, the definition of $L_{j^{*}}$ implies $(I_{V}\cap\,B)\uplus(V\setminus\{j^{*}\})=L_{j^{*}}$ . Part 1 showed that $N_{j^{*}}^{(p)}\subseteq(I_{V}\cap\,B)\uplus\,(V\setminus\{j^{*}\})$ and Part 2 showed that $L_{j^{*}}\subseteq N_{j^{*}}^{(p)}$ . Thus $N_{j^{*}}^{(p)}=(I_{V}\cap\,B)\uplus\,(V\setminus\{j^{*}\})$ .

Since $B$ keeps growing, at the $j$ th step of Algorithm 10, the set $\{1,\dots,j-1\}\setminus B$ with the current $B$ is the same as that with the final $B$ . At the $j^{*}$ th step of the algorithm, $N_{j^{*}}\cap(\{1,\dots,j^{*}-1\}\setminus B)$ equals $V\setminus\{j^{*}\}$ (since $j^{*}$ is the largest in $V$ ). Hence

[TABLE]

Since $j^{*}\notin B$ , the requirement in Line 4 and the equality above imply

[TABLE]

and this completes the proof. ∎

A.4 Discrete Graphical Models

Proof of Theorem 3.4.

We first show

[TABLE]

Suppose $j\in B^{c}$ , then $I_{j}\subseteq B$ . By the local Markov property,

[TABLE]

By the weak union property, we have

[TABLE]

which implies $\mathbb{P}\left(X_{B^{c}}\mid X_{B}\right)=\mathbb{P}\left(X_{j}\mid X_{B}\right)\mathbb{P}\left(X_{B^{c}\setminus\{j\}}\mid X_{B}\right)$ . Following this logic for the remaining elements of $B^{c}\setminus\{j\}$ , we have $\mathbb{P}\left(X_{B^{c}}\mid X_{B}\right)=\prod_{j\in B^{c}}\mathbb{P}\left(X_{j}\mid X_{B}\right)$ , which is then equal to $\prod_{j\in B^{c}}\mathbb{P}\left(X_{j}\mid X_{I_{j}}\right)$ because $X_{j}\perp\!\!\!\perp X_{B\setminus I_{j}}\mid X_{I_{j}}$ .

Secondly, as justified in Section 3.3.1, the construction of $\tilde{\bm{X}}_{j}$ in Algorithm 5 implies that conditional on $T_{B}(\bm{X})$ , $\tilde{\bm{X}}_{j}$ and $\bm{X}_{j}$ are independent and identically distributed, and thus

[TABLE]

By the law of total probability, it follows that

[TABLE]

Since $\tilde{\bm{X}}_{j}$ is generated without looking at $\bm{X}_{B^{c}\setminus\{j\}}$ , it holds that

[TABLE]

Next we show $(\bm{X}_{B^{c}},\tilde{\bm{X}}_{B^{c}})_{\text{swap}(A)}\,{\buildrel\mathcal{D}\over{=}}\,(\bm{X}_{B^{c}},\tilde{\bm{X}}_{B^{c}})$ for any $A\subseteq B^{c}$ . For any pair of column vectors $(\bm{\mathsf{X}}_{j},\tilde{\bm{\mathsf{X}}}_{j})\in[{K}_{j}]^{n}\times[{K}_{j}]^{n}$ , define

[TABLE]

By Equations (A.15) and (A.17),

[TABLE]

where the third equality (which swaps the order of $\bm{\mathsf{X}}_{j}$ and $\tilde{\bm{\mathsf{X}}}_{j}$ and adds superscript $A$ ’s in the second product) follows from Equation (A.16).

Together with $\tilde{\bm{X}}_{B}=\bm{X}_{B}$ , we conclude $\tilde{\bm{X}}$ is a valid knockoff for $\bm{X}$ . ∎

Appendix B Algorithmic Details

B.1 Low Dimensional Gaussian

B.1.1 Additional Details on Algorithm 1

We begin with the construction of a suitable $\bm{s}$ by extending existing algorithms for computing $\bm{s}$ to our situation. Without loss of generality we assume $\hat{\bm{\Sigma}}_{j,j}=1$ for $j=1,\dots,p$ here; otherwise denote by $\hat{\bm{D}}$ the diagonal matrix with $\hat{\bm{D}}_{j,j}=\hat{\bm{\Sigma}}_{j,j}$ , set $\hat{\bm{\Sigma}}^{0}$ to be $\hat{\bm{D}}^{-1/2}\hat{\bm{\Sigma}}\hat{\bm{D}}^{-1/2}$ , define $\bm{s}^{0}=\hat{\bm{D}}^{-1}\bm{s}$ , and proceed with $\hat{\bm{\Sigma}}$ and $\bm{s}$ replaced by $\hat{\bm{\Sigma}}^{0}$ and $\bm{s}^{0}$ respectively. For any $\epsilon,\delta\in(0,1)$ , we can compute $\bm{s}$ in any of the following ways:

• Equicorrelated (Barber and Candès, 2015): Take $s_{j}^{\text{EQ}}=(1-\epsilon)\min\left(2\lambda_{\text{min}}(\hat{\bm{\Sigma}}),1\right)$ for all $j=1,\dots,p$ .

• Semidefinite program (SDP) (Barber and Candès, 2015): Take $\bm{s}^{\text{SDP}}$ to be the solution to the following convex optimization:

[TABLE]

• Approximate SDP (Candès et al., 2018): Choose an approximation $\hat{\bm{\Sigma}}_{\text{approx}}$ of $\hat{\bm{\Sigma}}$ and compute $\bm{s}^{\text{approx}}$ by the SDP method as if $\hat{\bm{\Sigma}}=\hat{\bm{\Sigma}}_{\text{approx}}$ . Then set $\bm{s}=\gamma\bm{s}^{\text{approx}}$ where $\gamma$ solves

[TABLE]

Noting that $\tilde{\bm{X}}_{j}^{\top}\bm{X}_{j}/n=\hat{\Sigma}_{j,j}-s_{j}$ , it will always be preferable to take $\epsilon$ as small as possible (for all methods), so that $\bm{s}$ is as large as possible and $\bm{X}_{j}$ and $\tilde{\bm{X}}_{j}$ are as different as possible. For the SDP method, the lower bound $\delta$ can be set to be $s_{j}^{\text{EQ}}$ multiplied by a small number, e.g., $\delta=0.1\cdot 2\lambda_{\text{min}}(\hat{\bm{\Sigma}})$ , to guarantee feasibility; this choice is used in the simulations in Sections 3.1 and 3.2.
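The equicorrelated choice, including the rescaling to unit diagonal described at the start of this subsection, can be sketched as follows. The function name `equicorrelated_s` and the default `eps` are illustrative, and the feasibility check assumes the usual knockoff constraint $\bm{0}\prec\mathrm{diag}\{\bm{s}\}\prec 2\hat{\bm{\Sigma}}$ .

```python
import numpy as np


def equicorrelated_s(Sigma_hat, eps=1e-3):
    """Equicorrelated s for a positive definite Sigma_hat (hypothetical helper).

    Rescales to unit diagonal, applies s0 = (1 - eps) * min(2 * lambda_min, 1),
    and maps back via s = D s0, where D holds the diagonal of Sigma_hat.
    """
    d = np.diag(Sigma_hat)
    corr = Sigma_hat / np.sqrt(np.outer(d, d))  # D^{-1/2} Sigma_hat D^{-1/2}
    lam_min = np.linalg.eigvalsh(corr)[0]       # smallest eigenvalue
    s0 = (1.0 - eps) * min(2.0 * lam_min, 1.0)
    return d * s0


rng = np.random.default_rng(7)
p = 5
A = rng.standard_normal((3 * p, p))
Sigma_hat = A.T @ A  # illustrative positive definite matrix

s = equicorrelated_s(Sigma_hat)
feasibility_gap = np.linalg.eigvalsh(2 * Sigma_hat - np.diag(s))[0]
```

A positive `feasibility_gap` indicates that $2\hat{\bm{\Sigma}}-\mathrm{diag}\{\bm{s}\}$ is positive definite for this draw; shrinking `eps` moves $\bm{s}$ closer to the boundary of the feasible set.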

We now derive the computational complexity of Algorithm 1. The Cholesky decomposition takes $O(p^{3})$ operations and the Gram–Schmidt orthonormalization takes $O(np^{2})$ operations. If $\bm{s}$ is computed by the Equicorrelated method, whose complexity is no larger than $O(p^{3})$ , the overall complexity of Algorithm 1 is $O(np^{2})$ , since $n>2p$ makes every $O(p^{3})$ term no larger than $O(np^{2})$ .

B.1.2 Gaussian Knockoffs with Known Mean

Algorithm 7 is a slight modification of Algorithm 1 for multivariate Gaussian models with mean parameter $\bm{\mu}$ known. The proof of its validity requires only minor modification of the proof of Theorem 3.1, and is thus omitted.

B.1.3 Partial Gaussian Knockoffs with Fixed Columns

Consider the case where some of the variables are known to be relevant and thus do not need to have knockoffs generated for them. Let $B\subseteq[p]$ be the set of variables that no knockoffs are needed for, so we only want to construct knockoffs for variables in $V=B^{c}$ , i.e., to generate $\tilde{\bm{X}}_{V}$ such that for any subset $A\subseteq V$ ,

[TABLE]

Algorithm 8 provides a way to generate such knockoffs. We can find its computational complexity as follows. Fitting the least squares in Line 1 takes $O(n|B|^{2}|V|)$ , computing $\hat{\bm{\Sigma}}$ takes $O(n|V|^{2})$ , both the most efficient construction of $\bm{s}$ and inverting $\hat{\bm{\Sigma}}$ take $O(|V|^{3})$ , and the Gram–Schmidt orthonormalization takes $O(n(1+|B|+2|V|)^{2})$ . Hence the overall computational complexity is $O\left(n|B|^{2}|V|+n|V|^{2}\right)$ .

The validity of Algorithm 8 relies on its equivalence to a straightforward but slow algorithm, Algorithm 9. We first show the validity of Algorithm 9 and then show the equivalence.

Proposition B.1.

Algorithm 9 generates valid knockoffs for $\bm{X}_{V}$ conditional on $\bm{X}_{B}$.

Proof.

By classical results for the multivariate Gaussian distribution, we have

[TABLE]

where $\bm{\Xi}=\bm{\Sigma}_{V,B}(\bm{\Sigma}_{B,B})^{-1}$ , $\bm{\mu}^{\ast}=\bm{\mu}_{V}-\bm{\Xi}\bm{\mu}_{B}$ and $(\bm{\Sigma}^{\ast})^{-1}=(\bm{\Sigma}^{-1})_{V,V}$ .

We want to show that for any $A\subseteq V$ ,

[TABLE]

For $n$ i.i.d. samples,

[TABLE]

By the definition of $\bm{Q}$ in Algorithm 9, $\bm{Q}^{\top}[\bm{1}_{n},\bm{X}_{B}]=\bm{0}_{(n-1-|B|)\times(1+|B|)}$ and $\bm{Q}^{\top}\bm{Q}=\bm{I}_{n-1-|B|}$. This together with the property in Equation (A.3) implies

[TABLE]

Since $n-1-|B|\geq 2|V|$ , Algorithm 7 can be used to generate knockoffs $\bm{J}$ for $\bm{Q}^{\top}\bm{X}_{V}$ , which satisfies that

[TABLE]

and thus

[TABLE]

Adding $[\bm{Q}_{\perp}\bm{Q}_{\perp}^{\top}\bm{X}_{V},\bm{Q}_{\perp}\bm{Q}_{\perp}^{\top}\bm{X}_{V}]$, which is trivially invariant to swapping, to both sides and using $\bm{I}_{n}=\bm{Q}\bm{Q}^{\top}+\bm{Q}_{\perp}\bm{Q}_{\perp}^{\top}$ and the definition of $\tilde{\bm{X}}_{V}$ in Line 3 of Algorithm 9, we have

[TABLE]

Since this holds for any $A\subseteq V$ , $\tilde{\bm{X}}_{V}$ is a valid knockoff matrix for $\bm{X}_{V}$ conditional on $\bm{X}_{B}$ . ∎

Theorem B.2.

Algorithm 8 generates valid knockoffs for $\bm{X}_{V}$ conditional on $\bm{X}_{B}$ .

Proof.

It suffices to show that if the same $\bm{s}$ and $\bm{L}$ in Algorithm 8 are used to generate $\bm{J}$ in Line 2 of Algorithm 9, then the output $\tilde{\bm{X}}_{V}$ in Algorithm 8 and the output in Algorithm 9, which is denoted by $\tilde{\bm{X}}_{V}^{\prime}$ to avoid confusion, have the same conditional distribution given $\bm{X}_{B}$ and $\bm{X}_{V}$ .

We write the Gram–Schmidt orthonormalization as a function $\Psi(\cdot)$ . Let $b=1+|B|$ and $d=|V|$ . By assumption, $b+2d\leq n$ .

By the definition of $\bm{Q}$ and $\bm{Q}_{\perp}$ in Line 1 of Algorithm 9, we have

[TABLE]

First, we express $\tilde{\bm{X}}_{V}^{\prime}$ in a similar form as $\tilde{\bm{X}}_{V}$ in Line 5 of Algorithm 8. The conditional knockoff matrix for $\bm{Q}^{\top}\bm{X}_{V}$ generated by Algorithm 7 (with $\bm{\mu}=\bm{0}_{(n-b)\times 1}$ ) is given by

[TABLE]

where $\hat{\bm{\Sigma}}^{\prime}=\bm{X}_{V}^{\top}\bm{Q}\bm{Q}^{\top}\bm{X}_{V}=\bm{R}^{\top}\bm{R}=\hat{\bm{\Sigma}}$ and $\bm{U}^{\prime}$ is the last $d$ columns of the Gram–Schmidt orthonormalization of $[\bm{Q}^{\top}\bm{X}_{V},\bm{W}^{\prime}]$ with $\bm{W}^{\prime}\sim\mathcal{N}_{n-b,d}(\bm{0},\bm{I}_{n-b}\otimes\bm{I}_{d})$ independent of $\bm{X}_{V}$ and $\bm{X}_{B}$ . Hence we have

[TABLE]

It suffices to show $\bm{U}$ in Line 4 of Algorithm 8 is distributed the same as $\bm{Q}\bm{U}^{\prime}$ conditional on $\bm{X}$ .

Without loss of generality (by choosing $\bm{Q}_{\perp}$ in Line 1 of Algorithm 9), assume the Gram–Schmidt orthonormalization of $[\bm{1}_{n},\bm{X}_{B},\bm{X}_{V}]$ is $[\bm{Q}_{\perp},\bm{M}]$, where $\bm{M}$ is an $n\times d$ matrix. Hence $\text{span}(\bm{M})=\text{span}(\bm{Q}\bm{Q}^{\top}\bm{X}_{V})$. Let $\bm{Z}$ be an $(n-b)\times(n-b-d)$ matrix whose columns form an orthonormal basis for the orthogonal complement of $\text{span}(\bm{Q}^{\top}\bm{X}_{V})$.

Characterizing $\bm{U}$: Let $\bm{\Gamma}=[\bm{Q}_{\perp},\bm{M},\bm{Q}\bm{Z}]$. Since $\bm{Z}^{\top}\bm{Q}^{\top}\bm{X}_{V}=\bm{0}$, we have $\bm{Z}^{\top}\bm{Q}^{\top}\bm{Q}\bm{Q}^{\top}\bm{X}_{V}=\bm{0}$ and thus $\bm{Z}^{\top}\bm{Q}^{\top}\bm{M}=\bm{0}$. Together with $\bm{Q}_{\perp}^{\top}\bm{Q}\bm{Z}=\bm{0}$ and $(\bm{Q}\bm{Z})^{\top}\bm{Q}\bm{Z}=\bm{I}_{n-b-d}$, we have $\bm{\Gamma}\in\mathcal{O}_{n}$.

Using Equation (A.4),

[TABLE]

where the first equality is due to the fact that Gram–Schmidt orthonormalization treats the columns of its inputs sequentially. An elementary calculation shows

[TABLE]

thus

[TABLE]

Using the definition of $\bm{\Gamma}$ and Equations (B.1.3) and (B.9), we conclude

[TABLE]

which implies

[TABLE]

Noting that $\bm{Z}^{\top}\bm{Q}^{\top}\bm{Q}\bm{Z}=\bm{I}_{n-b-d}$ and $\bm{W}\sim\mathcal{N}_{n,d}(\bm{0},\bm{I}_{n}\otimes\bm{I}_{d})$, Equation (A.3) implies $\bm{Z}^{\top}\bm{Q}^{\top}\bm{W}\sim\mathcal{N}_{n-b-d,d}(\bm{0},\bm{I}_{n-b-d}\otimes\bm{I}_{d})$. By the classic result in Eaton (1983, Proposition 7.2), the conditional distribution of $\Psi(\bm{Z}^{\top}\bm{Q}^{\top}\bm{W})$ given $(\bm{X}_{B},\bm{X}_{V})$ is the unique $\mathcal{O}_{n-b-d}$-invariant probability measure on $\mathcal{F}_{n-b-d,d}$.

Characterizing $\bm{U}^{\prime}$ : Let $\bm{Z}_{\perp}\in\mathbb{R}^{(n-b)\times d}$ be the Gram–Schmidt orthonormalization of $\bm{Q}^{\top}\bm{X}_{V}$ , and thus $\bm{Z}^{\top}\bm{Z}_{\perp}=\bm{0}$ . Let $\bm{\Gamma}_{z}=[\bm{Z}_{\perp},\bm{Z}]$ , then $\bm{\Gamma}_{z}\in\mathcal{O}_{n-b}$ . Again using the properties of Gram–Schmidt orthonormalization,

[TABLE]

Since

[TABLE]

it holds that

[TABLE]

Hence $\bm{U}^{\prime}=\bm{Z}\Psi(\bm{Z}^{\top}\bm{W}^{\prime})$ by combining Equations (B.10) and (B.11). As before, we can conclude that the conditional distribution of $\Psi(\bm{Z}^{\top}\bm{W}^{\prime})$ given $(\bm{X}_{B},\bm{X}_{V})$ is the unique $\mathcal{O}_{n-b-d}$ -invariant probability measure on $\mathcal{F}_{n-b-d,d}$ .

Combining the two parts above and using the uniqueness of the invariant measure, we conclude that

[TABLE]

Using the definition of $\tilde{\bm{X}}_{V}$ in Line 4 and Equation (B.7), it follows that $\tilde{\bm{X}}_{V}\,{\buildrel\mathcal{D}\over{=}}\,\tilde{\bm{X}}_{V}^{\prime}\mid(\bm{X}_{B},\bm{X}_{V})$ . ∎
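The proof relies on the fact that Gram–Schmidt orthonormalization processes its input columns sequentially, so orthonormalizing $[\bm{1}_{n},\bm{X}_{B},\bm{X}_{V}]$ reproduces the orthonormalization of $[\bm{1}_{n},\bm{X}_{B}]$ in its leading columns. The following plain-Python sketch checks this numerically. It is only an illustration: `gram_schmidt` is a hypothetical stand-in for $\Psi(\cdot)$, and the random columns are placeholders for the data.

```python
import math
import random

def gram_schmidt(columns):
    """Orthonormalize column vectors left to right (a stand-in for Psi)."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            coef = sum(a * b for a, b in zip(q, v))
            v = [a - coef * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm < 1e-12:
            raise ValueError("columns are linearly dependent")
        basis.append([a / norm for a in v])
    return basis

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(0)
n, size_B, size_V = 8, 2, 2            # illustrative sizes with 1 + |B| + 2|V| <= n
ones = [1.0] * n
X_B = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(size_B)]
X_V = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(size_V)]

head = gram_schmidt([ones] + X_B)          # plays the role of Q_perp
full = gram_schmidt([ones] + X_B + X_V)    # plays the role of [Q_perp, M]
M = full[len(head):]                       # the last |V| orthonormal columns
```

Here the first $1+|B|$ columns of `full` coincide with `head`, and the columns in `M` are orthogonal to them, which is the structure assumed for $[\bm{Q}_{\perp},\bm{M}]$ in the proof.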

B.2 Gaussian Graphical Models

The computational complexity of Algorithm 2 can be shown by summing up the computational complexity of Algorithm 8 in Line 2 for individual connected components, which is $O\left(n\left|I_{V_{k}}\,\cap\,B\right|^{2}\left|V_{k}\right|+n|V_{k}|^{2}\right)$, as shown in Appendix B.1.3. The upper bound then follows from the facts that $\sum_{k=1}^{\ell}|V_{k}|\leq p$ and $\max_{1\leq k\leq\ell}|V_{k}|\leq n^{\prime}$.

B.2.1 Greedy Search for a Blocking Set

Algorithm 10 is the virtual implementation of Algorithm 3. In Line 5 of Algorithm 10, we only need to keep track of $N_{j}$ the neighborhood of each unvisited $j$ in $\bar{G}$ among the vertices in $[p]$ . This is because if $k\in N_{j}$ and $\tilde{k}$ exists in $\bar{G}$ then it is guaranteed by Algorithm 3 that $\tilde{k}$ is a neighbor of $j$ in $\bar{G}$ , and $j$ is a neighbor of both $k$ and $\tilde{k}$ . This also implies that $|N_{j}\cap\{\pi_{1},\dots,\pi_{t-1}\}\setminus B|$ equals the size of the neighborhood of $j$ in $\bar{G}$ among the knockoff vertices. Also note that the neighborhood of a visited vertex is no longer used in Line 4 of Algorithm 3, therefore the update step in Line 5 of Algorithm 10 can be restricted to the unvisited $k$ ’s. In the following, we use the equivalence between Algorithm 3 and Algorithm 10 to prove the properties of Algorithm 3.

The following proposition shows that if the tail of the input permutation to Algorithm 3 is already a blocking set of the graph, then the output from the algorithm is a subset of this blocking set. This property allows one to refine a known but large blocking set (e.g., one could apply Algorithm 3 to the blocking set from Example 2 in Appendix B.2.3).

Proposition B.3.

Suppose $n^{\prime}$ and $\pi$ are the inputs of Algorithm 3, which returns a blocking set $B$ . If $G$ is $n^{\prime}$ -separated by $\{\pi_{m+1},\dots,\pi_{p}\}$ for some $m\in[p]$ , then $\pi_{1},\dots,\pi_{m}$ will not be in $B$ .

Proof.

In this proof, we use the equivalence between Algorithm 3 and Algorithm 10. Let $D=\{\pi_{m+1},\dots,\pi_{p}\}$ . Without loss of generality, re-index the variables so that $\pi_{j}=j$ for every $j\in[p]$ , and thus $D=\{m+1,\dots,p\}$ . Denote by $N_{j}^{(h)}$ the set $N_{j}$ in Algorithm 10 after the $h$ th step, as in the proof of Proposition 3.3. Let $W$ be any of the connected components of the subgraph induced by deleting $D$ , and $V$ be the vertex set of $W$ . Then $V\subseteq\{1,\dots,m\}$ .

Part 1. We first show that $N_{j}^{(h)}\subseteq V\uplus\,(I_{V}\cap D)$ for any $j\in V$ and $h\in[p]$ . The proof is similar to Part 1 in the proof of Proposition 3.3. By definition of $V$ , every element of $I_{V}$ is either in $V$ or $D$ . Define $F=[p]\setminus(V\uplus(I_{V}\cap\,D))$ . It suffices to show that $k\in F$ will never appear in $N_{j}$ for any $j\in V$ .

Initially, for any $j\in V$ , $N_{j}^{(0)}=I_{j}$ does not intersect $F$ . Suppose $h$ is the smallest integer such that there exists some $j\in V$ such that $N_{j}^{(h)}$ contains some $k\in F$ . By the construction of the algorithm, $j>h$ and $j\in N_{h}^{(h-1)}$ (otherwise $N_{j}^{(h)}$ would not have been altered in the $h$ th step), $k\in N_{h}^{(h-1)}$ (otherwise $k$ could not have entered $N_{j}^{(h)}$ at the $h$ th step), and $h\in N_{j}^{(h-1)}$ (by symmetry of $N_{j}^{(i)}$ and $N_{h}^{(i)}$ for $i<\min(h,j)$ ). The fact that $h<j\leq m$ implies $h\notin D$ . Since $h\in N_{j}^{(h-1)}$ , the definition of $h$ guarantees $h\notin F$ (otherwise $h-1$ would be smaller and satisfy the condition defining $h$ ), and thus $h$ is in either $V$ or $I_{V}\cap D$ . But since $h\notin D$ , we must have $h\in V$ . Now we have shown $k\in N_{h}^{(h-1)}$ , i.e., $F$ intersects $N_{h}$ before the $h$ th step, and $h\in V$ , but this contradicts the definition of $h$ . We conclude that for any $j\in V$ and any $h\in[p]$ , $F\cap\,N_{j}^{(h)}=\emptyset$ and thus $N_{j}^{(j-1)}\subseteq(I_{V}\cap\,D)\uplus\,(V\setminus\{j\})$ .

Part 2. For any $j\in V$ , at the $j$ th step of Algorithm 10, $N_{j}\cap\{1,\dots,j-1\}\subseteq V\setminus\{j\}$ by the definition of $D$ . Hence we have

[TABLE]

where the last inequality is because of the condition that $G$ is $n^{\prime}$ -separated by $D$ . Thus the requirement in Line 4 of Algorithm 10 is satisfied and $j$ is not in the blocking set.

Finally, since $j$ and $W$ are arbitrary, we conclude that any vertex in $\{1,\dots,m\}$ is not blocked. ∎

B.2.2 Searching for Blocking Sets

Given any $m$ , Algorithm 11 performs a randomized greedy search for the blocking sets $B_{i}$ . Although there is no guarantee that the $B_{i}$ ’s found by Algorithm 11 satisfy $\bigcap\limits_{i=1}^{m}B_{i}=\emptyset$ , one can subsequently check whether $\eta_{j}=m$ for any $j\in[p]$ , in which case the algorithm can be run again. Inspecting the vertices with $\eta_{j}=m$ may reveal the difficulties of blocking for this graph. Changing the inputs $m$ and $n^{\prime}$ may also help.
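The post-hoc check described above can be sketched in a few lines. The sketch assumes that $\eta_{j}$ counts how many of the blocking sets $B_{1},\dots,B_{m}$ contain vertex $j$ (an interpretation inferred from the text, since $\eta_{j}=m$ is used to flag vertices lying in every $B_{i}$); the function names and the toy blocking sets are hypothetical.

```python
def coverage_counts(p, blocking_sets):
    """eta[j] = number of blocking sets containing vertex j (vertices are 1..p)."""
    eta = {j: 0 for j in range(1, p + 1)}
    for B in blocking_sets:
        for j in B:
            eta[j] += 1
    return eta

def always_blocked(p, blocking_sets):
    """Vertices with eta_j = m, i.e. those in the intersection of all B_i."""
    m = len(blocking_sets)
    eta = coverage_counts(p, blocking_sets)
    return sorted(j for j, count in eta.items() if count == m)

# Toy example: vertex 5 is blocked in both sets, so the search should be rerun
# (or m and n' changed) if a knockoff for vertex 5 is wanted.
blocking_sets = [{2, 4, 5}, {1, 3, 5}]
stuck = always_blocked(5, blocking_sets)
```

An empty result means $\bigcap_{i=1}^{m}B_{i}=\emptyset$, so every variable is left unblocked in at least one of the $m$ runs.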

B.2.3 Examples of $(m,n)$ -Coverable Graphs

Example 1 (Time-inhomogeneous Autoregressive Models).

Consider a time-inhomogeneous Gaussian AR($r$) model with $r\geq 1$ (when $r=0$, the graph is isolated and is $(1,n)$-coverable for any $n\geq 3$), so that the sparsity pattern is $E=\{(i,j):1\leq|i-j|\leq r\}$. Suppose $n\geq 2+8r$. A simple choice of blocking sets is given as follows. Let $d=\lfloor(n-2)/8\rfloor$, so that $d\geq r$. Let $B_{1}=[p]\cap\{kd+i:k\text{ odd, and }i=1,\dots,d\}$ and $B_{2}=[p]\cap\{kd+i:k\text{ even, and }i=1,\dots,d\}$. Any connected component $W$ of the subgraph obtained by deleting $B_{1}$ has at most $d$ vertices, and its vertex set $V$ satisfies $|I_{V}\cap B_{1}|\leq 2r$, so $G$ is $(2d+2r+1)$-separated by $B_{1}$ and $2d+2r+1\leq n/2$. The same holds for $B_{2}$. Note that $[p]=B_{1}^{c}\cup B_{2}^{c}$, thus the graph is $(2,n)$-coverable.
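The construction in Example 1 can be checked mechanically. The sketch below builds $B_{1}$ and $B_{2}$ for illustrative values of $p$, $r$ and $n$ and evaluates $1+2|V|+|I_{V}\cap B|$ for every connected component left after deleting a blocking set. Treating this quantity as the separation level is an assumption inferred from the bound $2d+2r+1$ in the example; all function names are hypothetical.

```python
def ar_blocking_sets(p, r, n):
    """Alternating blocks of length d = floor((n - 2) / 8) on vertices 1..p."""
    d = (n - 2) // 8
    if r < 1 or d < r:
        raise ValueError("need r >= 1 and n >= 2 + 8r")
    # Vertex v = k*d + i with i in 1..d lies in block k = (v - 1) // d.
    B1 = {v for v in range(1, p + 1) if ((v - 1) // d) % 2 == 1}
    B2 = {v for v in range(1, p + 1) if ((v - 1) // d) % 2 == 0}
    return d, B1, B2

def components_after_deleting(p, r, B):
    """Connected components of the AR(r) graph once the vertices in B are removed."""
    remaining = [v for v in range(1, p + 1) if v not in B]
    comps, current = [], []
    for v in remaining:
        # A gap larger than r means no edge can cross it.
        if current and v - current[-1] > r:
            comps.append(current)
            current = []
        current.append(v)
    if current:
        comps.append(current)
    return comps

def blocked_neighbors(V, r, B):
    """Vertices of B adjacent to at least one vertex of V."""
    return {u for u in B for v in V if 1 <= abs(u - v) <= r}

p, r, n = 50, 2, 26
d, B1, B2 = ar_blocking_sets(p, r, n)
worst = {}
for name, B in (("B1", B1), ("B2", B2)):
    comps = components_after_deleting(p, r, B)
    worst[name] = max(1 + 2 * len(V) + len(blocked_neighbors(V, r, B)) for V in comps)
```

With these values $d=3$, and the largest level found for either blocking set can be compared directly with $2d+2r+1$ and $n/2$.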

Example 2 ($d$-dimensional Square-lattice Models).

Consider a finite subset of the $d$-dimensional lattice $\mathbb{Z}^{d}$ where pairs of vertices with distance 1 are adjacent. Suppose $n\geq 6+4d$. One could take $B_{1}$ as the grid points whose coordinates sum to an odd number, and $B_{2}$ as the complement of $B_{1}$. The subgraph obtained by deleting $B_{1}$ (or $B_{2}$) consists of isolated vertices, each with a neighborhood of size at most $2d$, so the graph is $(3+2d)$-separated by $B_{1}$ (or $B_{2}$). Since $3+2d\leq n/2$, the graph is $(2,n)$-coverable.
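A small numerical check of the checkerboard construction is sketched below for a cube $\{0,\dots,3\}^{3}$. The grid size is an illustrative choice, the helper names are hypothetical, and the separation level is again taken to be $1+2|V|+|I_{V}\cap B|$ with $|V|=1$, an assumption inferred from the value $3+2d$ in the example.

```python
from itertools import product

def lattice_neighbors(v, side):
    """Grid points at distance 1 from v inside {0, ..., side - 1}^d."""
    out = []
    for axis in range(len(v)):
        for step in (-1, 1):
            c = v[axis] + step
            if 0 <= c < side:
                out.append(v[:axis] + (c,) + v[axis + 1:])
    return out

d, side = 3, 4
vertices = list(product(range(side), repeat=d))
B1 = {v for v in vertices if sum(v) % 2 == 1}   # odd coordinate sums
B2 = set(vertices) - B1                         # even coordinate sums

def separation_level(B):
    kept = [v for v in vertices if v not in B]
    # Every kept vertex must be isolated once B is deleted.
    assert all(u in B for v in kept for u in lattice_neighbors(v, side))
    return max(1 + 2 * 1 + len(lattice_neighbors(v, side)) for v in kept)

levels = (separation_level(B1), separation_level(B2))
```

Interior grid points attain the full $2d$ neighbors, so both levels equal $3+2d$ here; boundary points have fewer neighbors and are less demanding.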

Example 3.

Consider an $m$-colorable graph $G$. Let $V_{1},\dots,V_{m}$ be the vertex sets of the $m$ colors. For any $i\in[m]$, the subgraph obtained by deleting $B_{i}\,:=\,\cup_{\ell\neq i}V_{\ell}$ is the subgraph restricted to $V_{i}$, in which each vertex is isolated. Thus $G$ is $(1+2+\max_{v\in V_{i}}|I_{v}|)$-separated by $B_{i}$. If $n\geq\sum_{i\in[m]}(3+\max_{v\in V_{i}}|I_{v}|)$, the graph is $(m,n)$-coverable. Note that this subsumes Example 2, which has $m=2$, but also applies to many other graphs such as forests, stars, and cycles.
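The sample-size requirement in Example 3 is easy to evaluate for a given proper coloring. The sketch below does so for a star and a 4-cycle; the adjacency dictionaries, color labels and function name are illustrative assumptions.

```python
def required_samples(adjacency, coloring):
    """Sum over colors of 3 + (largest degree among vertices of that color)."""
    for v, nbrs in adjacency.items():
        if any(coloring[u] == coloring[v] for u in nbrs):
            raise ValueError("not a proper coloring")
    worst_degree = {}
    for v, nbrs in adjacency.items():
        c = coloring[v]
        worst_degree[c] = max(worst_degree.get(c, 0), len(nbrs))
    return sum(3 + deg for deg in worst_degree.values())

# Star with center 0 and five leaves: two colors suffice.
leaves = range(1, 6)
star = {0: set(leaves), **{v: {0} for v in leaves}}
star_colors = {0: "a", **{v: "b" for v in leaves}}
n_star = required_samples(star, star_colors)      # (3 + 5) + (3 + 1)

# 4-cycle colored by parity.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
cycle_colors = {v: v % 2 for v in cycle}
n_cycle = required_samples(cycle, cycle_colors)   # (3 + 2) + (3 + 2)
```

For the 4-cycle the requirement is $10$, which matches the bound $n\geq 6+4d$ of Example 2 with $d=1$.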

B.3 Discrete Graphical Models

B.3.1 Details about the Algorithms

We begin by proving the computational complexity of Algorithm 5. For each $j\in B^{c}$ , enumerating all nonempty configurations of $\bm{k}_{I_{j}}$ takes no more than $\prod\limits_{\ell\in I_{j}}{K}_{\ell}$ operations by checking each $\bm{k}_{I_{j}}$ or $n|I_{j}|$ operations by checking each observed $X_{i,I_{j}}$ . The random permutation takes no more than $n$ steps in total, so the overall complexity is $O\left(\sum\limits_{j\in B^{c}}(n+\min(\prod\limits_{\ell\in I_{j}}K_{\ell},n|I_{j}|))\right)$ .

As mentioned at the beginning of Section 3.3, we can generate knockoffs without assuming that the number of covariate categories is finite. First of all, with infinite ${K}_{\ell}$’s, Algorithm 5 can still be used, since Line 3 only needs to enumerate those $\bm{k}_{I_{j}}$ that actually appear in the observed data, of which there are at most $n$. Furthermore, the proof of Theorem 3.4 does not require the ${K}_{\ell}$’s to be finite.
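The cost accounting above (enumerate the observed neighbor configurations, then permute) can be pictured with the following sketch. It groups rows by the observed configuration of the neighbors and permutes the values of column $j$ within each group, which leaves the contingency table of variable $j$ and its neighbors unchanged. This is an illustration consistent with the description in this section, not a transcription of Algorithm 5; the function names and toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def permute_within_strata(x_j, x_neighbors, rng):
    """Permute x_j among rows that share the same neighbor configuration."""
    strata = defaultdict(list)
    for i, config in enumerate(x_neighbors):
        strata[tuple(config)].append(i)     # only observed configurations, at most n
    knockoff = list(x_j)
    for rows in strata.values():
        values = [x_j[i] for i in rows]
        rng.shuffle(values)                 # uniform permutation within the stratum
        for i, value in zip(rows, values):
            knockoff[i] = value
    return knockoff

def contingency_table(x_j, x_neighbors):
    """Counts of (value of variable j, neighbor configuration) over the rows."""
    return Counter((k, tuple(config)) for k, config in zip(x_j, x_neighbors))

rng = random.Random(1)
n = 12
x_neighbors = [(rng.randrange(2), rng.randrange(3)) for _ in range(n)]
x_j = [rng.randrange(3) for _ in range(n)]
x_j_knockoff = permute_within_strata(x_j, x_neighbors, rng)
```

Because the strata are keyed by observed configurations only, the same code works when the number of categories is unbounded.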

B.3.2 Graph Expanding

As mentioned in Section 3.3.1, in Algorithm 5 variables in $B$ are blocked and their knockoffs are trivial. One way to mitigate this drawback is to run Algorithm 5 multiple times with expanded graphs that include the generated knockoff variables.

Specifically, denote by $\bar{G}$ the graph obtained by augmenting $G$ as follows. For each $j\in B^{c}$, we add an edge between every pair of $j$’s neighbors and add to $\bar{G}$ the ‘knockoff vertex’ $\tilde{j}$, which has the same neighborhood as $j$. One can show that $[\bm{X},\tilde{\bm{X}}_{B^{c}}]$ is locally Markov w.r.t. the new graph. Applying Algorithm 5 to $[\bm{X},\tilde{\bm{X}}_{B^{c}}]$ with graph $\bar{G}$ but with a different global cut set $\bar{B}$, which pre-includes $B^{c}$ and also the knockoff vertices, we can generate knockoffs for some of the variables that were blocked in the first run. One can continue to expand the graph to include the new knockoff variables, although the neighborhoods may become so large that the knockoff variables generated are constrained (through conditioning on these large neighborhoods) to be identical to their corresponding original variables. Algorithm 12 formally describes this process, whose validity is guaranteed by Theorem B.4.

Theorem B.4.

Algorithm 12 generates valid knockoffs for model (3.4).

Proof of Theorem B.4.

We will first define some notation to describe the process of graph-expanding, and then write down the joint probability mass function. Finally, we show the p.m.f. remains unchanged when swapping one variable, which suffices to prove the theorem by induction.

Part 1. To streamline the notation, we redefine $Q$ as the number of steps that have actually been taken to expand the graph in Algorithm 5 (rather than the input value). For each $q\in[Q]$, denote by $G^{(q)}$ the augmented graph and by $B^{(q)}$ the blocking set used in the $q$th run of Algorithm 5. Let $V^{(q)}=[p]\setminus B^{(q)}$. Also denote by $D^{(q)}$ the variables for which knockoffs have already been generated before the $q$th run of Algorithm 5. Then $D^{(r)}=\bigcup_{q=1}^{r-1}V^{(q)}$ for any $r\geq 2$ and $D^{(1)}=\emptyset$. Let $I_{j}^{(q)}$ be the neighborhood of $j$ in $G^{(q)}$. For ease of notation, we omit the ranges of $k_{j}$ and $\bm{k}_{A}$ when enumerating them in equations (e.g., taking a product over all their possible values). For an $n\times c$ matrix $\bm{Z}$ and integers $k_{1},\dots,k_{c}$, let $\varrho\left(\bm{Z},k_{1},\dots,k_{c}\right)\;:=\;\sum_{i=1}^{n}\mathbf{1}_{\left\{Z_{i,1}=k_{1},\dots,Z_{i,c}=k_{c}\right\}}$, i.e., the number of rows of $\bm{Z}$ that equal the vector $(k_{1},\dots,k_{c})$.

Part 2. For any $q\in[Q]$ and $j\in V^{(q)}$ , the neighborhood $I_{j}^{(q)}$ consists of three parts:

1. $([p]\cap I_{j}^{(q)})\setminus D^{(q)}$: the neighbors in $[p]$ for which no knockoffs have been generated,

2. $I_{j}^{(q)}\cap D^{(q)}$: the neighbors in $[p]$ for which knockoffs have been generated, and

3. $\{\tilde{\ell}:\ell\in I_{j}^{(q)}\cap D^{(q)}\}$: the neighbors that are knockoffs.

The generation of $\tilde{\bm{X}}_{j}$ by Algorithm 5 is to sample uniformly from all vectors in $[{K}_{j}]^{n}$ such that the contingency table for variable $j$ and its neighbors in $I_{j}^{(q)}$ remains the same if $\bm{X}_{j}$ is replaced by any of these vectors. Define

[TABLE]

and then

[TABLE]

Denote the probability mass function of $X$ by $f(\bm{\mathsf{x}})$ . The joint probability mass of the distribution of $[\bm{X},\tilde{\bm{X}}]$ is

[TABLE]

where the product is partitioned into three parts: the distribution of $\bm{X}$ , the distributions of the knockoff columns generated in each step and the indicator functions for the variables that have no knockoffs generated within the $Q$ steps.

Part 3. It suffices to show that for any $\ell\in[p]$ ,

[TABLE]

If both sides of Equation (B.13) equal zero, it holds trivially. Without loss of generality, we will prove this equation under the assumption that the left hand side is non-zero. One can redefine $[\bm{\mathsf{X}}^{\prime},\tilde{\bm{\mathsf{X}}}^{\prime}]=[\bm{\mathsf{X}},\tilde{\bm{\mathsf{X}}}]_{{\text{swap}(\ell)}}$ and apply the same proof when assuming the right hand side is non-zero.

First, suppose $\ell\in[p]\setminus(\bigcup\limits_{q=1}^{Q}V^{(q)})$ . Since the left hand side is non-zero, by Equation (B.12), $\tilde{\bm{\mathsf{X}}}_{\ell}=\bm{\mathsf{X}}_{\ell}$ and the p.m.f. does not change when swapping $\bm{\mathsf{X}}_{\ell}$ with $\tilde{\bm{\mathsf{X}}}_{\ell}$ .

Second, suppose $\ell\in V^{(q_{\ell})}$ for some $q_{\ell}\in[Q]$ . Since the left hand side of Equation (B.13) is non-zero, then in Equation (B.12), the indicator function in the second part with $q=q_{\ell}$ and $j=\ell$ being non-zero indicates

[TABLE]

The only difference between the two sides of Equation (B.14) is that $\tilde{\bm{\mathsf{X}}}_{\ell}$ is replaced by $\bm{\mathsf{X}}_{\ell}$ in the first column of the matrix. Differences of this kind will recur in the equations shown below.

In the following, we show that every term in the first or second part of the product in Equation (B.12) in which $\bm{\mathsf{X}}_{\ell}$ or $\tilde{\bm{\mathsf{X}}}_{\ell}$ appears remains unchanged when swapping $\bm{\mathsf{X}}_{\ell}$ and $\tilde{\bm{\mathsf{X}}}_{\ell}$.

1. As shown in Section 3.3, there exist some functions $\psi_{\ell}$ and $\theta_{\ell}(k_{\ell},\bm{k}_{I_{\ell}})$’s such that

[TABLE]

Since the initial neighborhood satisfies $I_{\ell}\subseteq I_{\ell}^{(q_{\ell})}$, summing over Equation (B.14) shows that

[TABLE]

for all $k_{\ell},\bm{k}_{I_{\ell}}$. Thus $\prod_{i=1}^{n}f(\bm{\mathsf{x}}_{i})$ remains unchanged when swapping $\bm{\mathsf{X}}_{\ell}$ and $\tilde{\bm{\mathsf{X}}}_{\ell}$.

2. For any $j$ such that $j\in V^{(q_{j})}$ and $\ell\in I_{j}^{(q_{j})}$, the second part of the product in Equation (B.12) involves $\bm{\mathsf{X}}_{\ell}$ or $\tilde{\bm{\mathsf{X}}}_{\ell}$ with the indices $q_{j}$ and $j$. Since $B^{(q_{j})}$ is a blocking set, we have $q_{j}\neq q_{\ell}$.

(a) If $q_{j}>q_{\ell}$, then $\ell\in I_{j}^{(q_{j})}\cap D^{(q_{j})}$. Note that swapping $\bm{\mathsf{X}}_{\ell}$ with $\tilde{\bm{\mathsf{X}}}_{\ell}$ only changes the order of the dimensions of the contingency table formed by $\bm{\mathsf{X}}_{j}$ and $\left[\bm{\mathsf{X}}_{[p]\cap I_{j}^{(q_{j})}},\,\tilde{\bm{\mathsf{X}}}_{I_{j}^{(q_{j})}\cap D^{(q_{j})}}\right]$, and thus does not change whether or not the following system of equations (where only the first column of the first argument of $\varrho$ differs between the two lines) holds:

[TABLE]

Thus the indicator function

[TABLE]

and the cardinal number

[TABLE]

from Equation (B.12) remain unchanged when swapping $\bm{\mathsf{X}}_{\ell}$ and $\tilde{\bm{\mathsf{X}}}_{\ell}$.

(b) Now we show the same conclusion for the remaining $j$ values, which will require a few intermediate steps. If $q_{j}<q_{\ell}$, then $\ell\in I_{j}^{(q_{j})}\setminus D^{(q_{j})}$ and $j\in I_{\ell}^{(q_{\ell})}\cap D^{(q_{\ell})}$. By the graph expanding algorithm, we have $I_{j}^{(q_{j})}\setminus\{\ell\}\subseteq I_{\ell}^{(q_{\ell})}$, and $D^{(q_{j})}\subseteq D^{(q_{\ell})}$. This shows

[TABLE]

and

[TABLE]

Summing over Equation (B.14) and rearranging the columns of the first argument of $\varrho$ , one can conclude that

[TABLE]

where the two lines only differ in the third column of the first argument of $\varrho$ . Summing over Equation (B.15) w.r.t. $k_{\tilde{j}}$ , one can further conclude that

[TABLE]

where the two lines only differ in the second column of the first argument of $\varrho$ , and the first column is $\bm{\mathsf{X}}_{j}$ . Similarly, summing over Equation (B.15) w.r.t. $k_{j}$ , we have

[TABLE]

where again the two lines only differ in the second column of the first argument of $\varrho$ , but now the first column is $\tilde{\bm{\mathsf{X}}}_{j}$ .

Note that $\left|\mathcal{M}_{j}\left(\bm{\mathsf{X}}_{j},\bm{\mathsf{X}}_{([p]\cap I_{j}^{(q_{j})})\setminus D^{(q_{j})}},\bm{\mathsf{X}}_{I_{j}^{(q_{j})}\cap D^{(q_{j})}},\tilde{\bm{\mathsf{X}}}_{I_{j}^{(q_{j})}\cap D^{(q_{j})}}\right)\right|$ is a product of some multinomial coefficients, and each multinomial coefficient depends on a unique combination of $(k_{\ell},\bm{k}_{I_{j}^{(q_{j})}\setminus\{\ell\}})$ and the values of

[TABLE]

These quantities are the ones on the right hand side of Equation (B.16). Thus we conclude that $\left|\mathcal{M}_{j}\left(\bm{\mathsf{X}}_{j},\bm{\mathsf{X}}_{([p]\cap I_{j}^{(q_{j})})\setminus D^{(q_{j})}},\bm{\mathsf{X}}_{I_{j}^{(q_{j})}\cap D^{(q_{j})}},\tilde{\bm{\mathsf{X}}}_{I_{j}^{(q_{j})}\cap D^{(q_{j})}}\right)\right|$ remains unchanged when swapping $\bm{\mathsf{X}}_{\ell}$ and $\tilde{\bm{\mathsf{X}}}_{\ell}$ by checking the terms in Equation (B.16) that appear in the multinomial coefficients.

Note that $\tilde{\bm{\mathsf{X}}}_{j}\in\mathcal{M}_{j}\left(\bm{\mathsf{X}}_{j},\bm{\mathsf{X}}_{([p]\cap I_{j}^{(q_{j})})\setminus D^{(q_{j})}},\bm{\mathsf{X}}_{I_{j}^{(q_{j})}\cap D^{(q_{j})}},\tilde{\bm{\mathsf{X}}}_{I_{j}^{(q_{j})}\cap D^{(q_{j})}}\right)$ if and only if the right hand sides of Equations (B.16) and (B.17) are equal for all $k_{\tilde{j}},k_{\ell},\bm{k}_{I_{j}^{(q_{j})}\setminus\{\ell\}}$, which is equivalent to the left hand sides of these equations being equal, which in turn holds if and only if $\tilde{\bm{\mathsf{X}}}_{j}\in\mathcal{M}_{j}\left(\bm{\mathsf{X}}_{j},\tilde{\bm{\mathsf{X}}}_{\ell},\bm{\mathsf{X}}_{([p]\cap I_{j}^{(q_{j})})\setminus(\{\ell\}\cup D^{(q_{j})})},\bm{\mathsf{X}}_{I_{j}^{(q_{j})}\cap D^{(q_{j})}},\tilde{\bm{\mathsf{X}}}_{I_{j}^{(q_{j})}\cap D^{(q_{j})}}\right)$. Therefore the indicator function $\mathbf{1}_{\left\{\tilde{\bm{\mathsf{X}}}_{j}\in\mathcal{M}_{j}\left(\bm{\mathsf{X}}_{j},\bm{\mathsf{X}}_{([p]\cap I_{j}^{(q_{j})})\setminus D^{(q_{j})}},\bm{\mathsf{X}}_{I_{j}^{(q_{j})}\cap D^{(q_{j})}},\tilde{\bm{\mathsf{X}}}_{I_{j}^{(q_{j})}\cap D^{(q_{j})}}\right)\right\}}$ remains unchanged when swapping $\bm{\mathsf{X}}_{\ell}$ and $\tilde{\bm{\mathsf{X}}}_{\ell}$.

To sum up, Equation (B.13) holds for any $\ell\in[p]$ , and the proof is complete. ∎

B.4 Alternative Knockoff Generations for Discrete Markov Chains

We provide alternative constructions of conditional knockoffs for Markov chains that make use of the simple chain structure. Proofs in this section are deferred to Appendix B.4.3.

We first introduce notation that simplifies the displays. For any $n\times c$ matrix $\bm{Z}$ and any integers $k_{1},\dots,k_{c}$, let $\varrho\left(\bm{Z},k_{1},\dots,k_{c}\right)\;:=\;\sum_{i=1}^{n}\mathbf{1}_{\left\{Z_{i,1}=k_{1},\dots,Z_{i,c}=k_{c}\right\}}$, i.e., the number of rows of $\bm{Z}$ that equal the vector $(k_{1},\dots,k_{c})$.

Suppose the components of $X$ follow a general discrete Markov chain. Then the joint distribution for $n$ i.i.d. samples is

[TABLE]

where $\pi^{(1)}_{k}=\mathbb{P}\left(X_{i,1}=k\right)$ and $\pi^{(j)}_{k,k^{\prime}}=\mathbb{P}\left(\left.X_{i,j}=k^{\prime}\ \right|X_{i,j-1}=k\right)$ are model parameters and $N^{(j)}_{k,k^{\prime}}=\varrho\left([\bm{X}_{j-1},\bm{X}_{j}],k,k^{\prime}\right)$ is the number of samples such that the $(j-1)$th and $j$th components are $k$ and $k^{\prime}$, respectively. So all the $N^{(j)}_{k,k^{\prime}}$’s together form a sufficient statistic, which we denote by $T(\bm{X})$. Although it has some redundant entries (for example, $\sum_{k=1}^{{K}_{j-1}}\sum_{k^{\prime}=1}^{{K}_{j}}N^{(j)}_{k,k^{\prime}}=n$), it is nevertheless minimal, and we prefer to keep the redundant entries for notational convenience.
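As a concrete illustration of the counting function $\varrho$ and the transition counts $N^{(j)}_{k,k^{\prime}}$ that make up $T(\bm{X})$, consider the following sketch. The data matrix is a toy example, columns are indexed from 0 in the code, and the function names are hypothetical.

```python
from collections import Counter

def rho(Z, *ks):
    """Number of rows of Z that equal the vector ks."""
    return sum(1 for row in Z if tuple(row) == ks)

def transition_counts(X):
    """T[j][(k, k')] = number of rows with X[i][j-1] == k and X[i][j] == k' (0-based j >= 1)."""
    p = len(X[0])
    return {j: Counter((row[j - 1], row[j]) for row in X) for j in range(1, p)}

X = [
    [1, 2, 2],
    [1, 1, 2],
    [2, 2, 1],
    [1, 2, 2],
]
T = transition_counts(X)
first_pair = [[row[0], row[1]] for row in X]   # the two-column matrix [X_1, X_2]
```

For each $j$, the counts in `T[j]` sum to $n$, which is the redundancy noted above, and reordering the rows of `X` leaves `T` unchanged.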

Conditional on $T(\bm{X})$ , $\bm{X}$ is uniformly distributed on $\mathcal{Q}\,:=\,\{\bm{\mathsf{W}}\in\prod_{j=1}^{p}[{K}_{j}]^{n}:T(\bm{\mathsf{W}})=T(\bm{X})\}$ . Hereafter, we distinguish notationally between $\bm{\mathsf{X}}$ ’s and $\bm{\mathsf{W}}$ ’s (and $\bm{\tilde{\mathsf{W}}}$ ’s), with the former denoting realized values of the data in $\bm{X}$ and the latter denoting hypothetical such values not necessarily observed in the data. The conditional distribution of $\bm{X}$ can be decomposed as

[TABLE]

where $C_{0}$ only depends on $T(\bm{X})$ . This decomposition implies that conditional on $T(\bm{X})$ , the columns of $\bm{X}$ still comprise a vector-valued Markov chain (see Appendix B.4.3). For ease of notation, in what follows we will write $\mathbb{P}\left(\cdot\right)$ without ‘ $\mid T(\bm{X})$ ’ since we always condition on $T(\bm{X})$ in this section.

B.4.1 SCIP

The sequential conditional independent pairs (SCIP) algorithm from Candès et al. (2018) was introduced in completely general form for any distribution of $X$, with the substantial caveat that actually carrying it out for any given distribution can be quite challenging. Sesia et al. (2018) show how to run SCIP for Markov chains unconditionally. When applied to vectors instead of scalars, SCIP can also be adapted to generate conditional knockoffs for Markov chains because the conditional distribution of $\bm{X}$ is uniform on $\mathcal{Q}$, making it a Markov chain, and conditional knockoffs are simply knockoffs for this conditional distribution.

SCIP sequentially samples $\tilde{\bm{X}}_{j}\sim\mathcal{L}(\bm{X}_{j}|\bm{X}_{{\text{-}j}},\tilde{\bm{X}}_{1:(j-1)})$ independently of $\bm{X}_{j}$, for $j=1,\dots,p$, where $\mathcal{L}(\bm{X}_{j}|\bm{X}_{{\text{-}j}},\tilde{\bm{X}}_{1:(j-1)})$ denotes the conditional distribution of $\bm{X}_{j}$ given $(\bm{X}_{{\text{-}j}},\tilde{\bm{X}}_{1:(j-1)})$. For a Markov chain, this sampling can be reduced to $\tilde{\bm{X}}_{j}\sim\mathcal{L}(\bm{X}_{j}|\bm{X}_{j-1},\bm{X}_{j+1},\tilde{\bm{X}}_{j-1})$. The main computational challenge is to keep track of the following conditional probabilities:

[TABLE]

Algorithm 13 describes how to generate conditional knockoffs for a discrete Markov chain with finite states by SCIP, where the functions $f_{j}(\cdot)$ are computed recursively by the formulas in Proposition B.5. These formulas are different from the ones in Sesia et al. (2018, Proposition 1), in which the authors assume transition probabilities can be evaluated directly.

Proposition B.5.

Define $\frac{0}{0}=0$. We formally write $f_{1}(\bm{\mathsf{W}}_{1},\tilde{\bm{\mathsf{X}}}_{0},\bm{\mathsf{W}}_{2})$ for $f_{1}(\bm{\mathsf{W}}_{1},\bm{\mathsf{W}}_{2})$. Suppose $\tilde{\bm{\mathsf{X}}}$ is a realization of $\tilde{\bm{X}}$ generated by Algorithm 13. Then the following equations hold

[TABLE]

Computing $f_{j}(\bm{\mathsf{W}}_{j},\tilde{\bm{\mathsf{X}}}_{j-1},\bm{\mathsf{W}}_{j+1})$ in Proposition B.5 requires enumerating all possible configurations of $\bm{\mathsf{W}}_{j}\in[{K}_{j}]^{n}$ and $\bm{\mathsf{W}}_{j+1}\in[{K}_{j+1}]^{n}$ , making the total computational complexity of SCIP $O(\sum_{j\leq p-1}({K}_{j}{K}_{j+1})^{n})$ . Due to the $n$ in the exponent, SCIP quickly becomes intractable as the sample size grows, even for binary states and $n\gtrsim 10$ . A simple remedy is to first randomly divide the rows of $\bm{X}$ into disjoint folds of small size around $n_{0}$ , say $n_{0}=10$ , and then run SCIP for each fold separately. This construction conditions on a statistic which is $n/n_{0}$ times as large as that conditioned on before dividing into folds, but the former’s computation time scales linearly with $n$ , instead of exponentially. Conditioning on more should tend to degrade the quality of the knockoffs, but is necessary to enable computation. Still, compared to Algorithm 6, SCIP does not block any variables and thus has the potential to generate better knockoffs.
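To make the effect of splitting into folds concrete, the sketch below evaluates the enumeration count $\sum_{j\leq p-1}({K}_{j}{K}_{j+1})^{n}$ for the full sample and for random folds of size $n_{0}$. The numbers of states, $n$ and $n_{0}$ are illustrative choices, the function names are hypothetical, and the count is only a proxy for running time.

```python
import random

def scip_cost(K, n):
    """Enumeration count: sum over j <= p - 1 of (K_j * K_{j+1}) ** n."""
    return sum((K[j] * K[j + 1]) ** n for j in range(len(K) - 1))

def random_folds(n, n0, rng):
    """Randomly divide the row indices 0..n-1 into disjoint folds of size about n0."""
    rows = list(range(n))
    rng.shuffle(rows)
    return [rows[start:start + n0] for start in range(0, n, n0)]

K = [2, 2, 2]            # a binary chain with p = 3
n, n0 = 100, 10
folds = random_folds(n, n0, random.Random(0))

full_cost = scip_cost(K, n)                                  # exponential in n
folded_cost = sum(scip_cost(K, len(fold)) for fold in folds) # linear in n for fixed n0
```

The folded count is the per-fold count multiplied by the number of folds, which is why it grows linearly in $n$ once $n_{0}$ is fixed.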

B.4.2 Refined Blocking

One can apply Algorithm 6 to Markov Chains, as a 2-colorable graph, with two blocking sets, one with all even numbers and the other with all odd numbers. Instead of running Algorithm 5 in Line 3, a refined blocking algorithm, Algorithm 14, can be used for $1<j<p$ . It introduces more variability in the knockoff generation because it first draws a new contingency table $\tilde{\bm{H}}$ that is conditionally exchangeable with the observed contingency table $\bm{H}$ of $(\bm{X}_{j-1},\bm{X}_{j},\bm{X}_{j+1})$ , and then samples $\tilde{\bm{X}}_{j}$ given $\tilde{\bm{H}}$ . This algorithm constructs a reversible Markov Chain by proposing random walks on the space of contingency tables, moved by $\Delta\bm{H}$ and corrected by acceptance ratio $\alpha$ . In the following, we discuss how to sample $\Delta\bm{H}$ and compute $\alpha$ , and provide a detailed version of Algorithm 14 at the end of this section.

Conditional on $T(\bm{X})$ and $\bm{X}_{B}$ , $\bm{X}_{j}$ is uniformly distributed on all $\bm{\mathsf{W}}_{j}\in[{K}_{j}]^{n}$ such that

[TABLE]

for all $(k_{j-1},k_{j},k_{j+1})$ . In the following, we view $T(\bm{X})$ and $\bm{X}_{B}$ as fixed and only $\bm{X}_{j}$ as being random.

We begin with some notation. To avoid burdensome subscripts, we write $(k_{-},k,k_{+})$ for $(k_{j-1},k_{j},k_{j+1})$. Let $\bm{H}=\bm{H}(\bm{X}_{j})$ be the three-dimensional array with elements $\bm{H}_{k_{-},k,k_{+}}:=\varrho\left(\bm{X}_{(j-1):(j+1)},k_{-},k,k_{+}\right)$ for all $(k_{-},k,k_{+})$. The statistic $\bm{H}$ is essentially a three-way contingency table and its three-dimensional marginals satisfy

[TABLE]

Here $M_{k_{-},k_{+}}$ is a function of $\bm{X}_{B}$ and thus fixed. Conditional on $\bm{H}$ , $\bm{X}_{j}$ is uniform on all vectors in $[{K}_{j}]^{n}$ that match the three-way contingency table. The probability function for $\bm{H}$ is

[TABLE]

where the counts $\bm{\mathsf{H}}_{k_{-},k,k_{+}}$ satisfy $\sum\limits_{k\in[{K}_{j}]}\bm{\mathsf{H}}_{k_{-},k,k_{+}}=M_{k_{-},k_{+}}$ for each pair of $(k_{-},k_{+})\in[{K}_{j-1}]\times[{K}_{j+1}]$ .
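The statement that, conditional on $\bm{H}$ , $\bm{X}_{j}$ is uniform on all vectors matching the table suggests a simple way to draw such a vector: within each group of rows sharing the same neighbouring values $(k_{-},k_{+})$ , shuffle the multiset of middle labels prescribed by $\bm{H}$ . The sketch below illustrates this under the assumption that the table is stored as a dictionary keyed by $(k_{-},k,k_{+})$ ; the function names are hypothetical.

```python
import random
from collections import defaultdict

def table(x_prev, x, x_next):
    """Three-way contingency table as a dict {(k_minus, k, k_plus): count}."""
    H = defaultdict(int)
    for cell in zip(x_prev, x, x_next):
        H[cell] += 1
    return dict(H)

def sample_given_table(x_prev, x_next, H, seed=0):
    """Draw a middle column uniformly among vectors whose table equals H."""
    rng = random.Random(seed)
    rows = defaultdict(list)
    for i, pair in enumerate(zip(x_prev, x_next)):
        rows[pair].append(i)
    x_new = [None] * len(x_prev)
    for (k_minus, k_plus), idx in rows.items():
        # Multiset of middle labels prescribed by H for this (k_minus, k_plus) group.
        labels = [k for (a, k, b), c in H.items()
                  if (a, b) == (k_minus, k_plus) for _ in range(c)]
        assert len(labels) == len(idx)  # the margin of H must equal M_{k_minus, k_plus}
        rng.shuffle(labels)             # uniform arrangement within the group
        for i, k in zip(idx, labels):
            x_new[i] = k
    return x_new

x_prev = [0, 0, 0, 1, 1, 1]
x_mid  = [0, 1, 1, 0, 0, 1]
x_next = [0, 0, 0, 1, 1, 1]
H = table(x_prev, x_mid, x_next)
x_new = sample_given_table(x_prev, x_next, H)
```

Because the groups are shuffled independently and each shuffle is uniform over arrangements, every vector with the prescribed table is equally likely.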

The construction of the Markov chain on contingency tables begins with defining the basic moves: suppose there are $L$ different three-way tables $\{\bm{\Delta}^{(\ell)}\}_{\ell=1}^{L}\subseteq\mathbb{Z}^{{K}_{j-1}\times{K}_{j}\times{K}_{j+1}}$ such that the marginals of each table $\bm{\Delta}^{(\ell)}$ are $0$’s (footnote 17: the set $\{\bm{\Delta}^{(\ell)}\}_{\ell=1}^{L}$ is similar to the Markov bases used in algebraic statistics, see Diaconis et al. (1998), but it does not need to connect every two possible contingency tables):

[TABLE]

A simple set of basic moves, indexed by $\bm{\ell}$ , can be constructed as follows: for each $\bm{\ell}=(r_{1},r_{2},c_{1},c_{2},d_{1},d_{2})$ where $r_{1},r_{2}\in[{K}_{j-1}]$ , $c_{1},c_{2}\in[{K}_{j+1}]$ , $d_{1},d_{2}\in[{K}_{j}]$ and $r_{1}\neq r_{2}$ , $c_{1}\neq c_{2}$ , $d_{1}\neq d_{2}$ , define

[TABLE]
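As an illustration of what a move with vanishing marginals looks like, the sketch below builds one standard candidate, the outer product $(e_{r_{1}}-e_{r_{2}})\otimes(e_{d_{1}}-e_{d_{2}})\otimes(e_{c_{1}}-e_{c_{2}})$ . This form is an assumption made for illustration, since the displayed definition is not reproduced here, and it may differ from the paper's $\bm{\Delta}^{(\bm{\ell})}$ ; the function name `basic_move` and the 0-based indexing are likewise hypothetical.

```python
import numpy as np

def basic_move(shape, r1, r2, d1, d2, c1, c2):
    """Integer table (e_r1 - e_r2) x (e_d1 - e_d2) x (e_c1 - e_c2).

    Axes are ordered (k_minus, k, k_plus); all indices are 0-based.
    """
    K_minus, K_mid, K_plus = shape
    u = np.zeros(K_minus, dtype=int); u[r1] += 1; u[r2] -= 1
    v = np.zeros(K_mid, dtype=int);   v[d1] += 1; v[d2] -= 1
    w = np.zeros(K_plus, dtype=int);  w[c1] += 1; w[c2] -= 1
    return np.einsum("a,b,c->abc", u, v, w)

delta = basic_move((2, 3, 2), r1=0, r2=1, d1=0, d2=2, c1=0, c2=1)

# Summing out any single axis gives an all-zero two-way table, so adding
# delta to a contingency table leaves all of its two-way margins unchanged.
margins_vanish = all((delta.sum(axis=a) == 0).all() for a in range(3))
```

Such a table has eight nonzero entries, four equal to $+1$ and four equal to $-1$ .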

Algorithm 15 is a detailed sampling procedure, whose validity is guaranteed by Proposition B.6.

Proposition B.6.

For $j\in B^{c}$ , if $\tilde{\bm{X}}_{j}$ is drawn from Algorithm 15, then

[TABLE]

A final remark is that one can generalize the refined blocking algorithm to Ising models. By Equation (3.5), the sufficient statistic is the vector that includes all the counts of configurations of adjacent pairs, i.e., $\sum_{i=1}^{n}\mathbf{1}_{\left\{X_{i,s}=k,X_{i,t}=k^{\prime}\right\}}$ for all $k,k^{\prime}\in\{-1,1\}$ and $(s,t)\in E$ . Instead of sampling a three-way contingency table in Algorithm 15, now one has to construct a reversible Markov Chain for the $(2d+1)$ -way contingency table for each vertex and its neighborhood. The basic moves can be constructed similarly as the $\Delta^{(\bm{\ell})}$ given before.

B.4.3 Proofs

Conditional Markov Chains

We first show that conditional on $T(\bm{X})$ , the sequence of $\bm{X}_{j}$ ’s forms a Markov chain, i.e., Equation (B.4) describes a Markov chain. We still write ‘ $\mid T(\bm{X})$ ’ in the probability here to emphasize the dependence on $T(\bm{X})$ .

Summing Equation (B.4) over $\bm{\mathsf{W}}_{p}$ , we have

[TABLE]

thus

[TABLE]

Since the right hand side of the last equation does not involve $\bm{\mathsf{W}}_{1:(p-2)}$ , we conclude

[TABLE]

In addition, let ${\phi}_{p-1}^{\prime}(\bm{\mathsf{W}}_{p-2},\bm{\mathsf{W}}_{p-1}\mid T(\bm{X}))=\phi_{p-1}(\bm{\mathsf{W}}_{p-2},\bm{\mathsf{W}}_{p-1}\mid T(\bm{X}))\sum_{\bm{\mathsf{W}}_{p}^{\prime}}\phi_{p}(\bm{\mathsf{W}}_{p-1},\bm{\mathsf{W}}_{p}^{\prime}\mid T(\bm{X}))$ and, for $j<p-1$ , let $\phi_{j}^{\prime}=\phi_{j}$ . Then Equation (B.20) can be rewritten as

[TABLE]

which has the same form as Equation (B.4). Continuing the same reasoning for $\bm{X}_{p-1},\bm{X}_{p-2},\dots,\bm{X}_{2}$ , we conclude

[TABLE]

that is, the sequence of $\bm{X}_{j}$ ’s is a Markov chain conditional on $T(\bm{X})$ .

SCIP

Proof of Proposition B.5.

The first equation follows from the uniform distribution and the Markovian property

[TABLE]

Next we prove the second equation, except for the case of $j=2$ . However, the second equation with $j=2$ and also the third equation both follow the same proof, by allowing $k_{1}:k_{2}$ for $k_{1}>k_{2}$ to denote the empty set.

Before the proof, we show an implication of Bayes’ rule. For any $\bm{\mathsf{W}}_{j}$ and $\bm{\mathsf{W}}_{j+1}$ ,

[TABLE]

where Equation (B.22) is due to Bayes’ rule, Equation (B.23) is due to the Markovian property and the fact that SCIP sampling of $\tilde{\bm{X}}_{1:(j-2)}$ only depends on $\bm{X}_{1:(j-1)}$ , and Equation (B.24) is because the conditional probability of $\tilde{\bm{X}}_{1:(j-2)}$ does not depend on $\bm{\mathsf{W}}_{j}$ . Note that the normalizing constant in Equation (B.25) depends on $\bm{\mathsf{W}}_{j+1}$ but not on $\bm{\mathsf{W}}_{j}$ .

Now the second equation of Proposition B.5 can be shown as follows

[TABLE]

where the normalizing constant does not depend on $\bm{\mathsf{W}}_{j}$ . Hence we have

[TABLE]

∎

Refined Blocking

Proof of Proposition B.6.

In the following, we view $T(\bm{X})$ and $\bm{X}_{B}$ as fixed and only $\bm{X}_{j}$ being random, and denote this conditional probability by $\mathbb{P}_{j}(\cdot)$ .

We first show $(\bm{H},\tilde{\bm{H}}^{t_{\max}})\,{\buildrel\mathcal{D}\over{=}}\,(\tilde{\bm{H}}^{t_{\max}},\bm{H})$ . Denote the probability mass function in (B.19) as $g(\bm{\mathsf{H}})$ . Since $\tilde{\bm{H}}^{0}=\bm{H}\sim g$ and the transition kernel constructed in Algorithm 15 is in detailed balance with density $g(\bm{\mathsf{H}})$ , $(\tilde{\bm{H}}^{t})_{t=0}^{t_{\max}}$ is reversible. Thus

[TABLE]

By the sampling of $\tilde{\bm{X}}_{j}$ in the algorithm, we have

[TABLE]

and

[TABLE]

Hence

[TABLE]

where the second equality is due to (B.27) and the third equality is due to Equations (B.26) and (B.28). Summing over all $\bm{\mathsf{H}},\tilde{\bm{\mathsf{H}}}$ , we conclude that $(\bm{X}_{j},\tilde{\bm{X}}_{j})\,{\buildrel\mathcal{D}\over{=}}\,(\tilde{\bm{X}}_{j},\bm{X}_{j})\mid\bm{X}_{B}$ . ∎
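The reversibility argument relies only on the transition kernel being in detailed balance with $g$ . The following minimal sketch shows one Metropolis step of that kind; it is not Algorithm 15 itself. It assumes that $g(\bm{\mathsf{H}})\propto\prod_{k_{-},k,k_{+}}1/\bm{\mathsf{H}}_{k_{-},k,k_{+}}!$ on tables with nonnegative entries (the number of vectors producing a given table is a product of multinomial coefficients, and $\bm{X}_{j}$ is uniform), and it uses a symmetric proposal that picks a move and a sign uniformly at random. The names `log_weight` and `metropolis_step` are hypothetical.

```python
import math
import numpy as np

def log_weight(H):
    """Unnormalised log-pmf -sum log(H!); -inf if any count is negative."""
    if (H < 0).any():
        return -math.inf
    return -sum(math.lgamma(int(h) + 1) for h in H.ravel())

def metropolis_step(H, moves, rng):
    """Propose H + s * Delta with a random move Delta and sign s, then accept or reject."""
    delta = moves[rng.integers(len(moves))]
    proposal = H + rng.choice([-1, 1]) * delta
    log_alpha = log_weight(proposal) - log_weight(H)
    # The proposal is symmetric, so the acceptance ratio is min(1, g(proposal)/g(H)).
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal
    return H

rng = np.random.default_rng(0)
u = np.array([1, -1])
moves = [np.einsum("a,b,c->abc", u, u, u)]   # a single zero-margin move on a 2x2x2 table
H0 = np.full((2, 2, 2), 2)
H = H0.copy()
for _ in range(50):
    H = metropolis_step(H, moves, rng)
```

Since proposals with a negative entry receive weight zero and are always rejected, the chain stays on valid tables, and every accepted move preserves the two-way margins of the starting table.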

Appendix C Conditional Hypothesis

This section concerns the hypotheses actually tested by conditional knockoffs. Suppose $(\bm{x}_{i},Y_{i})\stackrel{{\scriptstyle\emph{i.i.d.}}}{{\sim}}(X,Y)$ and $T(\bm{X})$ is a statistic of $\bm{X}$ . The knockoff procedure using conditional knockoffs treats the variables in $\mathcal{H}_{0,T}=\{j\;:\;\bm{y}\mathchoice{\mathrel{\hbox to0.0pt{$ \displaystyle\perp $\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$ \textstyle\perp $\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptstyle\perp $\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptscriptstyle\perp $\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}\bm{X}_{j}\mid\bm{X}_{\text{-}j},T(\bm{X})\}$ as null. It is of interest to compare $\mathcal{H}_{0,T}$ with $\mathcal{H}_{0}$ , the original set of null variables defining the variable selection problem we actually care about.

Proposition C.1.

$\mathcal{H}_{0}\subseteq\mathcal{H}_{0,T}$ .

Proof.

Suppose $j\in\mathcal{H}_{0}$ . For i.i.d. data, $j\in\mathcal{H}_{0}$ implies $Y_{i}\perp\!\!\!\perp X_{i,j}\mid X_{i,{\text{-}j}}$ , which together with the independence among $\{(Y_{i},\bm{x}_{i})\}_{i=1}^{n}$ implies $\bm{y}\perp\!\!\!\perp\bm{X}_{j}\mid\bm{X}_{\text{-}j}$ . Note that $\bm{y}\perp\!\!\!\perp T(\bm{X})\mid\bm{X}$ (since $T(\bm{X})$ is deterministic given $\bm{X}$ ), which together with $\bm{y}\perp\!\!\!\perp\bm{X}_{j}\mid\bm{X}_{\text{-}j}$ implies $\bm{y}\perp\!\!\!\perp(\bm{X}_{j},T(\bm{X}))\mid\bm{X}_{\text{-}j}$ by the contraction property of conditional independence. And by the weak union property of conditional independence, $\bm{y}\perp\!\!\!\perp(\bm{X}_{j},T(\bm{X}))\mid\bm{X}_{\text{-}j}$ implies $\bm{y}\perp\!\!\!\perp\bm{X}_{j}\mid\bm{X}_{\text{-}j},T(\bm{X})$ . Thus $j\in\mathcal{H}_{0,T}$ . This holds for arbitrary $j\in\mathcal{H}_{0}$ and thus $\mathcal{H}_{0}\subseteq\mathcal{H}_{0,T}$ . ∎

The converse is not true in general, for instance if $T(\bm{X})=\bm{X}$ and $\mathcal{H}_{0}=\emptyset$ , then all variables are automatically null conditional on $T(\bm{X})$ and thus $\mathcal{H}_{0}\subsetneq\mathcal{H}_{0,T}$ . In general, when $T(\bm{X})$ does not allow full reconstruction of $\bm{X}_{j}$ it should be rare for a non-null variable $\bm{X}_{j}$ to be null conditional on $T(\bm{X})$ , as this can only happen if there is a perfect synergy of $F_{Y|X}$ and $F_{X}$ so that $F_{Y|X}$ is only a function of $X_{j}$ through a transformation computable from the sufficient statistic $T(\bm{X})$ of $F_{X}$ . For most problems of interest, Theorem C.2 provides a sufficient condition for $\bm{y}\mathbin{\mathchoice{\hbox to0.0pt{\hbox{$ \displaystyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \displaystyle\perp $}}{\hbox to0.0pt{\hbox{$ \textstyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \textstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptstyle\perp $}\hss}\kern 2.36812pt{\not}\kern 2.36812pt\hbox{$ \scriptstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptscriptstyle\perp $}\hss}\kern 1.63437pt{\not}\kern 1.63437pt\hbox{$ \scriptscriptstyle\perp $}}}\bm{X}_{j}\mid\bm{X}_{\text{-}j},T(\bm{X})$ , i.e., $\mathcal{H}_{0}=\mathcal{H}_{0,T}$ : the conditional mean of $Y_{i}$ (or some transformation of $Y_{i}$ ) given $\bm{x}_{i}$ , say $\phi(\bm{x}_{i})$ , should not be deterministic after conditioning on $\bm{X}_{{\text{-}j}}$ and $T(\bm{X})$ .

Theorem C.2.

Suppose for a bounded function $g(y)$ and $\phi(\bm{x}):=\,\mbox{$ \mathbb{E}\left[\left.g(Y)\ \right|X=\bm{x}\right] $}$ , there exist two disjoint Borel sets $B_{1},B_{2}\subset\mathbb{R}^{p}$ such that $\inf_{\bm{x}\in B_{1}}\phi(\bm{x})>\sup_{\bm{x}\in B_{2}}\phi(\bm{x})$ . If for each $j\in[p]$ , it holds with positive probability that

[TABLE]

then $\mathcal{H}_{0}=\mathcal{H}_{0,T}$ .

This theorem is based on Proposition C.1 and the following Proposition C.3. By Proposition C.3, for each $j\notin\mathcal{H}_{0}$ , it holds that $j\notin\mathcal{H}_{0,T}$ ; hence $\mathcal{H}_{0,T}\subseteq\mathcal{H}_{0}$ . In addition, Proposition C.1 shows $\mathcal{H}_{0}\subseteq\mathcal{H}_{0,T}$ , thus $\mathcal{H}_{0}=\mathcal{H}_{0,T}$ .

Proposition C.3.

Suppose $Y\not\!\perp\!\!\!\perp X_{j}\mid X_{{\text{-}j}}$ , and $g(y)$ is a bounded function. Define $K:=\mathbb{E}\left[\left.g(Y_{1})\ \right|\bm{x}_{1}\right]$ , and $M:=\mathbb{E}\left[\left.K\ \right|\bm{X}_{{\text{-}j}},T(\bm{X})\right]$ .

(a)

If $K$ is different from $M$ , then

[TABLE]

(b)

If $K$ can be written as $\phi(\bm{x}_{1})$ and $\phi$ is not constant on the support of the conditional distribution of $\bm{x}_{1}$ given $\bm{X}_{\text{-}j}$ and $T(\bm{X})$ , i.e., there exist two disjoint Borel sets $B_{1},B_{2}$ such that $\inf_{\bm{x}\in B_{1}}\phi(\bm{x})>\sup_{\bm{x}\in B_{2}}\phi(\bm{x})$ , and

[TABLE]

then $K$ is different from $M$ .

To prove this proposition, we need the following lemma.

Lemma C.4.

Suppose $Y\mathbin{\mathchoice{\hbox to0.0pt{\hbox{$ \displaystyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \displaystyle\perp $}}{\hbox to0.0pt{\hbox{$ \textstyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \textstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptstyle\perp $}\hss}\kern 2.36812pt{\not}\kern 2.36812pt\hbox{$ \scriptstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptscriptstyle\perp $}\hss}\kern 1.63437pt{\not}\kern 1.63437pt\hbox{$ \scriptscriptstyle\perp $}}}X$ and $T$ is a function of $X$ . Furthermore, if there exists a bounded function $g$ such that $K\;:=\;\mbox{$ \mathbb{E}\left[\left.g(Y)\ \right|X\right] $}$ is not conditionally deterministic in the following sense:

[TABLE]

then $Y\mathbin{\mathchoice{\hbox to0.0pt{\hbox{$ \displaystyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \displaystyle\perp $}}{\hbox to0.0pt{\hbox{$ \textstyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \textstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptstyle\perp $}\hss}\kern 2.36812pt{\not}\kern 2.36812pt\hbox{$ \scriptstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptscriptstyle\perp $}\hss}\kern 1.63437pt{\not}\kern 1.63437pt\hbox{$ \scriptscriptstyle\perp $}}}X\;|\;T$ .

Proof.

Let $M:=\;\mbox{$ \mathbb{E}\left[\left.K\ \right|T\right] $}$ . Then $M$ is $\sigma(T)$ -measurable. Since $\mbox{$ \mathbb{P}\left(K\neq M\right) $}=\mbox{$ \mathbb{P}\left(K>M\right) $}+\mbox{$ \mathbb{P}\left(K<M\right) $}$ , without loss of generality, we assume $0<\mbox{$ \mathbb{P}\left(K>M\right) $}$ .

We compute $\mathbb{E}\left[\left.g(Y)\mathbf{1}_{\left\{K>M\right\}}\ \right|T\right]$ in two different ways. On the one hand,

[TABLE]

On the other hand, if $Y\mathchoice{\mathrel{\hbox to0.0pt{$ \displaystyle\perp $\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$ \textstyle\perp $\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptstyle\perp $\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptscriptstyle\perp $\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X\;|\;T$ then

[TABLE]

Combining these two expressions shows that if $Y\mathchoice{\mathrel{\hbox to0.0pt{$ \displaystyle\perp $\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$ \textstyle\perp $\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptstyle\perp $\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptscriptstyle\perp $\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X\;|\;T$ then ${\mbox{$ \mathbb{E}\left[\left.(K-M)\mbox{ $\mathbf{1}_{\left\{K>M\right\}}$ }\ \right|T\right] $}\overset{a.s.}{=}0}$ , and thus $\mbox{$ \mathbb{E}\left[(K-M)\mbox{ $\mathbf{1}_{\left\{K>M\right\}}$ }\right] $}{=}0$ . However, this implies $\mbox{$ \mathbb{P}\left(K>M\right) $}=0$ and contradicts the condition; hence $Y\mathbin{\mathchoice{\hbox to0.0pt{\hbox{$ \displaystyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \displaystyle\perp $}}{\hbox to0.0pt{\hbox{$ \textstyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \textstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptstyle\perp $}\hss}\kern 2.36812pt{\not}\kern 2.36812pt\hbox{$ \scriptstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptscriptstyle\perp $}\hss}\kern 1.63437pt{\not}\kern 1.63437pt\hbox{$ \scriptscriptstyle\perp $}}}X\;|\;T$ . ∎

Proof of Proposition C.3.

(a)

The condition that $Y\not\!\perp\!\!\!\perp X_{j}\mid X_{{\text{-}j}}$ implies $Y_{1}\not\!\perp\!\!\!\perp\bm{X}_{j}\mid\bm{X}_{{\text{-}j}}$ (see Lemma C.5 below). Because $\bm{x}_{1}\perp\!\!\!\perp\bm{x}_{2:n}$ , we have $K=\mathbb{E}\left[\left.g(Y_{1})\ \right|\bm{x}_{1},\bm{x}_{2:n}\right]=\mathbb{E}\left[\left.g(Y_{1})\ \right|\bm{X}_{j},\bm{X}_{{\text{-}j}}\right]$ . The condition $\mathbb{P}(K\neq M)>0$ implies that $\mathbb{P}\left(\left.K\neq M\ \right|\bm{X}_{\text{-}j}\right)>0$ holds with positive probability.

To apply Lemma C.4, $\bm{X}_{{\text{-}j}}$ is treated as fixed, and $\bm{X}_{j}$ (resp. $Y_{1}$ ) is treated as $X$ (resp. $Y$ ). Then we have $Y_{1}\not\!\perp\!\!\!\perp\bm{X}_{j}\mid\bm{X}_{{\text{-}j}},T(\bm{X})$ , which immediately implies $\bm{y}\not\!\perp\!\!\!\perp\bm{X}_{j}\mid\bm{X}_{\text{-}j},T(\bm{X})$ .

(b)

The existence of $B_{1}$ and $B_{2}$ implies that there exists a real number $s$ such that

[TABLE]

We will prove by contradiction that $K$ is different from $M$ . Suppose $\mbox{$ \mathbb{P}\left(K\neq M\right) $}=0$ , then $\mbox{$ \mathbb{P}\left(K\neq M\left|\ \bm{X}_{{\text{-}j}},T(\bm{X})\right.\right) $}\overset{a.s.}{=}0$ . Thus a.s. we have

[TABLE]

Similarly $\mathbb{P}\left(X_{1}\in B_{2}\left|\ \bm{X}_{{\text{-}j}},T(\bm{X})\right.\right)\overset{a.s.}{\leq}\mathbf{1}_{\left\{M<s\right\}}$ . Since $\mathbf{1}_{\left\{M>s\right\}}\cdot\mathbf{1}_{\left\{M<s\right\}}=0$ , it follows that

[TABLE]

which contradicts the condition that $0<\mbox{$ \mathbb{P}\left(\mbox{ $\mathbb{P}\left(X_{1}\in B_{i}\left|\ \bm{X}_{{\text{-}j}},T(\bm{X})\right.\right)$ }>0,i=1,2\right) $}$ . Hence we conclude $\mbox{$ \mathbb{P}\left(K\neq M\right) $}>0$ .

∎

Lemma C.5.

If $Y\mathbin{\mathchoice{\hbox to0.0pt{\hbox{$ \displaystyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \displaystyle\perp $}}{\hbox to0.0pt{\hbox{$ \textstyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \textstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptstyle\perp $}\hss}\kern 2.36812pt{\not}\kern 2.36812pt\hbox{$ \scriptstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptscriptstyle\perp $}\hss}\kern 1.63437pt{\not}\kern 1.63437pt\hbox{$ \scriptscriptstyle\perp $}}}X\mid U$ and $(X,Y,U)\mathchoice{\mathrel{\hbox to0.0pt{$ \displaystyle\perp $\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$ \textstyle\perp $\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptstyle\perp $\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptscriptstyle\perp $\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}(V,W)$ , then $Y\mathbin{\mathchoice{\hbox to0.0pt{\hbox{$ \displaystyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \displaystyle\perp $}}{\hbox to0.0pt{\hbox{$ \textstyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \textstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptstyle\perp $}\hss}\kern 2.36812pt{\not}\kern 2.36812pt\hbox{$ \scriptstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptscriptstyle\perp $}\hss}\kern 1.63437pt{\not}\kern 1.63437pt\hbox{$ \scriptscriptstyle\perp $}}}(X,V)\mid(U,W)$ .

Proof.

Suppose $Y\mathchoice{\mathrel{\hbox to0.0pt{$ \displaystyle\perp $\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$ \textstyle\perp $\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptstyle\perp $\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptscriptstyle\perp $\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}(X,V)\mid(U,W)$ . Then

[TABLE]

The condition that $(X,Y,U)\mathchoice{\mathrel{\hbox to0.0pt{$ \displaystyle\perp $\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$ \textstyle\perp $\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptstyle\perp $\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptscriptstyle\perp $\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}(V,W)$ implies $(X,Y,U)\mathchoice{\mathrel{\hbox to0.0pt{$ \displaystyle\perp $\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$ \textstyle\perp $\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptstyle\perp $\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptscriptstyle\perp $\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}W$ , and thus by weak union property of conditional independence, we have

[TABLE]

Equations (C.1) and (C.2) together with the contraction property of conditional independence imply $Y\mathchoice{\mathrel{\hbox to0.0pt{$ \displaystyle\perp $\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$ \textstyle\perp $\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptstyle\perp $\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$ \scriptscriptstyle\perp $\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X\mid U$ . This contradicts the condition, so we conclude that $Y\mathbin{\mathchoice{\hbox to0.0pt{\hbox{$ \displaystyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \displaystyle\perp $}}{\hbox to0.0pt{\hbox{$ \textstyle\perp $}\hss}\kern 3.46875pt{\not}\kern 3.46875pt\hbox{$ \textstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptstyle\perp $}\hss}\kern 2.36812pt{\not}\kern 2.36812pt\hbox{$ \scriptstyle\perp $}}{\hbox to0.0pt{\hbox{$ \scriptscriptstyle\perp $}\hss}\kern 1.63437pt{\not}\kern 1.63437pt\hbox{$ \scriptscriptstyle\perp $}}}(X,V)\mid(U,W)$ . ∎

Appendix D Supplementary Simulations

D.1 Nonlinear Response Models

We re-conduct the simulations of Section 3 with logistic regression, confirming that variable selection using conditional knockoffs allows for general response dependence. The experiments follow the same designs as in Sections 3.1.2, 3.2.2 and 3.3.3, but with binary responses sampled as $Y_{i}\mid\bm{x}_{i}\sim\text{Bernoulli}(\varsigma(\bm{x}_{i}^{\top}\bm{\beta}/\sqrt{n}))$ , where $\varsigma(t)=e^{t}/(1+e^{t})$ is the logistic function, and with slightly larger sample sizes $n$ for the re-conducted simulations of Sections 3.2.2 and 3.3.3.
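The binary response model above can be sketched as follows. The design matrix, dimensions and coefficient values are placeholders chosen for illustration and do not reproduce the designs of Section 3; the function name is hypothetical.

```python
import numpy as np

def sample_logistic_response(X, beta, rng):
    """Draw Y_i ~ Bernoulli(sigma(x_i' beta / sqrt(n))) for each row x_i of X."""
    n = X.shape[0]
    t = X @ beta / np.sqrt(n)
    prob = 1.0 / (1.0 + np.exp(-t))   # equals e^t / (1 + e^t)
    return rng.binomial(1, prob), prob

rng = np.random.default_rng(0)
n, p = 200, 50                        # illustrative sizes
X = rng.standard_normal((n, p))       # placeholder covariates
beta = np.zeros(p)
beta[:5] = 4.0                        # placeholder nonzero coefficients
y, prob = sample_logistic_response(X, beta, rng)
```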

D.2 Varying the Sparsity and Magnitude of the Regression Coefficients

The following simulations reproduce Figure 2a but with varying sparsities and magnitudes. Specifically, the sparsity level $k$ varies between $30$ , $60$ , and $90$ , and the nonzero entries are randomly sampled from Unif $(1,2)$ . The message from these experiments is the same as those in the main paper, that is, the power of conditional knockoffs is almost the same as that of unconditional knockoffs even though it does not know the exact distribution of $\bm{X}$ .
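A coefficient vector of the kind described here could be generated as in the sketch below. The dimension $p$ and the uniformly random choice of the support are assumptions for illustration, and any further rescaling or sign pattern used in the paper's experiments is not reproduced.

```python
import numpy as np

def make_coefficients(p, k, rng):
    """Coefficient vector with k nonzero entries drawn from Unif(1, 2)."""
    beta = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)   # random support (assumed)
    beta[support] = rng.uniform(1.0, 2.0, size=k)
    return beta

rng = np.random.default_rng(0)
betas = {k: make_coefficients(1000, k, rng) for k in (30, 60, 90)}
```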

D.3 Power of Different Sufficient Statistics

We provide the following experiment to examine the power performance of conditional knockoffs that are generated using different sufficient statistics. It confirms our intuition that conditioning on more generally leads to lower power.

Specifically, for a Gaussian graphical model, we have run Algorithm 4 for a sequence of nested sufficient statistics to see how this choice affects the power. In the following simulation, $X$ is sampled from an AR $(1)$ distribution with autocorrelation coefficient 0.3, and models of (nonstationary) AR( $q$ ) with various $q\geq 1$ are used to model $X$ , i.e., each model assumes a banded precision matrix with bandwidth $q$ , and we increase $q$ beyond 1 to study the effect of more conditioning. Since the models are nested, all of them lead to valid conditional knockoffs. As $q$ grows, the graphical model gets denser and the sufficient statistic conditioned on in Algorithm 4 contains more elements (and always contains all the elements of the sufficient statistic conditioned on for all smaller $q$ ), which can be done by choosing two increasing sequences of blocking sets for the two split data folds and making sure that these two sequences never intersect with each other. Thus, we expect to see some loss of power when $q$ increases. We chose $n=400$ , $p=2000$ , and the algorithmic parameter $n^{\prime}=160$ , and produced results for a range of $Y\mid X$ ’s linear model coefficient amplitudes and for $q$ ranging from 1 to 30; see Figure 10. Although a larger value of $q$ indeed lowers the power, the loss is relatively small in this example, even though conditional knockoffs with $q=30$ condition on far more than those with $q=1$ .
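The nesting claim can be checked numerically on a small example: the precision matrix of a stationary AR $(1)$ covariance is tridiagonal, so it lies inside every banded model with bandwidth $q\geq 1$ . The sketch below uses a small dimension for speed and hypothetical helper names.

```python
import numpy as np

def ar1_covariance(p, rho):
    """Covariance with entries rho**|i-j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def bandwidth(M, tol=1e-8):
    """Largest |j-k| such that |M[j, k]| exceeds tol."""
    j, k = np.nonzero(np.abs(M) > tol)
    return int(np.max(np.abs(j - k)))

Sigma = ar1_covariance(20, 0.3)
Omega = np.linalg.inv(Sigma)
bw = bandwidth(Omega)   # bandwidth of the true precision matrix
```

Here `bw` equals 1, so the true distribution belongs to each working model with $q\geq 1$ , which is why all of them yield valid conditional knockoffs.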

D.4 Gaussian Graphical Models with Unknown Edge Supersets

The conditional knockoff generation in Section 3.2 requires knowing a superset of the true edge set of the Gaussian graphical model. We present the following experiment to preliminarily examine the idea mentioned in Remark 3 for when such a superset is not known a priori and one has to estimate the edge set (or its superset) using the data.

Suppose the true covariance matrix $\bm{\Sigma}$ is a rescaled (to have diagonal entries equal to 1) version of $\bm{\Sigma}^{(0)}$ , where $(\bm{\Sigma}^{(0)})_{j,k}^{-1}=\bm{1}_{j=k}-\frac{1}{7}\bm{1}_{1\leq|j-k|\leq 3}$ . In other words, every node in the true graph is connected to its 6 nearest neighbors. In the following simulations, we set $p=400$ and $n=200$ .
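The covariance construction just described can be written out as a short sketch. The function name is hypothetical, and the rescaled precision matrix is obtained from the identity $(D\bm{\Sigma}^{(0)}D)^{-1}=D^{-1}(\bm{\Sigma}^{(0)})^{-1}D^{-1}$ for a diagonal matrix $D$ .

```python
import numpy as np

def banded_precision(p, width=3, value=-1.0 / 7.0):
    """Matrix with unit diagonal and `value` wherever 1 <= |j-k| <= width."""
    idx = np.arange(p)
    dist = np.abs(idx[:, None] - idx[None, :])
    Omega0 = np.where((dist >= 1) & (dist <= width), value, 0.0)
    Omega0[idx, idx] = 1.0
    return Omega0

p = 400
Omega0 = banded_precision(p)            # inverse of Sigma^(0)
Sigma0 = np.linalg.inv(Omega0)
d = 1.0 / np.sqrt(np.diag(Sigma0))
Sigma = Sigma0 * np.outer(d, d)         # rescaled to unit diagonal
Omega = Omega0 / np.outer(d, d)         # precision matrix of Sigma
degree = (Omega != 0).sum(axis=1) - 1   # number of neighbours of each node
```

Each row of `Omega0` has off-diagonal absolute sum at most $6/7<1$ , so the matrix is strictly diagonally dominant and hence positive definite; rescaling does not change the sparsity pattern, so interior nodes keep their 6 neighbours.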

We can estimate the edge set $E$ by the nonzero entries $\hat{E}$ of the estimated precision matrix using the graphical Lasso (Friedman et al., 2008), which is implemented via the R package huge. The tuning parameter of the graphical Lasso is selected by the standard method StARS (Liu et al., 2010). Once $\hat{E}$ is computed, we can then construct conditional knockoffs as if $\hat{E}$ were given. The blocking set used in our algorithms is obtained by Algorithm 2 with input $n^{\prime}=80$ . We additionally consider the case where a set of $n_{u}=1,600$ unlabeled data points is available and is used together with the labeled covariates to estimate the edge set.

The FDR and power curves are shown in Figure 11. “Label Cond.” refers to conditional knockoffs generated with $\hat{E}$ estimated using only labeled data, and “Unlabel. Cond.” refers to conditional knockoffs with $\hat{E}$ estimated additionally with the unlabeled data. Both methods control the FDR and in fact are conservative. One might attribute the FDR control to an over-conservative choice of the graphical Lasso tuning parameter, but in fact $\hat{E}$ estimated with just the labeled data, although it tended to find a larger graph than the truth (its maximal degree was often above 20), also missed around 40 true edges on average. Unsurprisingly, the use of unlabeled data improves the power by improving the estimate of the edge set, with far fewer false negative and false positive edges.

D.5 Robustness to Model Misspecification

The current paper focuses on the cases where the models for the covariates are known and well-specified. In practice, practitioners may not know what the true model is. Here we provide an experiment to examine the robustness of Gaussian conditional knockoffs ( $n>2p$ ). The following robustness experiment constructs a set of distributions that approximate a multivariate Gaussian by discretizing it at different resolutions by varying a parameter $K$ .

We first generate $X^{(0)}\sim\mathcal{N}(0,\bm{\Sigma})$ where $\bm{\Sigma}_{i,j}=0.3^{|i-j|}$ , and then discretize each coordinate as follows

[TABLE]

In other words, $X^{(0)}_{j}$ is rounded to the nearest $\frac{1}{K}$ -grid value. Since $|X_{j}-X^{(0)}_{j}|\leq\frac{1}{K}$ , the larger $K$ , the closer $X$ is to a multivariate Gaussian vector, and indeed as $K\rightarrow\infty$ , $X\rightarrow X^{(0)}$ and becomes multivariate Gaussian. However, for small $K$ , the distribution is very far from Gaussian. For conditional knockoffs, we pretend that $X$ is drawn from a multivariate Gaussian distribution and directly apply Algorithm 3.1. To get a baseline for power (since changing $K$ not only affects the model misspecification, but also changes the nature of the data-generating distribution and thus the power of any procedure), we also generate exactly-valid unconditional knockoffs for $X$ by discretizing an unconditional knockoff of $X^{(0)}$ with the same $K$ (of course this procedure would be impossible in practice, since $X^{(0)}$ is unobserved).
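The rounding step can be sketched as below, assuming that "rounded to the nearest $\frac{1}{K}$ -grid value" means $X_{j}=\mathrm{round}(KX^{(0)}_{j})/K$ and that $K=\infty$ means no discretization; the function name and the sample used are illustrative.

```python
import numpy as np

def discretize(x0, K):
    """Round each entry of x0 to the nearest multiple of 1/K (K = inf: unchanged)."""
    x0 = np.asarray(x0, dtype=float)
    if np.isinf(K):
        return x0.copy()
    return np.round(K * x0) / K

rng = np.random.default_rng(0)
x0 = rng.standard_normal(1000)        # stand-in for one coordinate of X^(0)
x_coarse = discretize(x0, 0.5)        # grid of multiples of 2
x_fine = discretize(x0, 3)            # grid of multiples of 1/3
x_exact = discretize(x0, np.inf)
```

Under this reading the rounding error is at most $\frac{1}{2K}$ , which is consistent with the bound $\frac{1}{K}$ stated above, and $K=1/2$ maps a standard Gaussian coordinate to multiples of 2.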

We fix $p=1,000$ , linear model coefficient amplitude at $4$ , vary $n\in\{2001,3000,4000\}$ and vary $K\in\{1/2,1,2,3,\infty\}$ . The other details of the experiment are the same as in Figure 1a of the paper, where the response is drawn from $Y_{i}\mid X_{i}\sim N(X_{i}^{\top}\bm{\beta}/\sqrt{n},1)$ . The result is shown in Figure 12.

Note that $K=1/2$ produces a distribution that is almost entirely concentrated on just three values $\{-2,0,2\}$ , making it extremely non-Gaussian, yet the FDR is controlled quite well for all values of $n$ at this $K$ value and all others. The power difference between conditional knockoffs and unconditional knockoffs is also quite insensitive to $K$ and, as seen in all other simulations, quite small for all $n$ except when $n\approx 2p$ (in the $n\approx 2p$ setting the power gap is substantial, although conditional knockoffs still has quite a bit of power).

Bibliography

Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist., 43(5):2055–2085.
Barber, R. F., Candès, E. J., and Samworth, R. J. (2018). Robust inference with knockoffs. arXiv preprint arXiv:1801.03896.
Bates, S., Candès, E., Janson, L., and Wang, W. (2020). Metropolized knockoff sampling. Journal of the American Statistical Association, pages 1–15.
Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650.
Belloni, A., Chernozhukov, V., and Kato, K. (2015). Uniform post-selection inference for least absolute deviation regression and other z-estimation problems. Biometrika, 102(1):77–94.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300.
Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist., 29(4):1165–1188.
Berrett, T. B., Wang, Y., Barber, R. F., and Samworth, R. J. (2018). The conditional permutation test. arXiv preprint arXiv:1807.05405.
