Selective Inference for Testing Trees and Edges in Phylogenetics

Hidetoshi Shimodaira; Yoshikazu Terada

arXiv:1902.04964·stat.AP·May 27, 2019

TL;DR

This paper introduces a selective inference method for testing phylogenetic trees and edges, adjusting for selection bias, and providing more accurate p-values for phylogenetic model testing.

Contribution

It proposes a novel selective inference framework that improves upon previous tests by controlling for selection bias in phylogenetic analysis.

Findings

01

Selective p-values effectively control type-I error conditioned on selection.

02

The method is applicable to broader model selection problems.

03

Illustrated with a controversial phylogenetic case study.

Tables

Table 1: Three types of $p$-values (BP, AU, SI) and geometric quantities ($\beta_{0},\beta_{1}$) for the best 20 trees. Standard errors are shown in parentheses. Boldface indicates significance ($p<0.05$) for the null hypothesis that the tree is true (outside mode). For the rest of trees (T21$,\ldots,$T105), $p$-values are very small ($p<0.001$).

tree	BP	AU	SI	$β_{0}$	(SE)	$β_{1}$	(SE)	topology	edges
T1^†	0.559 (0.001)	0.752 (0.001)	0.372 (0.001)	$- 0.41$	(0.00)	$0.27$	(0.00)	(((1(23))4)56)	E1,E2,E3
T2	0.304 (0.000)	0.467 (0.001)	0.798 (0.001)	$0.30$	(0.00)	$0.22$	(0.00)	((1((23)4))56)	E1,E2,E4
T3	0.038 (0.000)	0.126 (0.002)	0.202 (0.003)	$1.46$	(0.01)	$0.32$	(0.00)	(((14)(23))56)	E1,E2,E5
T4	0.014 (0.000)	0.081 (0.002)	0.124 (0.003)	$1.79$	(0.01)	$0.40$	(0.01)	((1(23))(45)6)	E1,E3,E6
T5	0.032 (0.000)	0.127 (0.002)	0.199 (0.003)	$1.50$	(0.01)	$0.36$	(0.00)	(1((23)(45))6)	E1,E6,E7
T6	0.005 (0.000)	0.032 (0.002)	0.050 (0.002)	$2.21$	(0.02)	$0.35$	(0.01)	(1(((23)4)5)6)	E1,E4,E7
T7^‡	0.015 (0.000)	0.100 (0.003)	0.150 (0.003)	$1.72$	(0.01)	$0.44$	(0.01)	((1(45))(23)6)	E1,E6,E8
T8	0.001 (0.000)	0.011 (0.001)	0.016 (0.002)	$2.74$	(0.03)	$0.43$	(0.02)	((15)((23)4)6)	E1,E4,E9
T9	0.000 (0.000)	0.001 (0.000)	0.001 (0.000)	$3.67$	(0.09)	$0.46$	(0.04)	(((1(23))5)46)	E1,E3,E10
T10	0.002 (0.000)	0.022 (0.002)	0.033 (0.002)	$2.43$	(0.02)	$0.42$	(0.01)	(((15)4)(23)6)	E1,E8,E9
T11	0.000 (0.000)	0.004 (0.001)	0.006 (0.002)	$3.14$	(0.07)	$0.51$	(0.03)	(((14)5)(23)6)	E1,E5,E8
T12	0.000 (0.000)	0.000 (0.000)	0.001 (0.000)	$3.78$	(0.09)	$0.41$	(0.04)	(((15)(23))46)	E1,E9,E10
T13	0.000 (0.000)	0.000 (0.000)	0.001 (0.001)	$3.96$	(0.19)	$0.54$	(0.09)	(1(((23)5)4)6)	E1,E7,E11
T14	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$4.66$	(0.31)	$0.65$	(0.12)	((14)((23)5)6)	E1,E5,E11
T15	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$5.28$	(0.34)	$0.43$	(0.11)	((1((23)5))46)	E1,E10,E11
T16	0.000 (0.000)	0.000 (0.000)	0.001 (0.000)	$3.63$	(0.04)	$0.23$	(0.01)	((((13)2)4)56)	E2,E3,E12
T17	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$3.81$	(0.04)	$0.22$	(0.01)	((((12)3)4)56)	E2,E3,E13
T18	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$4.33$	(0.10)	$0.34$	(0.03)	(((13)2)(45)6)	E3,E6,E12
T19	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$4.36$	(0.11)	$0.32$	(0.04)	(((12)3)(45)6)	E3,E6,E13
T20	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$3.90$	(0.12)	$0.44$	(0.05)	(((1(45))2)36)	E6,E8,E14

Table 2: Three types of $p$-values (BP, AU, SI) and geometric quantities ($\beta_{0},\beta_{1}$) for all the 25 edges of six taxa. Standard errors are shown in parentheses. Boldface without underline indicates significance ($p<0.05$) for the null hypothesis that the edge is true (outside mode). Boldface with underline indicates significance ($p>0.95$) for the null hypothesis that the edge is not true (inside mode).

edge	BP	AU	SI	$β_{0}$	(SE)	$β_{1}$	(SE)	clade
E1^†‡	1.000 (0.000)	1.000 (0.000)	1.000 (0.000)	$- 3.87$	(0.03)	$0.16$	(0.01)	-++---
E2^†	0.930 (0.000)	0.956 (0.001)	0.903 (0.001)	$- 1.59$	(0.00)	$0.12$	(0.00)	++++--
E3^†	0.580 (0.001)	0.719 (0.001)	0.338 (0.001)	$- 0.39$	(0.00)	$0.19$	(0.00)	+++---
E4	0.318 (0.000)	0.435 (0.001)	0.775 (0.001)	$0.32$	(0.00)	$0.16$	(0.00)	-+++--
E5	0.037 (0.000)	0.124 (0.002)	0.198 (0.002)	$1.47$	(0.01)	$0.32$	(0.00)	+--+--
E6^‡	0.060 (0.000)	0.074 (0.001)	0.141 (0.002)	$1.50$	(0.00)	$0.05$	(0.00)	---++-
E7	0.038 (0.000)	0.091 (0.002)	0.154 (0.002)	$1.56$	(0.01)	$0.22$	(0.00)	-++++-
E8^‡	0.018 (0.000)	0.068 (0.002)	0.110 (0.003)	$1.80$	(0.01)	$0.31$	(0.01)	+--++-
E9	0.003 (0.000)	0.014 (0.001)	0.023 (0.002)	$2.48$	(0.02)	$0.27$	(0.02)	+---+-
E10	0.000 (0.000)	0.000 (0.000)	0.001 (0.000)	$3.72$	(0.07)	$0.29$	(0.03)	+++-+-
E11	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$4.31$	(0.10)	$0.35$	(0.03)	-++-+-
E12	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$3.68$	(0.05)	$0.17$	(0.02)	+-+---
E13	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$3.90$	(0.04)	$0.15$	(0.02)	++----
E14	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$4.03$	(0.09)	$0.30$	(0.04)	++-++-
E15	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$4.03$	(0.13)	$0.38$	(0.06)	+-+++-
E16	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$4.44$	(0.05)	$0.12$	(0.01)	-+-+--
E17	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$4.70$	(0.07)	$0.19$	(0.02)	++-+--
E18	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$3.94$	(0.09)	$0.26$	(0.04)	-+-++-
E19	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$5.23$	(0.43)	$0.57$	(0.13)	--++--
E20	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$5.66$	(0.29)	$0.28$	(0.09)	+-++--
E21	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$6.38$	(0.33)	$0.24$	(0.08)	--+++-
E22	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$5.62$	(0.21)	$0.17$	(0.07)	--+-+-
E23	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$4.86$	(0.43)	$0.70$	(0.13)	-+--+-
E24	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$5.61$	(0.17)	$0.23$	(0.04)	+-+-+-
E25	0.000 (0.000)	0.000 (0.000)	0.000 (0.000)	$6.32$	(0.71)	$0.52$	(0.20)	++--+-

Table 3: The number of regions for trees and edges. The number of taxa is $N=6$.

	inside mode		outside mode
	tree	edge	tree	edge
$K_{select}$	1	3	104	22
$K_{true}$	104	22	1	3
$K_{all}$	105	25	105	25

Equations

$$Z\sim N(\theta,1).$$

$$p(z):=P(Z>z\mid\theta=0)=\bar{\Phi}(z).$$

$$p(z,c):=P(Z>z\mid Z>c,\theta=0)=\bar{\Phi}(z)/\bar{\Phi}(c),$$

$$\bm{Y}\sim N_{m+1}(\bm{\mu},\bm{I}_{m+1}).$$

$$\bm{\hat{\mu}}=\operatorname*{arg\,min}_{\bm{\mu}\in\partial\mathcal{R}}\|\bm{y}-\bm{\mu}\|,$$

$$\bm{Y}^{*}\sim N_{m+1}(\bm{y},\bm{I}_{m+1}),$$

$$\text{BP}(\mathcal{R}|\bm{y}):=P(\bm{Y}^{*}\in\mathcal{R}\mid\bm{y}).$$

$$\text{BP}(\mathcal{R}|\bm{y})\simeq\bar{\Phi}(\beta_{0}+\beta_{1}),$$

$$\text{BP}(\mathcal{R}^{C}|\bm{y})=1-\text{BP}(\mathcal{R}|\bm{y})\simeq 1-\bar{\Phi}(\beta_{0}+\beta_{1})=\bar{\Phi}(-\beta_{0}-\beta_{1}).$$

$$\text{AU}(\mathcal{R}|\bm{y}):=P(\beta_{0}(\bm{Y})>\beta_{0}\mid\bm{\mu}=\bm{\hat{\mu}})=\text{BP}(\{\bm{Y}\mid\beta_{0}(\bm{Y})>\beta_{0}\}\mid\bm{\hat{\mu}}),$$

$$\text{AU}(\mathcal{R}|\bm{y})\simeq\text{BP}(\mathcal{R}^{C}|\bm{y}^{\prime})\simeq\bar{\Phi}(\beta_{0}-\beta_{1}),$$

$$P\bigl(\text{AU}(\mathcal{R}|\bm{Y})<\alpha\mid\bm{\mu}\in\partial\mathcal{R}\bigr)\simeq\alpha,$$

$$\text{AU}(\mathcal{R}^{C}|\bm{y})\simeq\bar{\Phi}(-\beta_{0}+\beta_{1})\simeq 1-\text{AU}(\mathcal{R}|\bm{y}).$$

$$\bm{Y}^{*}\sim N_{m+1}(\bm{y},\sigma^{2}\bm{I}_{m+1}),$$

$$\text{BP}_{\sigma^{2}}(\mathcal{R}|\bm{y}):=P_{\sigma^{2}}(\bm{Y}^{*}\in\mathcal{R}\mid\bm{y}),$$

$$\text{BP}_{\sigma^{2}}(\mathcal{R}|\bm{y})\simeq\bar{\Phi}(\beta_{0}\sigma^{-1}+\beta_{1}\sigma).$$

$$\psi_{\sigma^{2}}(\mathcal{R}|\bm{y}):=\sigma\bar{\Phi}^{-1}(\text{BP}_{\sigma^{2}}(\mathcal{R}|\bm{y}))\simeq\beta_{0}+\beta_{1}\sigma^{2}.$$

$$\text{SI}(\mathcal{R}|\bm{y}):=\frac{P(\beta_{0}(\bm{Y})>\beta_{0}\mid\bm{\mu}=\bm{\hat{\mu}})}{P(\bm{Y}\in\mathcal{R}^{C}\mid\bm{\mu}=\bm{\hat{\mu}})}=\frac{\text{AU}(\mathcal{R}|\bm{y})}{\text{BP}(\mathcal{R}^{C}|\bm{\hat{\mu}})}.$$

$$\text{SI}(\mathcal{R}|\bm{y})\simeq\frac{\bar{\Phi}(\beta_{0}-\beta_{1})}{\bar{\Phi}(-\beta_{1})},$$

$$P\bigl(\text{SI}(\mathcal{R}|\bm{Y})<\alpha\mid\bm{Y}\in\mathcal{R}^{C},\bm{\mu}\in\partial\mathcal{R}\bigr)\simeq\alpha,$$

$$\text{SI}(\mathcal{R}^{C}|\bm{y})=\frac{\text{AU}(\mathcal{R}^{C}|\bm{y})}{\text{BP}(\mathcal{R}|\bm{\hat{\mu}})}\simeq\frac{\bar{\Phi}(-\beta_{0}+\beta_{1})}{\bar{\Phi}(\beta_{1})}.$$

$$\text{SI}^{\prime}(\mathcal{R}|\bm{y}):=\begin{cases}\text{SI}(\mathcal{R}|\bm{y})&\bm{y}\in\mathcal{R}^{C}\\ 1-\text{SI}(\mathcal{R}^{C}|\bm{y})&\bm{y}\in\mathcal{R}\end{cases}$$

$$\beta_{0}$$

$$\beta_{1}$$

$$\text{SI}(\mathcal{R}|\bm{y})$$

$$\text{SI}(\mathcal{R}^{C}|\bm{y})$$

$$\xi_{ti}=\log p_{i}(\bm{x}_{t};\bm{\hat{\theta}}_{i}),\quad t=1,\ldots,n,\quad i=1,\ldots,105,$$

$$\ell_{i}(\bm{\hat{\theta}}_{i}^{*};\mathcal{X}^{*}_{n^{\prime}})\approx\ell_{i}(\bm{\hat{\theta}}_{i};\mathcal{X}^{*}_{n^{\prime}})=\sum_{t=1}^{n}w_{t}^{*}\xi_{ti},$$

$$\text{BP}(\text{T}i,n^{\prime})=\#\{\hat{i}^{*b}=i,\ b=1,\ldots,B\}/B.$$

$$\bm{Y}^{*}=\bm{V}_{n}^{-1/2}\bm{L}^{*}_{n^{\prime}},$$

$$\text{var}\Bigl(\ell_{i}(\bm{\hat{\theta}}_{i};\mathcal{X}_{n})-\ell_{j}(\bm{\hat{\theta}}_{j};\mathcal{X}_{n})\Bigr)\approx\|\bm{\xi}_{i}-\bm{\xi}_{j}\|^{2},$$

Full text

Selective Inference for Testing Trees and Edges in Phylogenetics

Hidetoshi Shimodaira$^{1,3}$

Yoshikazu Terada$^{2,3}$

$^{1}$Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Sakyo-ku, Kyoto, 606-8501, Japan

$^{2}$Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama-cho, Toyonaka, Osaka 560-8531, Japan

$^{3}$Mathematical Statistics Team, RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan

Abstract

Selective inference is considered for testing trees and edges in phylogenetic tree selection from molecular sequences. This improves the previously proposed approximately unbiased test by adjusting the selection bias when testing many trees and edges at the same time. The newly proposed selective inference $p$ -value is useful for testing selected edges to claim that they are significantly supported if $p>1-\alpha$ , whereas the non-selective $p$ -value is still useful for testing candidate trees to claim that they are rejected if $p<\alpha$ . The selective $p$ -value controls the type-I error conditioned on the selection event, whereas the non-selective $p$ -value controls it unconditionally. The selective and non-selective approximately unbiased $p$ -values are computed from two geometric quantities called signed distance and mean curvature of the region representing tree or edge of interest in the space of probability distributions. These two geometric quantities are estimated by fitting a model of scaling-law to the non-parametric multiscale bootstrap probabilities. Our general method is applicable to a wider class of problems; phylogenetic tree selection is an example of model selection, and it is interpreted as the variable selection of multiple regression, where each edge corresponds to each predictor. Our method is illustrated in a previously controversial phylogenetic analysis of human, rabbit and mouse.

Keywords: Statistical hypothesis testing, Multiple testing, Selection bias, Model selection, Akaike information criterion, Bootstrap resampling, Hierarchical clustering, Variable selection

1 Introduction

A phylogenetic tree is a diagram showing evolutionary relationships among species, and a tree topology is a graph obtained from the phylogenetic tree by ignoring the branch lengths. The primary objective of any phylogenetic analysis is to approximate a topology that reflects the evolutionary history of the group of organisms under study. Branches of the tree are also referred to as edges in the tree topology. Given a rooted tree topology, or an unrooted tree topology with an outgroup, each edge splits the tree so that it defines the clade consisting of all the descendant species. Therefore, edges in a tree topology represent clades of species. Because the phylogenetic tree is commonly inferred from molecular sequences, it is crucial to assess the statistical confidence of the inference. In phylogenetics, it is a common practice to compute confidence levels for tree topologies and edges. For example, the bootstrap probability (Felsenstein, 1985) is the most commonly used confidence measure, and other methods such as the Shimodaira-Hasegawa test (Shimodaira and Hasegawa, 1999) and the multiscale bootstrap method (Shimodaira, 2002) are also often used. However, these conventional methods are limited in how well they address the issue of multiplicity when there are many alternative topologies and edges. Herein, we discuss a new approach, selective inference (SI), that is designed to address the issue of multiplicity.

To illustrate the idea of selective inference, we first look at a simple example of a one-dimensional normal random variable $Z$ with unknown mean $\theta\in\mathbb{R}$ and variance 1:

$$Z\sim N(\theta,1). \tag{1}$$

Observing $Z=z$ , we would like to test the null hypothesis $H_{0}:\theta\leq 0$ against the alternative hypothesis $H_{1}:\theta>0$ . We denote the cumulative distribution function of $N(0,1)$ as $\Phi(x)$ and define the upper tail probability as $\bar{\Phi}(x)=1-\Phi(x)=\Phi(-x)$ . Then, the ordinary (i.e., non-selective) inference leads to the $p$ -value of the one-tailed $z$ -test as

$$p(z):=P(Z>z\mid\theta=0)=\bar{\Phi}(z). \tag{2}$$

What happens when we test many hypotheses at the same time? Consider random variables $Z_{i}\sim N(\theta_{i},1)$, $i=1,\ldots,K_{\text{all}}$, not necessarily independent, with null hypotheses $\theta_{i}\leq 0$, where $K_{\text{true}}$ hypotheses are actually true. To control the number of false rejections among the $K_{\text{true}}$ true hypotheses, there are several multiplicity-adjusted approaches, such as controlling the family-wise error rate (FWER) or the false discovery rate (FDR). Instead of testing all the $K_{\text{all}}$ hypotheses, selective inference (SI) tests only the $K_{\text{select}}$ hypotheses with $z_{i}>c_{i}$ for constants $c_{i}$ specified in advance. This kind of selection is very common in practice (e.g., publication bias), and it is called the file drawer problem by Rosenthal (1979). Instead of controlling the multiplicity of testing, SI alleviates it by reducing the number of tests. The mathematical formulation of SI is easier than that of FWER and FDR in the sense that hypotheses can be considered separately instead of simultaneously. Therefore, we simply write $z>c$ by dropping the index $i$ for one of the hypotheses. In selective inference, the selection bias is adjusted by considering the conditional probability given the selection event, which leads to the following $p$-value (Fithian, Sun and Taylor, 2014; Tian and Taylor, 2018)

$$p(z,c):=P(Z>z\mid Z>c,\theta=0)=\bar{\Phi}(z)/\bar{\Phi}(c), \tag{3}$$

where $p(z)$ of eq. (2) is divided by the selection probability $P(Z>c\mid\theta=0)=\bar{\Phi}(c)$. In the case of $c=0$, this corresponds to the two-tailed $z$-test, because the selection probability is $\bar{\Phi}(0)=0.5$ and $p(z,c)=2p(z)$. For significance level $\alpha$ (we use $\alpha=0.05$ unless otherwise stated), it properly controls the type-I error conditioned on the selection event as $P(p(Z,c)<\alpha\mid Z>c,\theta=0)=\alpha$, while the non-selective $p$-value fails to control the conditional type-I error, since $P(p(Z)<\alpha\mid Z>c,\theta=0)=\alpha/\bar{\Phi}(c)>\alpha$. The selection bias can be very large when $\bar{\Phi}(c)\ll 1$ (i.e., $c\gg 0$), or $K_{\text{select}}\ll K_{\text{all}}$.
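The two $p$-values above can be compared numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative and do not come from the paper or its software.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def upper_tail(x: float) -> float:
    """Upper tail probability of N(0, 1): 1 - Phi(x) = Phi(-x)."""
    return _STD_NORMAL.cdf(-x)


def nonselective_p(z: float) -> float:
    """One-tailed z-test p-value, p(z) = upper_tail(z)."""
    return upper_tail(z)


def selective_p(z: float, c: float) -> float:
    """Selective p-value p(z, c) = upper_tail(z) / upper_tail(c)."""
    if z <= c:
        raise ValueError("the hypothesis is tested only when z > c")
    return upper_tail(z) / upper_tail(c)


def conditional_error_of_nonselective(alpha: float, c: float) -> float:
    """P(p(Z) < alpha | Z > c, theta = 0), capped at 1 for large c."""
    return min(1.0, alpha / upper_tail(c))


if __name__ == "__main__":
    z = 2.0
    for c in (0.0, 1.0, 1.5):
        print(f"c={c}: p(z)={nonselective_p(z):.4f}, "
              f"p(z,c)={selective_p(z, c):.4f}, "
              f"conditional error of p(z) at alpha=0.05: "
              f"{conditional_error_of_nonselective(0.05, c):.4f}")
```

With $c=0$ the selective $p$-value is exactly twice the non-selective one, and the conditional rejection rate of the non-selective test grows as $c$ increases.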

Selective inference has been mostly developed for inferences after model selection (Taylor and Tibshirani, 2015; Tibshirani et al., 2016), particularly variable selection in regression settings such as lasso (Tibshirani, 1996). Recently, Terada and Shimodaira (2017) developed a general method for selective inference by adjusting the selection bias in the approximately unbiased (AU) $p$-value computed by the multiscale bootstrap method (Shimodaira, 2002, 2004, 2008). This new method can be used to compute, for example, confidence intervals of regression coefficients in lasso (figure 1). In this paper, we apply this method to phylogenetic inference for computing proper confidence levels of tree topologies (dendrograms) and edges (clades or clusters) of species. As far as we know, this is the first attempt to consider selective inference in phylogenetics. Our selective inference method is implemented in the software scaleboot (Shimodaira, 2019) working jointly with CONSEL (Shimodaira and Hasegawa, 2001) for phylogenetics, and it is also implemented in a new version of pvclust (Suzuki and Shimodaira, 2006) for hierarchical clustering, where only edges appearing in the observed tree are “selected” for computing $p$-values. Although our argument is based on the rigorous theory of mathematical statistics in Terada and Shimodaira (2017), a self-contained illustration is presented in this paper for the theory as well as the algorithm of selective inference.

Phylogenetic tree selection is an example of model selection. Since each tree can be specified as a combination of edges, tree selection can be interpreted as the variable selection of multiple regression, where edges correspond to the predictors of regression (Shimodaira, 2001; Shimodaira and Hasegawa, 2005). Because all candidate trees have the same number of model parameters, the maximum likelihood (ML) tree is obtained by comparing log-likelihood values of trees (Felsenstein, 1981). In order to adjust the model complexity by the number of parameters in general model selection, we compare Akaike Information Criterion (AIC) values of candidate models (Akaike, 1974). AIC is used in phylogenetics for selecting the substitution model (Posada and Buckley, 2004). Several modifications of AIC are also available for model selection. These include the precise estimation of the complexity term known as Takeuchi Information Criterion (Burnham and Anderson, 2002; Konishi and Kitagawa, 2008), and adaptations for incomplete data (Shimodaira and Maeda, 2018) and covariate-shift data (Shimodaira, 2000). AIC and all these modifications are derived for estimating the expected Kullback-Leibler divergence between the unknown true distribution and the estimated probability distribution on the premise that the model is misspecified. When using a regression model for prediction, it may be sufficient to find only the best model, which minimizes the AIC value. Considering random variations of the dataset, however, it is obvious in phylogenetics that the ML tree does not necessarily represent the true history of evolution. Therefore, Kishino and Hasegawa (1989) proposed a statistical test of whether two log-likelihood values differ significantly (also known as the Kishino-Hasegawa test). The log-likelihood difference is often not significant, because its variance can be very large for non-nested models when the divergence between two probability distributions is large; see eq. (26) in Section 6.1. The same idea of a model selection test of whether two AIC values differ significantly has been proposed independently in statistics (Linhart, 1988) and econometrics (Vuong, 1989). Another method of model selection test (Efron, 1984) allows for the comparison of two regression models with an adjusted bootstrap confidence interval corresponding to the AU $p$-value. For testing which model is better than the other, the null hypothesis in the model selection test is that the two models are equally good in terms of the expected value of AIC on the premise that both models are misspecified. Note that, in traditional hypothesis testing methods, including the likelihood ratio test for nested models and the modified likelihood ratio test for non-nested models (Cox, 1962), the null hypothesis is that the model is correctly specified. The model selection test is very different from these traditional settings. For comparing AIC values of more than two models, a multiple comparisons method is introduced to the model selection test (Shimodaira, 1998; Shimodaira and Hasegawa, 1999), which computes the confidence set of models. But the multiple comparisons method is conservative by nature, leading to more false negatives than expected, because it considers the worst scenario, called the least favorable configuration. On the other hand, the model selection test (designed for two models) and bootstrap probability (Felsenstein, 1985) lead to more false positives than expected when comparing more than two models (Shimodaira and Hasegawa, 1999; Shimodaira, 2002). The AU $p$-value mentioned earlier has been developed for solving this problem, and we are going to upgrade it for selective inference.

2 Phylogenetic Inference

For illustrating phylogenetic inference methods, we analyze a dataset consisting of mitochondrial protein sequences of six mammalian species with $n=3414$ amino acids ( $n$ is treated as sample size). The taxa are labelled as 1 $=$ Homo sapiens (human), 2 $=$ Phoca vitulina (seal), 3 $=$ Bos taurus (cow), 4 $=$ Oryctolagus cuniculus (rabbit), 5 $=$ Mus musculus (mouse), and 6 $=$ Didelphis virginiana (opossum). The dataset will be denoted as $\mathcal{X}_{n}=(\bm{x}_{1},\ldots,\bm{x}_{n})$ . The software package PAML (Yang, 1997) was used to calculate the site-wise log-likelihoods for trees. The mtREV model (Adachi and Hasegawa, 1996) was used for amino acid substitutions, and the site-heterogeneity was modeled by the discrete-gamma distribution (Yang, 1996). The dataset and evolutionary model are similar to previous publications (Shimodaira and Hasegawa, 1999; Shimodaira, 2001, 2002), thus allowing our proposed method to be easily compared with conventional methods.

The number of unrooted trees for six taxa is 105. These trees are reordered by their likelihood values and labelled as T1, T2, $\ldots$, T105. T1 is the ML tree as shown in figure 2 and its tree topology is represented as (((1(23))4)56). There are three internal branches (we call them edges) in T1, which are labelled as E1, E2 and E3. For example, E1 splits the six taxa as $\{23|1456\}$ and the partition of six taxa is represented as -++---, where +/- indicates taxa $1,\ldots,6$ from left to right and ++ indicates the clade $\{23\}$ (we set - for taxon 6, since it is treated as the outgroup). There are 25 edges in total, and each tree is specified by selecting three edges from them, although not all the combinations of three edges are allowed.
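The counts of 105 trees and 25 edges can be reproduced with a short sketch. It relies on two standard combinatorial facts that are not stated in the text: the number of unrooted binary trees on $N$ labelled taxa is $(2N-5)!!$, and every internal edge corresponds to a clade of 2 to $N-2$ ingroup taxa when the last taxon is the outgroup. The function names are illustrative.

```python
from itertools import combinations


def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted binary trees on n_taxa labelled leaves: (2N - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2N - 5
        count *= k
    return count


def clade_pattern(clade: set, n_taxa: int) -> str:
    """Encode a clade as a +/- string over taxa 1..n_taxa, left to right."""
    return "".join("+" if taxon in clade else "-"
                   for taxon in range(1, n_taxa + 1))


def all_edge_patterns(n_taxa: int) -> list:
    """All internal edges, with the last taxon treated as the outgroup ('-')."""
    ingroup = range(1, n_taxa)
    patterns = []
    for size in range(2, n_taxa - 1):  # clade sizes 2, ..., N - 2
        for clade in combinations(ingroup, size):
            patterns.append(clade_pattern(set(clade), n_taxa))
    return patterns


if __name__ == "__main__":
    print(num_unrooted_trees(6))       # number of candidate trees
    print(len(all_edge_patterns(6)))   # number of candidate edges
    # The three edges of T1 = (((1(23))4)56): clades {23}, {1234}, {123}.
    for clade in ({2, 3}, {1, 2, 3, 4}, {1, 2, 3}):
        print(clade_pattern(clade, 6))
```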

The result of phylogenetic analysis is summarized in table 1 for trees and table 2 for edges. Three types of $p$-values are computed for each tree as well as for each edge. BP is the bootstrap probability (Felsenstein, 1985) and AU is the approximately unbiased $p$-value (Shimodaira, 2002). Bootstrap probabilities are computed by the non-parametric bootstrap resampling (Efron, 1979) described in Section 6.1. The theory and the algorithm of BP and AU will be reviewed in Section 3. Since we are testing many trees and edges at the same time, there is potentially a danger of selection bias. The issue of selection bias has been discussed in Shimodaira and Hasegawa (1999) for introducing the method of multiple comparisons of log-likelihoods (also known as the Shimodaira-Hasegawa test) and in Shimodaira (2002) for introducing the AU test. However, these conventional methods only take care of the multiplicity of comparing many log-likelihood values for computing just one $p$-value instead of many $p$-values at the same time. Therefore, we intend to further adjust the AU $p$-value by introducing the selective inference $p$-value, denoted as SI. The theory and the algorithm of SI will be explained in Section 4 based on the geometric theory given in Section 3. After presenting the methods, we will revisit the phylogenetic inference in Section 4.3.

For developing the geometric theory in Sections 3 and 4, we formulate tree selection as the problem of regions (Efron, Halloran and Holmes, 1996; Efron and Tibshirani, 1998). For better understanding the geometric nature of the theory, the problem of regions is explained below for phylogenetic inference, although the algorithm is simple enough to be implemented without understanding the theory. Considering the space of probability distributions (Amari and Nagaoka, 2007), the parametric models for trees are represented as manifolds in the space. The dataset (or the empirical distribution) can also be represented as a “data point” $X$ in the space, and the ML estimates for trees are represented as projections to the manifolds. This is illustrated in the visualization of probability distributions of figure 3A using log-likelihood vectors of models (Shimodaira, 2001), where models are simply indicated as red lines from the origin; see Section 6.2 for details. This visualization may be called a model map. The point $X$ is actually reconstructed as the minimum full model containing all the trees as submodels, and the Kullback-Leibler divergence between probability distributions is represented as the squared distance between points; see eq. (27). Computation of $X$ is analogous to the Bayesian model averaging, but based on the ML method. For each tree, we can think of a region in the space so that this tree becomes the ML tree when $X$ is included in the region. The regions for T1, T2 and T3 are illustrated in figure 3B, and the region for E2 is the union of these three regions.

In figure 3A, $X$ is very far from any of the tree models, suggesting that all the models are wrong; the likelihood ratio statistic for testing T1 against the full model is 113.4, which is highly significant as $\chi^{2}_{8}$ (Shimodaira, 2001, Section 5). Instead of testing whether tree models are correct or not, we test whether models are significantly better than the others. As seen in figure 3B, $X$ is in the region for T1, meaning that the model for T1 is better than those for the other trees. For convenience, observing $X$ in the region for T1, we state that T1 is supported by the data. Similarly, $X$ is in the region for E2 that consists of the three regions for T1, T2, T3, thus indicating that E2 is supported by the data. Although T1 and E2 are supported by the data, there is still uncertainty as to whether the true evolutionary history of lineages is depicted because the location of $X$ fluctuates randomly. Therefore, statistical confidence of the outcome needs to be assessed. A mathematical procedure for statistically evaluating the outcome is provided in the following sections.

3 Non-Selective Inference for the Problem of Regions

3.1 The Problem of Regions

For developing the theory, we consider an $(m+1)$-dimensional multivariate normal random vector $\bm{Y}$, $m\geq 0$, with unknown mean vector $\bm{\mu}\in\mathbb{R}^{m+1}$ and the identity variance matrix $\bm{I}_{m+1}$:

$$\bm{Y}\sim N_{m+1}(\bm{\mu},\bm{I}_{m+1}). \tag{4}$$

A region of interest such as a tree or an edge is denoted as $\mathcal{R}\subset\mathbb{R}^{m+1}$, and its complement set is denoted as $\mathcal{R}^{C}=\mathbb{R}^{m+1}\setminus\mathcal{R}$. There are $K_{\text{all}}$ regions $\mathcal{R}_{i}$, $i=1,\ldots,K_{\text{all}}$, and we simply write $\mathcal{R}$ for one of them by dropping the index $i$. Observing $\bm{Y}=\bm{y}$, the null hypothesis $H_{0}:\bm{\mu}\in\mathcal{R}$ is tested against the alternative hypothesis $H_{1}:\bm{\mu}\in\mathcal{R}^{C}$. This setting is called the problem of regions, and the geometric theory for non-selective inference for slightly generalized settings (e.g., exponential family of distributions) has been discussed in Efron and Tibshirani (1998); Shimodaira (2004). This theory allows an arbitrary shape of $\mathcal{R}$ without assuming a particular shape such as a half-space or sphere, and only requires the expression (29) of Section 6.3.

The problem of regions is well described by geometric quantities (figure 4). Let $\bm{\hat{\mu}}$ be the projection of $\bm{y}$ to the boundary surface $\partial\mathcal{R}$ defined as

$$\bm{\hat{\mu}}=\operatorname*{arg\,min}_{\bm{\mu}\in\partial\mathcal{R}}\|\bm{y}-\bm{\mu}\|,$$

and $\beta_{0}$ be the signed distance defined as $\beta_{0}=\|\bm{y}-\bm{\hat{\mu}}\|>0$ for $\bm{y}\in\mathcal{R}^{C}$ and $\beta_{0}=-\|\bm{y}-\bm{\hat{\mu}}\|\leq 0$ for $\bm{y}\in\mathcal{R}$ ; see figures 4A and 4B, respectively. A large $\beta_{0}$ indicates the evidence for rejecting $H_{0}:\bm{\mu}\in\mathcal{R}$ , but computation of $p$ -value will also depend on the shape of $\mathcal{R}$ . There should be many parameters for defining the shape, but we only need the mean curvature of $\partial\mathcal{R}$ at $\bm{\hat{\mu}}$ , which represents the amount of surface bending. It is denoted as $\beta_{1}\in\mathbb{R}$ , and defined in (30).
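As a concrete illustration of the projection and the signed distance, consider a toy region that is not from the paper: a closed ball of radius $r$ centred at the origin, whose boundary is a sphere. The nearest boundary point to $\bm{y}$ is then $r\bm{y}/\|\bm{y}\|$. The sketch below assumes this hypothetical region and uses illustrative function names; it does not compute the mean curvature $\beta_{1}$.

```python
import math


def project_to_sphere(y, radius):
    """Nearest point to y on the sphere of the given radius (boundary of the ball)."""
    norm = math.hypot(*y)
    if norm == 0.0:
        raise ValueError("the projection of the centre is not unique")
    return [radius * v / norm for v in y]


def signed_distance_ball(y, radius):
    """beta_0 for the ball region: positive outside, non-positive inside."""
    mu_hat = project_to_sphere(y, radius)
    dist = math.dist(y, mu_hat)
    inside = math.hypot(*y) <= radius
    return -dist if inside else dist


if __name__ == "__main__":
    print(signed_distance_ball([3.0, 4.0], 2.0))  # y outside the region
    print(signed_distance_ball([0.6, 0.8], 2.0))  # y inside the region
```

For the regions of trees and edges the boundary has no closed form, which is why the paper estimates $\beta_{0}$ and $\beta_{1}$ from bootstrap probabilities instead of computing the projection directly.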

Geometric quantities $\beta_{0}$ and $\beta_{1}$ of regions for trees (T1 $,\ldots,$ T105) and edges (E1 $,\ldots,$ E25) are plotted in figure 5, and these values are also found in tables 1 and 2. Although the phylogenetic model of evolution for the molecular dataset $\mathcal{X}_{n}=(\bm{x}_{1},\ldots,\bm{x}_{n})$ is different from the multivariate normal model (4) for $\bm{y}$ , the multiscale bootstrap method of Section 3.4 estimates $\beta_{0}$ and $\beta_{1}$ using the non-parametric bootstrap probabilities (Section 6.1) with bootstrap replicates $\mathcal{X}^{*}_{n^{\prime}}$ for several values of sample size $n^{\prime}$ .
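The estimation step can be sketched as a regression on the scaling law listed among the paper's equations, $\sigma\bar{\Phi}^{-1}(\text{BP}_{\sigma^{2}})\simeq\beta_{0}+\beta_{1}\sigma^{2}$. The sketch below makes several assumptions that should be read as such: the scale is taken as $\sigma^{2}=n/n^{\prime}$ following the multiscale bootstrap literature, ordinary least squares replaces whatever weighted fit the actual software uses, and the bootstrap probabilities are synthetic values generated from the scaling law itself rather than from resampled sequences.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def psi(bp: float, sigma2: float) -> float:
    """Normalized bootstrap z-value: sigma * inverse upper tail of bp (0 < bp < 1)."""
    sigma = sigma2 ** 0.5
    return sigma * _STD_NORMAL.inv_cdf(1.0 - bp)


def fit_scaling_law(sigma2_values, bp_values):
    """Ordinary least squares fit of psi = beta0 + beta1 * sigma^2 (a simplification)."""
    xs = list(sigma2_values)
    ys = [psi(bp, s2) for bp, s2 in zip(bp_values, xs)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def synthetic_bp(beta0: float, beta1: float, sigma2: float) -> float:
    """Bootstrap probability implied by the scaling law (synthetic, for illustration)."""
    sigma = sigma2 ** 0.5
    return _STD_NORMAL.cdf(-(beta0 / sigma + beta1 * sigma))


if __name__ == "__main__":
    scales = [0.5, 0.75, 1.0, 1.5, 2.0]  # hypothetical values of sigma^2 = n / n'
    bps = [synthetic_bp(0.30, 0.22, s2) for s2 in scales]
    print(fit_scaling_law(scales, bps))
```

Because the synthetic probabilities follow the scaling law exactly, the fit recovers the generating values; with real bootstrap counts the recovered values carry sampling error, which is what the standard errors in tables 1 and 2 report.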

3.2 Bootstrap Probability

For simulating (4) from $\bm{y}$ , we may generate replicates $\bm{Y}^{*}$ from the bootstrap distribution (figure 4C)

$$\bm{Y}^{*}\sim N_{m+1}(\bm{y},\bm{I}_{m+1}), \tag{5}$$

and define bootstrap probability (BP) of $\mathcal{R}$ as the probability of $\bm{Y}^{*}$ being included in the region $\mathcal{R}$ :

$$\text{BP}(\mathcal{R}|\bm{y}):=P(\bm{Y}^{*}\in\mathcal{R}\mid\bm{y}). \tag{6}$$

$\text{BP}(\mathcal{R}|\bm{y})$ can be interpreted as the Bayesian posterior probability $P(\bm{\mu}\in\mathcal{R}|\bm{y})$ , because, by assuming the flat prior distribution $\pi(\bm{\mu})=$ constant, the posterior distribution $\bm{\mu}|\bm{y}\sim N_{m+1}(\bm{y},\bm{I}_{m+1})$ is identical to the distribution of $\bm{Y}^{*}$ in (5). An interesting consequence of the geometric theory of Efron and Tibshirani (1998) is that BP can be expressed as

$$\text{BP}(\mathcal{R}|\bm{y})\simeq\bar{\Phi}(\beta_{0}+\beta_{1}),\tag{7}$$

where $\simeq$ indicates the second order asymptotic accuracy, meaning that the equality is correct up to $O_{p}(n^{-1/2})$ with error of order $O_{p}(n^{-1})$ ; see Section 6.3.

For understanding the formula (7), assume that $\mathcal{R}$ is a half space so that $\partial\mathcal{R}$ is flat and $\beta_{1}=0$ . Since we only have to look at the axis orthogonal to $\partial\mathcal{R}$ , the distribution of signed distance is identified as (1) with $\beta_{0}=z$ . The bootstrap distribution for (1) is $Z^{*}\sim N(z,1)$ , and bootstrap probability is expressed as $P(Z^{*}\leq 0|z)=\bar{\Phi}(z)$ . Therefore, we have $\text{BP}(\mathcal{R}|\bm{y})=\bar{\Phi}(\beta_{0})$ . For general $\mathcal{R}$ with curved $\partial\mathcal{R}$ , the formula (7) adjusts the bias caused by $\beta_{1}$ . As seen in figure 4C, $\mathcal{R}$ becomes smaller for $\beta_{1}>0$ than $\beta_{1}=0$ , and BP becomes smaller.
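The half-space case can be checked numerically. The following is a minimal sketch, not part of the original analysis: it simulates $Z^{*}\sim N(z,1)$ and compares the fraction of replicates with $Z^{*}\leq 0$ against $\bar{\Phi}(z)$ . The function names, the seed and the number of replicates are illustrative assumptions.

```python
import random
from statistics import NormalDist


def upper_tail(x):
    """Upper-tail probability of N(0, 1), written as Phi-bar in the text."""
    return NormalDist().cdf(-x)


def bp_half_space(z, n_rep=200_000, seed=0):
    """Monte Carlo bootstrap probability of the half space {Z* <= 0}."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_rep) if rng.gauss(z, 1.0) <= 0.0)
    return hits / n_rep


z = 1.0                      # observed signed distance (beta_0 = z, beta_1 = 0)
bp_mc = bp_half_space(z)     # simulated BP
bp_formula = upper_tail(z)   # closed form for a flat boundary
print(f"simulated BP = {bp_mc:.4f}, formula = {bp_formula:.4f}")
```

The two numbers should agree up to Monte Carlo error, which is about $10^{-3}$ for this number of replicates.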

BP of $\mathcal{R}^{C}$ is closely related to BP of $\mathcal{R}$ . From the definition,

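$$\text{BP}(\mathcal{R}^{C}|\bm{y})=1-\text{BP}(\mathcal{R}|\bm{y})\simeq 1-\bar{\Phi}(\beta_{0}+\beta_{1})=\bar{\Phi}(-\beta_{0}-\beta_{1}).\tag{8}$$

The last expression also implies that the signed distance and the mean curvature of $\mathcal{R}^{C}$ are $-\beta_{0}$ and $-\beta_{1}$ , respectively; this relation is also obtained by reversing the sign of $v$ in (29).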

3.3 Approximately Unbiased Test

Although $\text{BP}(\mathcal{R}|\bm{y})$ may work as a Bayesian confidence measure, we would like to have a frequentist confidence measure for testing $H_{0}:\bm{\mu}\in\mathcal{R}$ against $H_{1}:\bm{\mu}\in\mathcal{R}^{C}$ . Denote the signed distance of $\bm{Y}$ by $\beta_{0}(\bm{Y})$ , and consider the region $\{\bm{Y}\mid\beta_{0}(\bm{Y})>\beta_{0}\}$ in which the signed distance is larger than the observed value $\beta_{0}=\beta_{0}(\bm{y})$ . Similar to (2), we then define an approximately unbiased (AU) $p$ -value as

$$\text{AU}(\mathcal{R}|\bm{y}):=\text{BP}(\{\bm{Y}\mid\beta_{0}(\bm{Y})>\beta_{0}\}|\bm{\hat{\mu}}),\tag{9}$$

where the probability is calculated for $\bm{Y}\sim N_{m+1}(\bm{\hat{\mu}},\bm{I}_{m+1})$ as illustrated in figure 4D. The shape of the region $\{\bm{Y}\mid\beta_{0}(\bm{Y})>\beta_{0}\}$ is very similar to the shape of $\mathcal{R}^{C}$ ; the difference is in fact only $O_{p}(n^{-1})$ . Let us think of a point $\bm{y}^{\prime}$ with signed distance $-\beta_{0}$ (shown as $\bm{y}$ in figure 4B). Then we have

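$$\text{AU}(\mathcal{R}|\bm{y})\simeq\text{BP}(\mathcal{R}^{C}|\bm{y}^{\prime})\simeq\bar{\Phi}(\beta_{0}-\beta_{1}),\tag{10}$$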

where the last expression is obtained by substituting $(-\beta_{0},\beta_{1})$ for $(\beta_{0},\beta_{1})$ in (8). This formula computes AU from $(\beta_{0},\beta_{1})$ . An intuitive interpretation of (10) is explained in Section 6.4.

In non-selective inference, $p$ -values are computed using formula (10). If $\text{AU}(\mathcal{R}|\bm{y})<\alpha$ , the null hypothesis $H_{0}:\bm{\mu}\in\mathcal{R}$ is rejected and the alternative hypothesis $H_{1}:\bm{\mu}\in\mathcal{R}^{C}$ is accepted. This test procedure is approximately unbiased, because it controls the non-selective type-I error as

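$$P(\text{AU}(\mathcal{R}|\bm{Y})<\alpha\mid\bm{\mu})\simeq\alpha\quad\text{for }\bm{\mu}\in\partial\mathcal{R},\tag{11}$$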

and the rejection probability increases as $\bm{\mu}$ moves away from $\mathcal{R}$ , while it decreases as $\bm{\mu}$ moves into $\mathcal{R}$ .

Exchanging the roles of $\mathcal{R}$ and $\mathcal{R}^{C}$ also allows for another hypothesis testing. AU of $\mathcal{R}^{C}$ is obtained from (9) by reversing the inequality as $\text{AU}(\mathcal{R}^{C}|\bm{y})=\text{BP}(\{\bm{Y}\mid\beta_{0}(\bm{Y})<\beta_{0}\}|\bm{\hat{\mu}})=1-\text{AU}(\mathcal{R}|\bm{y})$ . This is also confirmed by substituting $(-\beta_{0},-\beta_{1})$ , i.e., the geometric quantities of $\mathcal{R}^{C}$ , for $(\beta_{0},\beta_{1})$ in (10) as

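$$\text{AU}(\mathcal{R}^{C}|\bm{y})\simeq\bar{\Phi}(-\beta_{0}+\beta_{1})=1-\bar{\Phi}(\beta_{0}-\beta_{1})\simeq 1-\text{AU}(\mathcal{R}|\bm{y}).\tag{12}$$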

If $\text{AU}(\mathcal{R}^{C}|\bm{y})<\alpha$ or equivalently $\text{AU}(\mathcal{R}|\bm{y})>1-\alpha$ , then we reject $H_{0}:\bm{\mu}\in\mathcal{R}^{C}$ and accept $H_{1}:\bm{\mu}\in\mathcal{R}$ .
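As a small illustration of the non-selective $p$ -values, the sketch below evaluates BP and AU from a pair $(\beta_{0},\beta_{1})$ , assuming the forms $\text{BP}\simeq\bar{\Phi}(\beta_{0}+\beta_{1})$ and $\text{AU}\simeq\bar{\Phi}(\beta_{0}-\beta_{1})$ used in this section. The numerical values of $\beta_{0}$ and $\beta_{1}$ are hypothetical and are not taken from the tables.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def upper_tail(x):
    """Phi-bar(x) = 1 - Phi(x), computed as Phi(-x) for numerical accuracy."""
    return _STD_NORMAL.cdf(-x)


def bp(beta0, beta1):
    """Bootstrap probability of the region R."""
    return upper_tail(beta0 + beta1)


def au(beta0, beta1):
    """Approximately unbiased p-value for H0: mu in R."""
    return upper_tail(beta0 - beta1)


def au_complement(beta0, beta1):
    """AU p-value for H0: mu in R^C, using (-beta0, -beta1) for R^C."""
    return upper_tail(-beta0 + beta1)


beta0, beta1 = 2.0, 0.5          # hypothetical values with y outside R
alpha = 0.05
p_bp, p_au = bp(beta0, beta1), au(beta0, beta1)
print(f"BP = {p_bp:.4f}, AU = {p_au:.4f}, reject by AU: {p_au < alpha}")
```

With these values BP falls below $\alpha=0.05$ while AU does not, which is the kind of disagreement that a positive $\beta_{1}$ produces.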

3.4 Multiscale Bootstrap

In order to estimate $\beta_{0}$ and $\beta_{1}$ from bootstrap probabilities, we consider a generalization of (5) as

$$\bm{Y}^{*}\sim N_{m+1}(\bm{y},\sigma^{2}\bm{I}_{m+1})\tag{13}$$

for a variance $\sigma^{2}>0$ , and define multiscale bootstrap probability of $\mathcal{R}$ as

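$$\text{BP}_{\sigma^{2}}(\mathcal{R}|\bm{y}):=P_{\sigma^{2}}(\bm{Y}^{*}\in\mathcal{R}\mid\bm{y}),\tag{14}$$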

where $P_{\sigma^{2}}$ indicates the probability with respect to (13).

Although our theory is based on the multivariate normal model, the actual implementation of the algorithm uses the non-parametric bootstrap probabilities in Section 6.1. To fill the gap between the two models, we consider a non-linear transformation $\bm{f}_{n}$ so that the multivariate normal model holds at least approximately for $\bm{y}=\bm{f}_{n}(\mathcal{X}_{n})$ and $\bm{Y}^{*}=\bm{f}_{n}(\mathcal{X}^{*}_{n^{\prime}})$ . An example of $\bm{f}_{n}$ is given in (25) for phylogenetic inference. Surprisingly, a specification of $\bm{f}_{n}$ is not required for computing $p$ -values; we simply assume the existence of such a transformation. This property may be called the “bootstrap trick”. For phylogenetic inference, we compute the non-parametric bootstrap probabilities by (24) and substitute these values for (14) with $\sigma^{2}=n/n^{\prime}$ .

For estimating $\beta_{0}$ and $\beta_{1}$ , we need a scaling law that explains how $\text{BP}_{\sigma^{2}}$ depends on the scale $\sigma$ . We rescale (13) by multiplying by $\sigma^{-1}$ so that $\sigma^{-1}\bm{Y}^{*}\sim N_{m+1}(\sigma^{-1}\bm{y},\bm{I}_{m+1})$ has unit variance. $\bm{y}$ and $\mathcal{R}$ are now rescaled by the factor $\sigma^{-1}$ , which amounts to signed distance $\beta_{0}\sigma^{-1}$ and mean curvature $\beta_{1}\sigma$ (Shimodaira, 2004). Therefore, by substituting $(\beta_{0}\sigma^{-1},\beta_{1}\sigma)$ for $(\beta_{0},\beta_{1})$ in (7), we obtain

$$\text{BP}_{\sigma^{2}}(\mathcal{R}|\bm{y})\simeq\bar{\Phi}(\beta_{0}\sigma^{-1}+\beta_{1}\sigma).\tag{15}$$

For better illustrating how $\text{BP}_{\sigma^{2}}$ depends on $\sigma^{2}$ , we define

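$$\psi_{\sigma^{2}}(\mathcal{R}|\bm{y}):=\sigma\bar{\Phi}^{-1}(\text{BP}_{\sigma^{2}}(\mathcal{R}|\bm{y}))\simeq\beta_{0}+\beta_{1}\sigma^{2}.\tag{16}$$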

We can estimate $\beta_{0}$ and $\beta_{1}$ as regression coefficients by fitting the linear model (16) in terms of $\sigma^{2}$ to the observed values of non-parametric bootstrap probabilities (figure 6). Interestingly, (10) is rewritten as $\text{AU}(\mathcal{R}|\bm{y})\simeq\bar{\Phi}(\psi_{-1}(\mathcal{R}|\bm{y}))$ by formally letting $\sigma^{2}=-1$ in the last expression of (16), meaning that AU corresponds to $n^{\prime}=-n$ . Although $\sigma^{2}$ should be positive in (15), we can think of negative $\sigma^{2}$ in $\beta_{0}+\beta_{1}\sigma^{2}$ . See Section 6.5 for details of model fitting and extrapolation to negative $\sigma^{2}$ .
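The regression step can be sketched as follows. This is a minimal illustration under stated assumptions: the multiscale bootstrap probabilities are generated without noise from the scaling law $\bar{\Phi}(\beta_{0}\sigma^{-1}+\beta_{1}\sigma)$ with hypothetical true values, and an unweighted least-squares line is fitted to $\psi_{\sigma^{2}}=\sigma\bar{\Phi}^{-1}(\text{BP}_{\sigma^{2}})$ . Actual software may weight the observations or fit other models, so this is not a description of any particular package.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def upper_tail(x):
    return _STD_NORMAL.cdf(-x)


def upper_tail_inv(p):
    """Inverse of Phi-bar; requires 0 < p < 1."""
    return -_STD_NORMAL.inv_cdf(p)


def fit_beta(sigma2_values, bp_values):
    """Least-squares fit of psi = beta0 + beta1 * sigma^2."""
    psi = [math.sqrt(s2) * upper_tail_inv(p) for s2, p in zip(sigma2_values, bp_values)]
    k = len(psi)
    mean_x = sum(sigma2_values) / k
    mean_y = sum(psi) / k
    sxx = sum((x - mean_x) ** 2 for x in sigma2_values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sigma2_values, psi))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Thirteen scales, equally spaced in log scale from 1/9 to 9.
sigma2_values = [9.0 ** (k / 6 - 1) for k in range(13)]

true_beta0, true_beta1 = 1.0, 0.5     # hypothetical geometric quantities
bp_values = [
    upper_tail(true_beta0 / math.sqrt(s2) + true_beta1 * math.sqrt(s2))
    for s2 in sigma2_values
]

beta0_hat, beta1_hat = fit_beta(sigma2_values, bp_values)
au_hat = upper_tail(beta0_hat - beta1_hat)   # fitted line evaluated at sigma^2 = -1
print(beta0_hat, beta1_hat, au_hat)
```

With real data each $\text{BP}_{\sigma^{2}}$ is a frequency from finitely many replicates, so the fitted coefficients carry sampling error, and a frequency of exactly 0 or 1 cannot be transformed by $\bar{\Phi}^{-1}$ .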

4 Selective Inference for the Problem of Regions

4.1 Approximately Unbiased Test for Selective Inference

In order to discuss selective inference for the problem of regions, we have to specify the selection event. Let us consider a selective region $\mathcal{S}\subset\mathbb{R}^{m+1}$ so that we perform the hypothesis testing only when $\bm{y}\in\mathcal{S}$ . Terada and Shimodaira (2017) considered a general shape of $\mathcal{S}$ , but here we treat only two special cases of $\mathcal{S}=\mathcal{R}^{C}$ and $\mathcal{S}=\mathcal{R}$ ; see Section 6.6. Our problem is formulated as follows. Observing $\bm{Y}=\bm{y}$ from the multivariate normal model (4), we first check whether $\bm{y}\in\mathcal{R}^{C}$ or $\bm{y}\in\mathcal{R}$ . If $\bm{y}\in\mathcal{R}^{C}$ and we are interested in the null hypothesis $H_{0}:\bm{\mu}\in\mathcal{R}$ , then we may test it against the alternative hypothesis $H_{1}:\bm{\mu}\in\mathcal{R}^{C}$ . If $\bm{y}\in\mathcal{R}$ and we are interested in the null hypothesis $H_{0}:\bm{\mu}\in\mathcal{R}^{C}$ , then we may test it against the alternative hypothesis $H_{1}:\bm{\mu}\in\mathcal{R}$ . In this paper, the former case ( $\bm{y}\in\mathcal{R}^{C}$ , and so $\beta_{0}>0$ ) is called outside mode, and the latter case ( $\bm{y}\in\mathcal{R}$ , and so $\beta_{0}\leq 0$ ) is called inside mode. We do not know which of the two modes of testing is performed until we observe $\bm{y}$ .

Let us consider the outside mode by assuming that $\bm{y}\in\mathcal{R}^{C}$ , where $\beta_{0}>0$ . Recalling that $p(z,c)=p(z)/\bar{\Phi}(c)$ in Section 1, we divide $\text{AU}(\mathcal{R}|\bm{y})$ by the selection probability to define a selective inference $p$ -value as

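$$\text{SI}(\mathcal{R}|\bm{y}):=\frac{\text{AU}(\mathcal{R}|\bm{y})}{\text{BP}(\mathcal{R}^{C}|\bm{\hat{\mu}})}.\tag{17}$$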

From the definition, $\text{SI}(\mathcal{R}|\bm{y})\in(0,1)$ , because $\{\bm{Y}\mid\beta_{0}(\bm{Y})>\beta_{0}\}\subset\mathcal{R}^{C}$ for $\beta_{0}>0$ . This $p$ -value is computed from $(\beta_{0},\beta_{1})$ by

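$$\text{SI}(\mathcal{R}|\bm{y})\simeq\frac{\bar{\Phi}(\beta_{0}-\beta_{1})}{\bar{\Phi}(-\beta_{1})},\tag{18}$$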

where $\text{BP}(\mathcal{R}^{C}|\bm{\hat{\mu}})=\bar{\Phi}(-\beta_{1})$ is obtained by substituting $(0,\beta_{1})$ for $(\beta_{0},\beta_{1})$ in (8). An intuitive justification of (18) is explained in Section 6.4.

For the outside mode of selective inference, $p$ -values are computed using formula (18). If $\text{SI}(\mathcal{R}|\bm{y})<\alpha$ , then reject $H_{0}:\bm{\mu}\in\mathcal{R}$ and accept $H_{1}:\bm{\mu}\in\mathcal{R}^{C}$ . This test procedure is approximately unbiased, because it controls the selective type-I error as

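$$P(\text{SI}(\mathcal{R}|\bm{Y})<\alpha\mid\bm{Y}\in\mathcal{R}^{C},\bm{\mu})\simeq\alpha\quad\text{for }\bm{\mu}\in\partial\mathcal{R},\tag{19}$$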

and the rejection probability increases as $\bm{\mu}$ moves away from $\mathcal{R}$ , while it decreases as $\bm{\mu}$ moves into $\mathcal{R}$ .

Now we consider the inside mode by assuming that $\bm{y}\in\mathcal{R}$ , where $\beta_{0}\leq 0$ . SI of $\mathcal{R}^{C}$ is obtained from (17) by exchanging the roles of $\mathcal{R}$ and $\mathcal{R}^{C}$ .

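$$\text{SI}(\mathcal{R}^{C}|\bm{y}):=\frac{\text{AU}(\mathcal{R}^{C}|\bm{y})}{\text{BP}(\mathcal{R}|\bm{\hat{\mu}})}\simeq\frac{\bar{\Phi}(-\beta_{0}+\beta_{1})}{\bar{\Phi}(\beta_{1})}.\tag{20}$$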

For the inside mode of selective inference, $p$ -values are computed using formula (20). If $\text{SI}(\mathcal{R}^{C}|\bm{y})<\alpha$ , then reject $H_{0}:\bm{\mu}\in\mathcal{R}^{C}$ and accept $H_{1}:\bm{\mu}\in\mathcal{R}$ . Unlike the non-selective $p$ -value $\text{AU}(\mathcal{R}^{C}|\bm{y})$ , $\text{SI}(\mathcal{R}^{C}|\bm{y})<\alpha$ is not equivalent to $\text{SI}(\mathcal{R}|\bm{y})>1-\alpha$ , because $\text{SI}(\mathcal{R}|\bm{y})+\text{SI}(\mathcal{R}^{C}|\bm{y})\neq 1$ . For convenience, we define

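$$\text{SI}^{\prime}(\mathcal{R}|\bm{y}):=\begin{cases}\text{SI}(\mathcal{R}|\bm{y})&\bm{y}\in\mathcal{R}^{C}\\ 1-\text{SI}(\mathcal{R}^{C}|\bm{y})&\bm{y}\in\mathcal{R}\end{cases}\tag{21}$$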

so that $\text{SI}^{\prime}>1-\alpha$ implies $\text{SI}(\mathcal{R}^{C}|\bm{y})<\alpha$ . In our numerical examples of figure 5, tables 1 and 2, $\text{SI}^{\prime}$ is simply denoted as SI. We do not need to consider (21) for BP and AU, because $\text{BP}^{\prime}(\mathcal{R}|\bm{y})=\text{BP}(\mathcal{R}|\bm{y})$ and $\text{AU}^{\prime}(\mathcal{R}|\bm{y})=\text{AU}(\mathcal{R}|\bm{y})$ from (8) and (12).
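The two modes can be put side by side in a short sketch. It assumes the forms $\text{SI}(\mathcal{R}|\bm{y})\simeq\bar{\Phi}(\beta_{0}-\beta_{1})/\bar{\Phi}(-\beta_{1})$ for outside mode and $\text{SI}(\mathcal{R}^{C}|\bm{y})\simeq\bar{\Phi}(-\beta_{0}+\beta_{1})/\bar{\Phi}(\beta_{1})$ for inside mode, and it takes $\text{SI}^{\prime}$ to be $\text{SI}(\mathcal{R}|\bm{y})$ in outside mode and $1-\text{SI}(\mathcal{R}^{C}|\bm{y})$ in inside mode, as the surrounding text implies. Function names and input values are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def upper_tail(x):
    return _STD_NORMAL.cdf(-x)


def si_outside(beta0, beta1):
    """Selective p-value for H0: mu in R, valid when y is outside R (beta0 > 0)."""
    if beta0 <= 0:
        raise ValueError("outside mode requires beta0 > 0")
    return upper_tail(beta0 - beta1) / upper_tail(-beta1)


def si_inside(beta0, beta1):
    """Selective p-value for H0: mu in R^C, valid when y is inside R (beta0 <= 0)."""
    if beta0 > 0:
        raise ValueError("inside mode requires beta0 <= 0")
    return upper_tail(-beta0 + beta1) / upper_tail(beta1)


def si_prime(beta0, beta1):
    """Single reporting scale: small means R rejected, large means R supported."""
    if beta0 > 0:
        return si_outside(beta0, beta1)
    return 1.0 - si_inside(beta0, beta1)


# Hypothetical outside-mode and inside-mode cases with the same curvature.
au_out, si_out = upper_tail(1.5 - 0.5), si_prime(1.5, 0.5)
au_in, si_in = upper_tail(-2.0 - 0.5), si_prime(-2.0, 0.5)
print(f"outside: AU = {au_out:.4f}, SI = {si_out:.4f}")
print(f"inside:  AU = {au_in:.4f}, SI' = {si_in:.4f}")
```

In both cases the selective value is the more conservative one: it is larger than AU when testing for rejection and smaller than AU when testing for support.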

4.2 Shortcut Computation of SI

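We can compute SI from BP and AU. This will be useful for reanalyzing previously published results. Let us write $\text{BP}=\text{BP}(\mathcal{R}|\bm{y})$ and $\text{AU}=\text{AU}(\mathcal{R}|\bm{y})$ . From (7) and (10), we have

$$\beta_{0}=\tfrac{1}{2}\bigl(\bar{\Phi}^{-1}(\text{BP})+\bar{\Phi}^{-1}(\text{AU})\bigr),\quad\beta_{1}=\tfrac{1}{2}\bigl(\bar{\Phi}^{-1}(\text{BP})-\bar{\Phi}^{-1}(\text{AU})\bigr).$$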

We can compute SI from $\beta_{0}$ and $\beta_{1}$ by (18) or (20). More directly, we may compute

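$$\text{SI}(\mathcal{R}|\bm{y})\simeq\frac{\text{AU}}{\bar{\Phi}\bigl\{\tfrac{1}{2}\bigl(\bar{\Phi}^{-1}(\text{AU})-\bar{\Phi}^{-1}(\text{BP})\bigr)\bigr\}},\quad\text{SI}(\mathcal{R}^{C}|\bm{y})\simeq\frac{1-\text{AU}}{\bar{\Phi}\bigl\{\tfrac{1}{2}\bigl(\bar{\Phi}^{-1}(\text{BP})-\bar{\Phi}^{-1}(\text{AU})\bigr)\bigr\}}.$$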
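For reanalyzing published BP and AU values, the shortcut can be sketched as below. The sketch inverts $\text{BP}\simeq\bar{\Phi}(\beta_{0}+\beta_{1})$ and $\text{AU}\simeq\bar{\Phi}(\beta_{0}-\beta_{1})$ to recover $(\beta_{0},\beta_{1})$ and then returns the selective value on the $\text{SI}^{\prime}$ scale. The function name and the input values are hypothetical; the inputs are generated from known $(\beta_{0},\beta_{1})$ only so that the round trip can be checked.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def upper_tail(x):
    return _STD_NORMAL.cdf(-x)


def upper_tail_inv(p):
    """Inverse of Phi-bar; raises an error unless 0 < p < 1."""
    return -_STD_NORMAL.inv_cdf(p)


def si_from_bp_au(bp, au):
    """Return (beta0, beta1, SI') computed from a reported BP and AU."""
    q_bp, q_au = upper_tail_inv(bp), upper_tail_inv(au)
    beta0 = 0.5 * (q_bp + q_au)
    beta1 = 0.5 * (q_bp - q_au)
    if beta0 > 0:
        si = au / upper_tail(-beta1)                 # outside mode
    else:
        si = 1.0 - (1.0 - au) / upper_tail(beta1)    # inside mode
    return beta0, beta1, si


# Round-trip check with hypothetical geometric quantities (1.5, 0.4).
bp_value = upper_tail(1.5 + 0.4)
au_value = upper_tail(1.5 - 0.4)
beta0_hat, beta1_hat, si_value = si_from_bp_au(bp_value, au_value)
print(beta0_hat, beta1_hat, si_value)
```

Reported values that have been rounded to 0 or 1 cannot be inverted, so such entries would have to be left out of a reanalysis.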

4.3 Revisiting the Phylogenetic Inference

In this section, the analytical procedure outlined in Section 2 is used to determine relationships among human, mouse, and rabbit. The question is: which of mouse and human is closer to rabbit? The traditional view (Novacek, 1992) supports E6, the clade of rabbit and mouse, which is consistent with T4, T5 and T7. Based on molecular analysis, Graur, Duret and Gouy (1996) strongly suggested that rabbit is closer to human than mouse, thus supporting E2, which is consistent with T1, T2 and T3. However, Halanych (1998) criticized this conclusion by pointing out that E2 is an artifact caused by the long branch attraction (LBA) between mouse and opossum. In addition, Shimodaira and Hasegawa (1999); Shimodaira (2002) suggested that T7 is not rejected by multiplicity adjusted tests. Shimodaira and Hasegawa (2005) showed that T7 becomes the ML tree by resolving the LBA using a larger dataset with more taxa. Although T1 is the ML tree based on the dataset with fewer taxa, T7 is presumably the true tree as indicated by later research. With these observations in mind, we retrospectively interpret $p$ -values in tables 1 and 2.

The results are shown below for the two test modes (inside and outside) as defined in Section 4.1. The extent of multiplicity and selection bias depends on the number of regions under consideration, thus these numbers are considered for interpreting the results. The numbers of regions related to trees and edges are summarized in table 3; see Section 6.7 for details.

In inside mode, the null hypothesis $H_{0}:\bm{\mu}\in\mathcal{R}_{i}^{C}$ is tested against the alternative hypothesis $H_{1}:\bm{\mu}\in\mathcal{R}_{i}$ for $\bm{y}\in\mathcal{R}_{i}$ (i.e., $\beta_{0}\leq 0$ ). This applies to the regions for T1, E1, E2 and E3, and they are supported by the data in the sense mentioned in the last paragraph of Section 2. When $H_{0}$ is rejected by a test procedure, it is claimed that $\mathcal{R}_{i}$ is significantly supported by the data, indicating that $H_{1}$ holds. For convenience, we phrase the null hypothesis $H_{0}$ as “E1 is not true” and the alternative hypothesis $H_{1}$ as “E1 is true”; rejection of $H_{0}$ then implies that E1 is true. This procedure looks unusual, but makes sense when both $\mathcal{R}_{i}$ and $\mathcal{R}_{i}^{C}$ are regions with nonzero volume. Note that selection bias can be very large in the sense that $K_{\text{select}}/K_{\text{all}}\approx 0$ for many taxa, and non-selective tests may lead to many false positives because $K_{\text{true}}/K_{\text{all}}\approx 1$ . Therefore selective inference should be used in inside mode.

In outside mode, the null hypothesis $H_{0}:\bm{\mu}\in\mathcal{R}_{i}$ is tested against the alternative hypothesis $H_{1}:\bm{\mu}\in\mathcal{R}_{i}^{C}$ for $\bm{y}\in\mathcal{R}_{i}^{C}$ (i.e., $\beta_{0}>0$ ). This applies to the regions for T2, …, T105, and E4, …, E25, and they are not supported by the data. When $H_{0}$ is rejected by a test procedure, it is claimed that $\mathcal{R}_{i}$ is rejected. For convenience, we phrase the null hypothesis as “T9 is true” and the alternative hypothesis as “T9 is not true”; rejection of $H_{0}$ then implies that T9 is not true. This is more or less a typical test procedure. Note that selection bias is minor in the sense that $K_{\text{select}}/K_{\text{all}}\approx 1$ for many taxa, and non-selective tests may result in few false positives because $K_{\text{true}}/K_{\text{all}}\approx 0$ . Therefore selective inference offers little benefit in outside mode.

In addition to $p$ -values for some trees and edges, estimated geometric quantities are also shown in the tables. We confirm that the sign of $\beta_{0}$ is estimated correctly for all the trees and edges. The estimated $\beta_{1}$ values are all positive, indicating the regions are convex. This is not surprising, because the regions are expressed as intersections of half spaces at least locally (figure 3B).

Now $p$ -values are examined in inside mode. (T1, E3) BP, AU, SI are all $p\leq 0.95$ . This indicates that T1 and E3 are not significantly supported, and nothing definite can be claimed. (E1) BP, AU, SI are all $p>0.95$ , indicating E1 is significantly supported. Since E1 is associated with the best 15 trees T1, …, T15, some of them are significantly better than the rest of trees T16, …, T105. Significance for edges is common in phylogenetics as well as in hierarchical clustering (Suzuki and Shimodaira, 2006). (E2) The results split for this presumably wrong edge. $\text{AU}>0.95$ suggests E2 is significantly supported, whereas $\text{BP},\text{SI}\leq 0.95$ are not significant. AU tends to violate the selective type-I error, leading to false positives or overconfidence in wrong trees/edges, whereas SI is approximately unbiased for the selected hypothesis. This overconfidence is explained by the inequality $\text{AU}>\text{SI}$ (meaning $\text{SI}^{\prime}$ here) for $\bm{y}\in\mathcal{R}$ , which is obtained by comparing (12) and (20). Therefore SI is preferable to AU in inside mode. BP is safer than AU in the sense that $\text{BP}<\text{AU}$ for $\beta_{1}>0$ , but BP is not guaranteed to control the type-I error in a frequentist sense. The two inequalities ( $\text{SI},\text{BP}<\text{AU}$ ) are verified as relative positions of the contour lines at $p=0.95$ in figure 5. The three $p$ -values can be very different from each other for large $\beta_{1}$ .

Next $p$ -values are examined in outside mode. (T2, E4, E6) BP, AU, SI are all $p\geq 0.05$ . They are not rejected, and nothing definite can be claimed. (T8, T9, …, T105, E9,…, E25) BP, AU, SI are all $p<0.05$ . These trees and edges are rejected. (T7, E8) The results split for this presumably true tree and edge. $\text{BP}<0.05$ suggests T7 and E8 are rejected, whereas $\text{AU},\text{SI}\geq 0.05$ are not significant. AU is approximately unbiased for controlling the type-I error when $H_{0}$ is specified in advance (Shimodaira, 2002). Since $\text{BP}<\text{AU}$ for $\beta_{1}>0$ , BP violates the type-I error, which results in overconfidence in non-rejected wrong trees. Therefore BP should be avoided in outside mode. Inequality $\text{AU}<\text{SI}$ can be shown for $\bm{y}\in\mathcal{R}^{C}$ by comparing (10) and (18). Since the null hypothesis $H_{0}:\bm{\mu}\in\mathcal{R}$ is chosen after looking at $\bm{y}\in\mathcal{R}^{C}$ , AU is not approximately unbiased for controlling the selective type-I error, whereas SI adjusts this selection bias. The two inequalities ( $\text{BP}<\text{AU}<\text{SI}$ ) are verified as relative positions of the contour lines at $p=0.05$ in figure 5. AU and SI behave similarly (Note: $K_{\text{select}}/K_{\text{all}}\approx 1$ ), while BP is very different from AU and SI for large $\beta_{1}$ . It is arguable which of AU and SI is appropriate: AU is preferable to SI in tree selection ( $K_{\text{true}}=1$ ), because the multiplicity of testing is controlled as $\text{FWER}=P(\text{reject any true null})=P(\text{AU}(\mathcal{R}_{\text{true tree}}|\bm{Y})<\alpha\mid\bm{\mu}\in\mathcal{R}_{\text{true tree}})\leq\alpha$ . The FWER is multiplied by $K_{\text{true}}\geq 1$ for edge selection, and SI does not fix it either. For testing edges in outside mode, AU may be used for screening purposes with a small $\alpha$ value such as $\alpha/K_{\text{true}}$ .

5 Conclusion

We have developed a new method for computing selective inference $p$ -values from multiscale bootstrap probabilities, and applied this new method to phylogenetics. It is demonstrated through theory and a real-data analysis that selective inference $p$ -values are particularly useful for testing selected edges (i.e., clades or clusters of species) to claim that they are significantly supported if $p>1-\alpha$ . On the other hand, the previously proposed non-selective approximately unbiased $p$ -values are still useful for testing candidate trees to claim that they are rejected if $p<\alpha$ . Although we focused on phylogenetics, our general theory of selective inference may be applied to other model selection problems, or more general selection problems.

6 Remarks

6.1 Bootstrap resampling of log-likelihoods

Non-parametric bootstrap is often time consuming for recomputing the maximum likelihood (ML) estimates for bootstrap replicates. Kishino, Miyata and Hasegawa (1990) considered the resampling of estimated log-likelihoods (RELL) method for reducing the computation. Let $\mathcal{X}_{n}=(\bm{x}_{1},\ldots,\bm{x}_{n})$ be the dataset of sample size $n$ , where $\bm{x}_{t}$ is the site-pattern of amino acids at site $t$ for $t=1,\ldots,n$ . By resampling $\bm{x}_{t}$ from $\mathcal{X}_{n}$ with replacement, we obtain a bootstrap replicate $\mathcal{X}^{*}_{n^{\prime}}=(\bm{x}^{*}_{1},\ldots,\bm{x}^{*}_{n^{\prime}})$ of sample size $n^{\prime}$ . Although $n^{\prime}=n$ for the ordinary bootstrap, we will use several $n^{\prime}>0$ values for the multiscale bootstrap. The parametric model of probability distribution for tree T $i$ is $p_{i}(\bm{x};\bm{\theta}_{i})$ for $i=1,\ldots,105$ , and the log-likelihood function is $\ell_{i}(\bm{\theta}_{i};\mathcal{X}_{n})=\sum_{t=1}^{n}\log p_{i}(\bm{x}_{t};\bm{\theta}_{i})$ . Computation of the ML estimate $\bm{\hat{\theta}}_{i}=\operatorname*{arg\,max}_{\bm{\theta}_{i}}\ell_{i}(\bm{\theta}_{i};\mathcal{X}_{n})$ is time consuming, so we do not recalculate $\bm{\hat{\theta}}^{*}_{i}=\operatorname*{arg\,max}_{\bm{\theta}_{i}}\ell_{i}(\bm{\theta}_{i};\mathcal{X}^{*}_{n^{\prime}})$ for bootstrap replicates. Define the site-wise log-likelihood at site $t$ for tree T $i$ as

$$\xi_{ti}=\log p_{i}(\bm{x}_{t};\bm{\hat{\theta}}_{i}),\tag{22}$$

so that the log-likelihood value for tree T $i$ is written as $\ell_{i}(\bm{\hat{\theta}}_{i};\mathcal{X}_{n})=\sum_{t=1}^{n}\xi_{ti}$ . The bootstrap replicate of the log-likelihood value is approximated as

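$$\ell_{i}(\bm{\hat{\theta}}^{*}_{i};\mathcal{X}^{*}_{n^{\prime}})\approx\ell_{i}(\bm{\hat{\theta}}_{i};\mathcal{X}^{*}_{n^{\prime}})=\sum_{t=1}^{n}w^{*}_{t}\xi_{ti},\tag{23}$$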

where $w^{*}_{t}$ is the number of times $\bm{x}_{t}$ appears in $\mathcal{X}^{*}_{n^{\prime}}$ . The accuracy of this approximation as well as the higher-order term is given in eqs. (4) and (5) of Shimodaira (2001). Once $\ell_{i}(\bm{\hat{\theta}}^{*}_{i};\mathcal{X}^{*}_{n^{\prime}})$ , $i=1,\ldots,105$ , are computed by (23), its ML tree is T $\hat{i}^{*}$ with $\hat{i}^{*}=\operatorname*{arg\,max}_{i=1,\ldots,105}\ell_{i}(\bm{\hat{\theta}}^{*}_{i};\mathcal{X}^{*}_{n^{\prime}})$ .

The non-parametric bootstrap probability of tree T $i$ is obtained as follows. We generate $B$ bootstrap replicates $\mathcal{X}^{*b}_{n^{\prime}}$ , $b=1,\ldots,B$ . In this paper, we used $B=10^{5}$ . For each $\mathcal{X}^{*b}_{n^{\prime}}$ , the ML tree T $\hat{i}^{*b}$ is computed by the method described above. Then we count the frequency that T $i$ becomes the ML tree in the $B$ replicates. The non-parametric bootstrap probability of tree T $i$ is computed by

$$\text{BP}(\text{T}i,n^{\prime})=\#\{b\mid\hat{i}^{*b}=i,\ b=1,\ldots,B\}/B.\tag{24}$$

The non-parametric bootstrap probability of an edge is computed by summing $\text{BP}(\text{T}i,n^{\prime})$ over the associated trees.
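The RELL procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used for the tables: the site-wise log-likelihoods are random toy numbers for three hypothetical trees, the number of replicates is far smaller than $B=10^{5}$ , and the grouping of trees 0 and 1 under one edge is invented for the example.

```python
import random


def rell_bootstrap_probabilities(xi, n_prime, n_boot=1000, seed=0):
    """Bootstrap probability of each tree by resampling site-wise log-likelihoods.

    xi[t][i] is the site-wise log-likelihood of tree i at site t.
    """
    rng = random.Random(seed)
    n_sites, n_trees = len(xi), len(xi[0])
    wins = [0] * n_trees
    for _ in range(n_boot):
        loglik = [0.0] * n_trees
        # Drawing n_prime sites with replacement is equivalent to weighting
        # site t by the number of times it appears in the replicate.
        for t in rng.choices(range(n_sites), k=n_prime):
            for i in range(n_trees):
                loglik[i] += xi[t][i]
        best = max(range(n_trees), key=loglik.__getitem__)   # ML tree of the replicate
        wins[best] += 1
    return [w / n_boot for w in wins]


# Toy site-wise log-likelihoods for n = 200 sites and three trees.
data_rng = random.Random(42)
n = 200
xi = []
for _ in range(n):
    base = data_rng.gauss(-3.0, 1.0)
    xi.append([base + data_rng.gauss(shift, 0.3) for shift in (0.05, 0.0, -0.05)])

# Multiscale bootstrap: several replicate sizes n', i.e. sigma^2 = n / n'.
bp_by_scale = {
    n_prime: rell_bootstrap_probabilities(xi, n_prime)
    for n_prime in (n // 2, n, 2 * n)
}
edge_bp = bp_by_scale[n][0] + bp_by_scale[n][1]   # hypothetical edge shared by trees 0 and 1
print(bp_by_scale, edge_bp)
```

No likelihood is re-maximized inside the loop, which is the point of RELL: each replicate costs only a weighted sum of precomputed values.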

An example of the transformation $\bm{Y}^{*}=\bm{f}_{n}(\mathcal{X}^{*}_{n^{\prime}})$ mentioned in Section 3.4 is

[TABLE]

where $\bm{L}^{*}_{n^{\prime}}=(1/n^{\prime})(\ell^{*}_{1},\ldots,\ell^{*}_{105})^{T}$ with $\ell^{*}_{i}=\ell_{i}(\bm{\hat{\theta}}^{*}_{i};\mathcal{X}^{*}_{n^{\prime}})$ and $\bm{V}_{n}$ is the variance matrix of $\bm{L}^{*}_{n}$ . According to the approximation (23) and the central limit theorem, (13) holds well for sufficiently large $n$ and $n^{\prime}$ with $m=104$ and $\sigma^{2}=n/n^{\prime}$ . It also follows from the above argument that $\text{var}(\ell^{*}_{i}-\ell^{*}_{j})\approx(n^{\prime}/n)\|\bm{\xi}_{i}-\bm{\xi}_{j}\|^{2}$ , and thus the variance of log-likelihood difference is

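$$\text{var}\bigl(\ell_{i}(\bm{\hat{\theta}}_{i};\mathcal{X}_{n})-\ell_{j}(\bm{\hat{\theta}}_{j};\mathcal{X}_{n})\bigr)\approx\|\bm{\xi}_{i}-\bm{\xi}_{j}\|^{2},\tag{26}$$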

which gives another insight into the visualization of Section 6.2, where the variance can be interpreted as the divergence between the two models; see eq. (27). This approximation holds well when the two predictive distributions $p_{i}(\bm{x};\bm{\hat{\theta}}_{i})$ , $p_{j}(\bm{x};\bm{\hat{\theta}}_{j})$ are not very close to each other. When they are close to each other, however, the higher-order term ignored in (26) becomes dominant, and there is a difficulty for deriving the limiting distribution of the log-likelihood difference in the model selection test (Shimodaira, 1997; Schennach and Wilhelm, 2017).

6.2 Visualization of Probability Models

For representing the probability distribution of tree T $i$ , we define $\bm{\xi}_{i}:=(\xi_{1i},\ldots,\xi_{ni})^{T}\in\mathbb{R}^{n}$ from (22) for $i=1,\ldots,15$ . The idea behind the visualization of figure 3 is that locations of $\bm{\xi}_{i}$ in $\mathbb{R}^{n}$ will represent locations of $p_{i}(\bm{x};\bm{\hat{\theta}}_{i})$ in the space of probability distributions. Let $D_{\text{KL}}(p_{i}\|p_{j})$ be the Kullback-Leibler divergence between the two distributions. For sufficiently small $(1/n)\|\bm{\xi}_{i}-\bm{\xi}_{j}\|^{2}$ , the squared distance in $\mathbb{R}^{n}$ approximates $n$ times Jeffreys divergence

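$$\|\bm{\xi}_{i}-\bm{\xi}_{j}\|^{2}\approx n\times\bigl(D_{\text{KL}}(p_{i}(\bm{x};\bm{\hat{\theta}}_{i})\|p_{j}(\bm{x};\bm{\hat{\theta}}_{j}))+D_{\text{KL}}(p_{j}(\bm{x};\bm{\hat{\theta}}_{j})\|p_{i}(\bm{x};\bm{\hat{\theta}}_{i}))\bigr)\tag{27}$$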

for non-nested models (Shimodaira, 2001, Section 6). When a model $p_{0}$ is nested in $p_{i}$ , it becomes $\|\bm{\xi}_{i}-\bm{\xi}_{0}\|^{2}\approx 2n\times D_{\text{KL}}(p_{i}(\bm{x};\bm{\hat{\theta}}_{i})\|p_{0}(\bm{x};\bm{\hat{\theta}}_{0}))\approx 2\times(\ell_{i}(\bm{\hat{\theta}}_{i};\mathcal{X}_{n})-\ell_{0}(\bm{\hat{\theta}}_{0};\mathcal{X}_{n}))$ . We explain three different visualizations of figure 7. There are only minor differences between the plots, and the visualization is not sensitive to the details.

For dimensionality reduction, we have to specify the origin $\bm{c}\in\mathbb{R}^{n}$ and consider vectors $\bm{a}_{i}:=\bm{\xi}_{i}-\bm{c}$ . A naive choice would be the average $\bm{c}=\sum_{i=1}^{15}\bm{\xi}_{i}/15$ . By applying PCA without centering and scaling (e.g., prcomp with option center=FALSE, scale=FALSE in R) to the matrix $(\bm{a}_{1},\ldots,\bm{a}_{15})$ , we obtain the visualization of $\bm{\xi}_{i}$ as the axes (red arrows) of biplot in figure 7A.

For computing the “data point” $X$ in figure 3, we need more models. Let tree T106 be the star topology with no internal branch (completely unresolved tree), and T107 $,\ldots,$ T131 be partially resolved tree topologies with only one internal branch corresponding to E1 $,\ldots,$ E25, whereas T1 $,\ldots,$ T105 are fully resolved trees (bifurcating trees). Then define $\bm{\eta}_{i}:=\bm{\xi}_{106+i}$ , $i=0,\ldots,25$ . Now we take $\bm{c}=\bm{\eta}_{0}$ for computing $\bm{a}_{i}=\bm{\xi}_{i}-\bm{\eta}_{0}$ and $\bm{b}_{i}=\bm{\eta}_{i}-\bm{\eta}_{0}$ . There is a hierarchy of models: $\bm{\eta}_{0}$ is the submodel nested in all the other models, and $\bm{\eta}_{1},\bm{\eta}_{2},\bm{\eta}_{3}$ , for example, are submodels of $\bm{\xi}_{1}$ (T1 includes E1, E2, E3). By combining these non-nested models, we can reconstruct a comprehensive model in which all the other models are nested as submodels (Shimodaira, 2001, eq. (10) in Section 5). The idea is analogous to reconstructing the full model $y=\beta_{1}x_{1}+\cdots+\beta_{25}x_{25}+\epsilon$ of multiple regression from submodels $y=\beta_{1}x_{1}+\epsilon,\ldots,y=\beta_{25}x_{25}+\epsilon$ . Thus we call it the “full model” in this paper, and the ML estimate of the full model is indicated as the data point $X$ ; it is also called the “super model” in Shimodaira and Hasegawa (2005). Let $\bm{B}=(\bm{b}_{1},\ldots,\bm{b}_{25})\in\mathbb{R}^{n\times 25}$ and $\bm{d}=(\|\bm{b}_{1}\|^{2},\ldots,\|\bm{b}_{25}\|^{2})^{T}\in\mathbb{R}^{25}$ , then the vector for the full model is computed approximately by

$$\bm{a}_{X}=\bm{B}(\bm{B}^{T}\bm{B})^{-1}\bm{d}.\tag{28}$$
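One way to read this computation is as follows: the full-model vector lies in the span of the $\bm{b}_{i}$ , and its orthogonal projection onto each $\bm{b}_{i}$ is $\bm{b}_{i}$ itself, that is, $\bm{b}_{i}^{T}\bm{a}_{X}=\|\bm{b}_{i}\|^{2}$ . The sketch below assumes this reading, which gives $\bm{a}_{X}=\bm{B}(\bm{B}^{T}\bm{B})^{-1}\bm{d}$ , and solves the small linear system directly. The two toy vectors and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    k = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * k
    for r in reversed(range(k)):
        rest = sum(aug[r][c] * x[c] for c in range(r + 1, k))
        x[r] = (aug[r][k] - rest) / aug[r][r]
    return x


def full_model_vector(b_vectors):
    """Return a_X = B (B^T B)^{-1} d for the columns b_1, ..., b_k of B."""
    gram = [[dot(bi, bj) for bj in b_vectors] for bi in b_vectors]   # B^T B
    d = [dot(bi, bi) for bi in b_vectors]                            # squared norms
    coef = solve(gram, d)
    dim = len(b_vectors[0])
    return [sum(c * b[t] for c, b in zip(coef, b_vectors)) for t in range(dim)]


# Two non-orthogonal toy submodel vectors in R^3.
b_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
a_x = full_model_vector(b_vectors)
print(a_x)
```

The direct solve assumes that $\bm{B}^{T}\bm{B}$ is well conditioned; when it is not, a truncated pseudo-inverse is a common alternative.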

For the visualization of the best 15 trees, we may use only $\bm{b}_{1},\ldots,\bm{b}_{11}$ , because they include E1 and two more edges from E2 $,\ldots,$ E11. In figures 3 and 7B, we actually modified the above computation slightly so that the star topology T106 is replaced by T107, the partially resolved tree corresponding to E1 (T107 can also be regarded as a star topology by treating clade (23) as a leaf of the tree), and the 10 partially resolved trees for E2 $,\ldots,$ E11 are replaced by those for (E1,E2) $,\ldots,$ (E1,E11), respectively; the origin becomes the maximal model nested in all the 15 trees, and $X$ becomes the minimal full model containing all the 15 trees. Just before applying PCA in figure 7B, $\bm{a}_{1},\ldots,\bm{a}_{15}$ are projected to the space orthogonal to $\bm{a}_{X}$ , so that the plot becomes the “top-view” of figure 3A with $\bm{a}_{X}$ being at the origin.

In figure 7C, we attempted an even simpler computation without using ML estimates for partially resolved trees. We used $\bm{B}=(\bm{a}_{1},\ldots,\bm{a}_{15})$ and $\bm{d}=(\|\bm{a}_{1}\|^{2},\ldots,\|\bm{a}_{15}\|^{2})^{T}$ , and took the largest 10 singular values for computing the inverse in (28). The orthogonal projection to $\bm{a}_{X}$ is applied before PCA.

6.3 Asymptotic Theory of Smooth Surfaces

For expressing the shape of the region $\mathcal{R}\subset\mathbb{R}^{m+1}$ , we use a local coordinate system $(\bm{u},v)\in\mathbb{R}^{m+1}$ with $\bm{u}\in\mathbb{R}^{m},v\in\mathbb{R}$ . In a neighborhood of $\bm{y}$ , the region is expressed as

$$\mathcal{R}=\{(\bm{u},v)\mid v\leq-h(\bm{u}),\ \bm{u}\in\mathbb{R}^{m}\},\tag{29}$$

where $h$ is a smooth function; see Shimodaira (2008) for the theory of non-smooth surfaces. The boundary surface $\partial\mathcal{R}$ is expressed as $v=-h(\bm{u})$ , $\bm{u}\in\mathbb{R}^{m}$ . We can choose the coordinates so that $\bm{y}=(\bm{0},\beta_{0})$ (i.e., $\bm{u}=(0,\ldots,0)$ and $v=\beta_{0}$ ), and $h(\bm{0})=0$ , $\partial h/\partial u_{i}|_{\bm{0}}=0$ , $i=1,\ldots,m$ . The projection now becomes the origin $\bm{\hat{\mu}}=(\bm{0},0)$ , and the signed distance is $\beta_{0}$ . The mean curvature of surface $\partial\mathcal{R}$ at $\bm{\hat{\mu}}$ is now defined as

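$$\beta_{1}=\frac{1}{2}\sum_{i=1}^{m}\frac{\partial^{2}h}{\partial u_{i}\partial u_{i}}\Big|_{\bm{0}},\tag{30}$$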

which is interpreted as the trace of the hessian matrix of $h$ . When $\mathcal{R}$ is convex at least locally in the neighborhood, all the eigenvalues of the hessian are non-negative, leading to $\beta_{1}\geq 0$ , whereas concave $\mathcal{R}$ leads to $\beta_{1}\leq 0$ . In particular, $\beta_{1}=0$ when $\partial\mathcal{R}$ is flat (i.e., $h(\bm{u})\equiv 0$ ).
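The role of the mean curvature can be checked on a region whose curvature is known in closed form. The sketch below is an illustration under assumptions that are not in the original: $\mathcal{R}$ is taken to be a ball of radius $r$ in $\mathbb{R}^{m+1}$ touching the origin from below, so that $h(\bm{u})\approx\|\bm{u}\|^{2}/(2r)$ near $\bm{u}=\bm{0}$ , and $\beta_{1}$ is taken as half the trace of the Hessian, $m/(2r)$ . The simulated bootstrap probability is then compared with $\bar{\Phi}(\beta_{0}+\beta_{1})$ .

```python
import random
from statistics import NormalDist


def upper_tail(x):
    return NormalDist().cdf(-x)


def bp_ball(beta0, radius, m, n_rep=100_000, seed=1):
    """Monte Carlo BP of a ball of the given radius centered at (0, -radius).

    The observation is y = (0, beta0), whose projection onto the ball is the origin.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        u_sq = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))   # ||U*||^2
        v = rng.gauss(beta0, 1.0)                                 # V*
        if u_sq + (v + radius) ** 2 <= radius ** 2:
            hits += 1
    return hits / n_rep


beta0, radius, m = 1.0, 10.0, 4
beta1 = m / (2.0 * radius)          # assumed mean curvature of the ball's surface
bp_mc = bp_ball(beta0, radius, m)
bp_formula = upper_tail(beta0 + beta1)
bp_flat = upper_tail(beta0)         # value that ignores curvature
print(f"simulated = {bp_mc:.4f}, curved formula = {bp_formula:.4f}, flat = {bp_flat:.4f}")
```

The simulated value should lie close to the curved formula and visibly below the flat value, reflecting that a convex region with $\beta_{1}>0$ captures fewer replicates than a half space at the same distance.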

Since the transformation $\bm{y}=\bm{f}_{n}(\mathcal{X}_{n})$ depends on $n$ , the shape of the region $\mathcal{R}$ actually depends on $n$ , although the dependency is implicit in the notation. As $n$ becomes larger, the standard deviation of estimates, in general, decreases at the rate $n^{-1/2}$ . For keeping the variance constant in (4), we actually magnify the space by the factor $n^{1/2}$ , meaning that the boundary surface $\partial\mathcal{R}$ approaches flat as $n\to\infty$ . More specifically, the magnitude of mean curvature is of order $\beta_{1}=O_{p}(n^{-1/2})$ . The magnitude of $\partial^{3}h/\partial u_{i}\partial u_{j}\partial u_{k}$ and higher order derivatives is $O_{p}(n^{-1})$ , and we ignore these terms in our asymptotic theory. For keeping $\bm{\mu}=O(1)$ in (4), we also consider the setting of “local alternatives”, meaning that the parameter values approach an origin on the boundary at the rate $n^{-1/2}$ .

6.4 Bridging the Problem of Regions to the Z-Test

Here we explain the problem of regions in terms of the $z$ -test by bridging the multivariate problem of Section 3 to the 1-dimensional case of Section 1.

Ideal $p$ -values are uniformly distributed over $p\in(0,1)$ when the null hypothesis holds. In fact, $\text{AU}(\mathcal{R}|\bm{Y})\sim U(0,1)$ for $\bm{\mu}\in\partial\mathcal{R}$ as indicated in (11). The statistic $\text{AU}(\mathcal{R}|\bm{Y})$ may be called pivotal in the sense that the distribution does not change when $\bm{\mu}\in\partial\mathcal{R}$ moves on the surface. Here we ignore the error of $O_{p}(n^{-1})$ , and consider only the second order asymptotic accuracy. From (10), we can write $\text{AU}(\mathcal{R}|\bm{Y})\simeq\bar{\Phi}(\beta_{0}(\bm{Y})-\beta_{1}(\bm{Y}))$ , where the notation such as $\beta_{0}(\bm{Y})$ and $\beta_{1}(\bm{Y})$ indicates the dependency on $\bm{Y}$ . Since $\beta_{1}(\bm{Y})\simeq\beta_{1}(\bm{y})=\beta_{1}$ , we treat $\beta_{1}(\bm{Y})$ as a constant. Now we get the normal pivotal quantity (Efron, 1985) as $\bar{\Phi}^{-1}(\text{AU}(\mathcal{R}|\bm{Y}))=\beta_{0}(\bm{Y})-\beta_{1}\sim N(0,1)$ for $\bm{\mu}\in\partial\mathcal{R}$ . More generally, it becomes

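$$\bar{\Phi}^{-1}(\text{AU}(\mathcal{R}|\bm{Y}))\simeq\beta_{0}(\bm{Y})-\beta_{1}\sim N(\beta_{0}(\bm{\mu}),1).\tag{31}$$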

Let us look at the $z$ -test in Section 1, and consider substitutions:

$$Z=\beta_{0}(\bm{Y})-\beta_{1},\quad\theta=\beta_{0}(\bm{\mu}),\quad c=-\beta_{1}.$$

The 1-dimensional model (1) is now equivalent to (31). The null hypothesis is also equivalent: $\theta\leq 0\Leftrightarrow\beta_{0}(\bm{\mu})\leq 0\Leftrightarrow\bm{\mu}\in\mathcal{R}$ . We can easily verify that AU corresponds to $p(z)$ , because $p(z)=\bar{\Phi}(z)=\bar{\Phi}(\beta_{0}(\bm{y})-\beta_{1})\simeq\text{AU}(\mathcal{R}|\bm{y})$ , which is expected from the way we obtained (31) above. Furthermore, we can derive SI from $p(z,c)$ . First verify that the selection event is equivalent: $Z>c\Leftrightarrow\beta_{0}(\bm{Y})-\beta_{1}>-\beta_{1}\Leftrightarrow\beta_{0}(\bm{Y})>0\Leftrightarrow\bm{Y}\in\mathcal{R}^{C}$ . Finally, we obtain SI as $p(z,c)=p(z)/\bar{\Phi}(c)\simeq\bar{\Phi}(\beta_{0}(\bm{y})-\beta_{1})/\bar{\Phi}(-\beta_{1})\simeq\text{SI}(\mathcal{R}|\bm{y})$ .
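The effect of conditioning on the selection event can be seen in a small simulation of the 1-dimensional problem. The sketch below assumes $\theta=0$ (the boundary of the null hypothesis), the selection event $Z>c$ with $c=-\beta_{1}$ , and a hypothetical value $\beta_{1}=0.5$ . It compares how often $p(z)=\bar{\Phi}(z)$ and $p(z,c)=p(z)/\bar{\Phi}(c)$ fall below $\alpha$ among the selected samples.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def upper_tail(x):
    return _STD_NORMAL.cdf(-x)


def rejection_rates(c, alpha=0.05, n_rep=200_000, seed=0):
    """Conditional rejection rates of the non-selective and selective z-tests."""
    rng = random.Random(seed)
    selection_prob = upper_tail(c)
    selected = naive = selective = 0
    for _ in range(n_rep):
        z = rng.gauss(0.0, 1.0)          # theta = 0: boundary of the null
        if z <= c:
            continue                     # not selected, no test performed
        selected += 1
        p_naive = upper_tail(z)          # p(z)
        p_selective = p_naive / selection_prob   # p(z, c)
        naive += p_naive < alpha
        selective += p_selective < alpha
    return naive / selected, selective / selected


beta1 = 0.5                              # hypothetical mean curvature
naive_rate, selective_rate = rejection_rates(c=-beta1)
print(f"non-selective: {naive_rate:.4f}, selective: {selective_rate:.4f}")
```

The selective rate should be close to $\alpha=0.05$ , whereas the non-selective rate should be close to $\alpha/\bar{\Phi}(c)$ , which exceeds $\alpha$ .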

6.5 Model Fitting in Multiscale Bootstrap

We have used thirteen $\sigma^{2}$ values from 1/9 to 9 (equally spaced in log-scale). This range is relatively large, and we observe a slight deviation from the linear model $\beta_{0}+\beta_{1}\sigma^{2}$ in figure 6. Therefore we fit other models to the observed values of $\psi_{\sigma^{2}}$ as implemented in scaleboot package (Shimodaira, 2008). For example, poly. $k$ model is $\sum_{i=0}^{k-1}\beta_{i}\sigma^{2i}$ , and sing.3 model is $\beta_{0}+\beta_{1}\sigma^{2}(1+\beta_{2}(\sigma-1))^{-1}$ . In figure 6A, poly.3 is the best model according to AIC (Akaike, 1974). In figure 6B, poly.2, poly.3, and sing.3 are combined by model averaging with Akaike weights. Then $\beta_{0}$ and $\beta_{1}$ are estimated from the tangent line to the fitted curve of $\psi_{\sigma^{2}}$ at $\sigma^{2}=1$ . In figure 6, the tangent line is drawn as red line for extrapolating $\psi_{\sigma^{2}}$ to $\sigma^{2}=-1$ . Shimodaira (2008); Terada and Shimodaira (2017) considered the Taylor expansion of $\psi_{\sigma^{2}}$ at $\sigma^{2}=1$ as a generalization of the tangent line for improving the accuracy of AU and SI.

In the implementations of CONSEL (Shimodaira and Hasegawa, 2001) and pvclust (Suzuki and Shimodaira, 2006), we use a narrower range of $\sigma^{2}$ values (ten $\sigma^{-2}$ values: 0.5, 0.6, $\ldots,$ 1.4), and only the linear model $\beta_{0}+\beta_{1}\sigma^{2}$ is fitted. The estimated $\beta_{0}$ and $\beta_{1}$ should be very close to those estimated from the tangent line described above. An advantage of using a wider range of $\sigma^{2}$ in scaleboot is that the standard errors of $\beta_{0}$ and $\beta_{1}$ become smaller.

6.6 General Formula of Selective Inference

Let $\mathcal{H},\mathcal{S}\subset\mathbb{R}^{m+1}$ be regions for the null hypothesis and the selection event, respectively. We would like to test the null hypothesis $H_{0}:\bm{\mu}\in\mathcal{H}$ against the alternative $H_{1}:\bm{\mu}\in\mathcal{H}^{C}$ conditioned on the selection event $\bm{y}\in\mathcal{S}$ . We have considered the outside mode $\mathcal{H}=\mathcal{R},\mathcal{S}=\mathcal{R}^{C}$ in (18) and the inside mode $\mathcal{H}=\mathcal{R}^{C},\mathcal{S}=\mathcal{R}$ in (20). For a general case of $\mathcal{H},\mathcal{S}$ , Terada and Shimodaira (2017) gave a formula of approximately unbiased $p$ -value of selective inference as

[TABLE]

where the geometric quantities $\beta_{0},\beta_{1}$ are defined for the regions $\mathcal{H},\mathcal{S}$. We assumed that $\mathcal{H}$ and $\mathcal{S}^{C}$ are expressed as (29), and that the two surfaces $\partial\mathcal{H},\partial\mathcal{S}$ are nearly parallel to each other, with tangent planes differing only by $O_{p}(n^{-1/2})$. The last assumption always holds for (18), because $\partial\mathcal{H}=\partial\mathcal{R}$ and $\partial\mathcal{S}=\partial\mathcal{R}^{C}$ are identical and therefore parallel to each other.

Here we explain why we have considered the special case of $\mathcal{S}=\mathcal{H}^{C}$ for phylogenetic inference. First, we suppose that the selection event satisfies $\mathcal{S}\subset\mathcal{H}^{C}$ , because a reasonable test would not reject $H_{0}$ unless $\bm{y}\in\mathcal{H}^{C}$ . Note that $\bm{y}\in\mathcal{S}\subset\mathcal{H}^{C}$ implies $0\leq-\beta_{0}^{\mathcal{S}}\leq\beta_{0}^{\mathcal{H}}$ . Therefore, $\beta_{0}^{\mathcal{H}}+\beta_{0}^{\mathcal{S}}\geq 0$ leads to

$$\text{SI}(\mathcal{H}|\mathcal{S},\bm{y})\geq\text{SI}(\mathcal{H}|\bm{y}),$$

where $\text{SI}(\mathcal{H}|\bm{y}):=\text{SI}(\mathcal{H}|\mathcal{H}^{C},\bm{y})$ is obtained from (33) by letting $\beta_{0}^{\mathcal{H}}+\beta_{0}^{\mathcal{S}}=0$ for $\mathcal{S}=\mathcal{H}^{C}$. The $p$-value $\text{SI}(\mathcal{H}|\mathcal{S},\bm{y})$ becomes smaller as $\mathcal{S}$ grows, and $\mathcal{S}=\mathcal{H}^{C}$ gives the smallest $p$-value, leading to the most powerful selective test. Therefore the choice $\mathcal{S}=\mathcal{H}^{C}$ is preferable to any other choice of selection event satisfying $\mathcal{S}\subset\mathcal{H}^{C}$. This kind of property is mentioned in Fithian, Sun and Taylor (2014) as the monotonicity of selective error in the context of “data carving”.

Let us see how these two $p$-values differ for the case of E2 by specifying $\mathcal{H}=\mathcal{R}^{C}_{\text{E2}}$ and $\mathcal{S}=\mathcal{R}_{\text{T1}}$. In this case, the two surfaces $\partial\mathcal{H},\partial\mathcal{S}$ may not be nearly parallel to each other, which would violate the assumption of $\text{SI}(\mathcal{H}|\mathcal{S},\bm{y})$, so we only intend to show the potential difference between the two $p$-values. The geometric quantities are $\beta_{0}^{\mathcal{H}}=-\beta_{0}^{\text{E2}}=1.59$, $\beta_{1}^{\mathcal{H}}=-\beta_{1}^{\text{E2}}=-0.12$, and $\beta_{0}^{\mathcal{S}}=\beta_{0}^{\text{T1}}=-0.41$; the $p$-values are calculated using more decimal places than shown. SI of E2 conditioned on selecting T1 is

[TABLE]

and it is very different from SI of E2 conditioned on selecting E2

[TABLE]

where $\text{SI}^{\prime}(\mathcal{R}_{\text{E2}}^{C}|\bm{y})=1-\text{SI}(\mathcal{R}_{\text{E2}}^{C}|\bm{y})=0.903$ is shown in table 2. As this comparison shows, $\text{SI}(\mathcal{H}|\bm{y})$ rejects $H_{0}$ more easily than $\text{SI}(\mathcal{H}|\mathcal{S},\bm{y})$.
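The comparison for E2 can be sketched numerically. The displayed formula (33) is not reproduced here, so the code assumes the form $\bar{\Phi}(\beta_{0}^{\mathcal{H}}-\beta_{1}^{\mathcal{H}})/\bar{\Phi}(\beta_{0}^{\mathcal{H}}+\beta_{0}^{\mathcal{S}}-\beta_{1}^{\mathcal{H}})$. This assumed form reduces to $\text{SI}(\mathcal{H}|\bm{y})$ when $\beta_{0}^{\mathcal{H}}+\beta_{0}^{\mathcal{S}}=0$ and decreases as $\mathcal{S}$ grows, as the text requires, but it should be checked against Terada and Shimodaira (2017). The function names are hypothetical, and the inputs are the rounded values quoted above, so only approximate agreement with the reported $p$-values is expected.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def upper_tail(x: float) -> float:
    """Upper-tail probability of N(0, 1), i.e. 1 - Phi(x)."""
    return 1.0 - _STD_NORMAL.cdf(x)


def si_general(beta0_h: float, beta1_h: float, beta0_s: float) -> float:
    """ASSUMED form of SI(H | S, y); see the lead-in for the caveat."""
    numerator = upper_tail(beta0_h - beta1_h)
    denominator = upper_tail(beta0_h + beta0_s - beta1_h)
    return numerator / denominator


def si_complement(beta0_h: float, beta1_h: float) -> float:
    """SI(H | y): the special case S = H^C, where beta0_s = -beta0_h."""
    return si_general(beta0_h, beta1_h, -beta0_h)


# Rounded geometric quantities for E2 and T1 quoted in the text.
beta0_h, beta1_h, beta0_s = 1.59, -0.12, -0.41

si_given_t1 = si_general(beta0_h, beta1_h, beta0_s)  # conditioned on selecting T1
si_given_e2 = si_complement(beta0_h, beta1_h)        # conditioned on selecting E2
print(si_given_t1, si_given_e2, 1.0 - si_given_e2)
```

With these rounded inputs, `1.0 - si_given_e2` is close to the value 0.903 reported in table 2, and `si_given_t1` is larger than `si_given_e2`, in line with the monotonicity argument.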

6.7 Number of Regions for Phylogenetic Inference

The regions $\mathcal{R}_{i}$, $i=1,\ldots,K_{\text{all}}$ correspond to trees or edges. In both inside and outside modes, the total number of regions is $K_{\text{all}}=105$ for trees and $K_{\text{all}}=25$ for edges when the number of taxa is $N=6$. For general $N\geq 3$, these numbers grow rapidly as $K_{\text{all}}=(2N-5)!/(2^{N-3}(N-3)!)$ for trees and $K_{\text{all}}=2^{N-1}-(N+1)$ for edges. Next consider the number of selected regions $K_{\text{select}}$. In inside mode, regions with $\bm{y}\in\mathcal{R}_{i}$ are selected, and the number is $K_{\text{select}}=1$ for trees and $K_{\text{select}}=N-3=3$ for edges. In outside mode, regions with $\bm{y}\not\in\mathcal{R}_{i}$ are selected, and thus the number is $K_{\text{all}}$ minus that for inside mode: $K_{\text{select}}=K_{\text{all}}-1=104$ for trees and $K_{\text{select}}=K_{\text{all}}-(N-3)=22$ for edges. Finally, consider the number of true null hypotheses, denoted as $K_{\text{true}}$. The null hypothesis holds true when $\bm{\mu}\not\in\mathcal{R}_{i}$ in inside mode and $\bm{\mu}\in\mathcal{R}_{i}$ in outside mode, and thus $K_{\text{true}}$ is the same as the number of regions with $\bm{y}\not\in\mathcal{R}_{i}$ in inside mode and $\bm{y}\in\mathcal{R}_{i}$ in outside mode. Ignoring the case of $\bm{y}\in\partial\mathcal{R}_{i}$, these numbers do not depend on the value of $\bm{y}$. Therefore, $K_{\text{true}}=K_{\text{all}}-K_{\text{select}}$ for both cases.
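These counts are easy to tabulate for any number of taxa. The sketch below simply evaluates the formulas of this section; the function names and the `kind`/`mode` string arguments are hypothetical conveniences.

```python
from math import factorial


def num_trees(n_taxa: int) -> int:
    """K_all for trees: (2N-5)! / (2^(N-3) (N-3)!), valid for N >= 3."""
    return factorial(2 * n_taxa - 5) // (2 ** (n_taxa - 3) * factorial(n_taxa - 3))


def num_edges(n_taxa: int) -> int:
    """K_all for edges: 2^(N-1) - (N+1), valid for N >= 3."""
    return 2 ** (n_taxa - 1) - (n_taxa + 1)


def region_counts(n_taxa: int, kind: str, mode: str) -> tuple[int, int, int]:
    """Return (K_all, K_select, K_true) for kind in {'tree', 'edge'}
    and mode in {'inside', 'outside'}."""
    k_all = num_trees(n_taxa) if kind == "tree" else num_edges(n_taxa)
    # Number of regions containing y: one tree, or the N-3 edges of that tree.
    k_inside = 1 if kind == "tree" else n_taxa - 3
    k_select = k_inside if mode == "inside" else k_all - k_inside
    return k_all, k_select, k_all - k_select


for kind in ("tree", "edge"):
    for mode in ("inside", "outside"):
        print(kind, mode, region_counts(6, kind, mode))
```

For $N=6$ this reproduces the values stated above: 105 trees and 25 edges, with 104 trees and 22 edges selected in outside mode.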

6.8 Selective Inference of Lasso Regression

Selective inference is considered for variable selection in regression analysis. Here, we deal with the prostate cancer data (Stamey et al., 1989), in which we predict the level of prostate-specific antigen (PSA) from clinical measures. The dataset is available in the R package ElemStatLearn (Halvorsen, 2015). We fit a linear model to the log of PSA (lpsa) with $8$ predictors, such as the log prostate weight (lweight) and age. All the variables are standardized to have zero mean and unit variance.

The goal is to provide valid selective inference for the partial regression coefficients of the variables selected by lasso (Tibshirani, 1996). Let $n$ and $p$ be the number of observations and the number of predictors, respectively. $\bm{\hat{M}}$ is the set of selected variables, and $\bm{\hat{s}}$ represents the signs of the selected regression coefficients. We suppose that the regression responses are distributed as $\bm{Y}\sim N(\bm{\mu},\tau^{2}\bm{I}_{n})$, where $\bm{\mu}\in\mathbb{R}^{n}$ and $\tau>0$. Let $e_{i}$ be the $i$th residual. By resampling the scaled residuals $\sigma e_{i}\;(i=1,\dots,n)$ with several values of scale $\sigma^{2}$, we can apply the multiscale bootstrap method described in Section 4 to selective inference in the regression problem. Here, we note that the target of the inference is the true partial regression coefficients:

[TABLE]

where $\bm{X}\in\mathbb{R}^{n\times p}$ is the design matrix. We compute four types of intervals with confidence level $1-\alpha=0.95$ for selected variable $j$ . $[L_{j}^{\text{ordinary}},U_{j}^{\text{ordinary}}]$ is the non-selective confidence interval obtained via $t$ -distribution. $[L_{j}^{\text{model}},U_{j}^{\text{model}}]$ is the selective confidence interval under the selected model proposed by Lee et al. (2016) and Tibshirani et al. (2016), which is computed by fixedLassoInf with type="full" in R package selectiveInference (Tibshirani et al., 2017). By extending the method of $[L_{j}^{\text{model}},U_{j}^{\text{model}}]$ , we also computed $[L_{j}^{\text{variable}},U_{j}^{\text{variable}}]$ , which is the selective confidence interval under the selection event that variable $j$ is selected. These three confidence intervals are exact, in the sense that

[TABLE]

Note that the selection event of variable $j$, i.e., $\{j\in\bm{\hat{M}},\hat{s}_{j}\}$, can be represented as a union of polyhedra in $\mathbb{R}^{n}$, and thus, according to the polyhedral lemma (Lee et al., 2016; Tibshirani et al., 2016), we can compute a valid confidence interval $[L_{j}^{\text{variable}},U_{j}^{\text{variable}}]$. However, this computation is prohibitive for $p>10$, because all the possible combinations of models containing variable $j$ must be considered. Therefore, we compute its approximation $[\hat{L}_{j}^{\text{variable}},\hat{U}_{j}^{\text{variable}}]$ by the multiscale bootstrap method of Section 4, which is much faster even for larger $p$.
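The resampling of scaled residuals mentioned earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that a bootstrap response is formed as the fitted value plus a scaled residual drawn with replacement, and the function name and toy data are hypothetical. The lasso refit and the selection check on each replicate are omitted.

```python
import math
import random


def multiscale_residual_bootstrap(
    fitted: list[float],
    residuals: list[float],
    sigma2: float,
    rng: random.Random,
) -> list[float]:
    """One bootstrap replicate of the responses at scale sigma^2.

    ASSUMPTION: y*_i = fitted_i + sigma * e*_i, where e*_i is drawn with
    replacement from the residuals e_1, ..., e_n.
    """
    sigma = math.sqrt(sigma2)
    n = len(residuals)
    return [fitted[i] + sigma * residuals[rng.randrange(n)] for i in range(n)]


# Hypothetical toy data, for illustration only.
fitted = [1.0, 2.0, 3.0, 4.0]
residuals = [0.5, -0.25, -0.5, 0.25]
rng = random.Random(0)

# One replicate per scale; in practice many replicates are drawn at each
# scale and the frequency of the selection event is recorded.
for sigma2 in (0.25, 1.0, 4.0):
    print(sigma2, multiscale_residual_bootstrap(fitted, residuals, sigma2, rng))
```

Larger values of $\sigma^{2}$ spread the replicates further from the fitted values, which is what allows the bootstrap probabilities to be observed over a range of scales.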

We set $\lambda=10$ as the penalty parameter of lasso, and the following model and signs were selected:

[TABLE]

The confidence intervals are shown in figure 1. Because they adjust for the selection bias, the three confidence intervals of selective inference are longer than the ordinary confidence interval. Comparing $[L_{j}^{\text{model}},U_{j}^{\text{model}}]$ and $[L_{j}^{\text{variable}},U_{j}^{\text{variable}}]$, the latter is shorter and would be preferable. This is because the selection event of the latter is less restrictive, as $\{\bm{\hat{M}},\bm{\hat{s}}\}\subseteq\{j\in\bm{\hat{M}},\hat{s}_{j}\}$; see Section 6.6 for the reason why a larger selection event is better. Finally, we verify that $[\hat{L}_{j}^{\text{variable}},\hat{U}_{j}^{\text{variable}}]$ approximates $[L_{j}^{\text{variable}},U_{j}^{\text{variable}}]$ very well.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Author Contributions

HS and YT developed the theory of selective inference. HS programmed the multiscale bootstrap software and conducted the phylogenetic analysis. YT conducted the lasso analysis. HS wrote the manuscript. All authors have approved the final version of the manuscript.

Funding

This research was supported in part by JSPS KAKENHI Grant (16H02789 to HS, 16K16024 to YT).

Acknowledgments

The authors appreciate the feedback from the audience of HS's seminar talk at the Department of Statistics, Stanford University. The authors are grateful to Masami Hasegawa for his insightful comments on the phylogenetic analysis of mammal species.

Data Availability Statement

The datasets analyzed for this study can be found in the software package scaleboot (Shimodaira, 2019).

Bibliography

1. Adachi, J. and Hasegawa, M. (1996). Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42 459–468.
2. Akaike, H. (1974). A new look at the statistical model identification. Automatic Control, IEEE Transactions on 19 716–723.
3. Amari, S.-I. and Nagaoka, H. (2007). Methods of information geometry 191. American Mathematical Soc.
4. Burnham, K. P. and Anderson, D. R. (2002). Model selection and multimodel inference: a practical information-theoretic approach. Springer.
5. Cox, D. R. (1962). Further results on tests of separate families of hypotheses. Journal of the Royal Statistical Society. Series B (Methodological) 24 406–424.
6. Efron, B. (1979). Bootstrap Methods: Another Look At the Jackknife. Annals of Statistics 7 1–26.
7. Efron, B. (1984). Comparing non-nested linear models. Journal of the American Statistical Association 79 791–803.
8. Efron, B. (1985). Bootstrap Confidence Intervals for a Class of Parametric Problems. Biometrika 72 45–58.