Rank-Based Causal Discovery for Post-Nonlinear Models

Grigor Keropyan; David Strieder; Mathias Drton

arXiv:2302.12341·stat.ML·September 11, 2023

TL;DR

This paper introduces a novel rank-based method for discovering causal relationships in post-nonlinear models, improving robustness and consistency over existing residual dependency approaches.

Contribution

The paper proposes a new rank-based approach for PNL causal discovery that leverages model invariances and separates function estimation from independence testing.

Findings

01. Method is consistent in theory.

02. Performs well in numerical experiments.

03. Reduces overfitting compared to residual dependency methods.

Tables (24)

Table 1: Results of RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-100,100)$ for Gaussian noise (100 repetitions).

-	Gaussian noise
-	RankG	AbPNL	RESIT
100	2.51 $\pm$ 1.26	3.94 $\pm$ 1.09	2.4 $\pm$ 1.16
500	1.86 $\pm$ 1.31	3.46 $\pm$ 1.08	2.41 $\pm$ 1.23
1000	1.61 $\pm$ 1.15	3.14 $\pm$ 1.14	2.43 $\pm$ 1.16
1500	1.66 $\pm$ 1.35	3.28 $\pm$ 1.18	2.62 $\pm$ 1.36
2000	1.63 $\pm$ 1.17	3.23 $\pm$ 1.12	3.15 $\pm$ 1.51

Table 2: Results of RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-100,100)$ for Gumbel noise (100 repetitions).

-	Gumbel noise
-	RankG	AbPNL	RESIT
100	2.86 $\pm$ 1.36	3.98 $\pm$ 1.36	2.3 $\pm$ 1.18
500	2.34 $\pm$ 1.45	3.07 $\pm$ 1.24	2.49 $\pm$ 1.2
1000	1.8 $\pm$ 1.28	3.31 $\pm$ 1.11	2.43 $\pm$ 1.28
1500	1.68 $\pm$ 1.1	3.06 $\pm$ 1.03	3.04 $\pm$ 1.31
2000	1.37 $\pm$ 1.14	3.17 $\pm$ 1.18	3.02 $\pm$ 1.41

Table 3: Results of RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-100,100)$ for Logistic noise (100 repetitions).

-	Logistic noise
-	RankG	AbPNL	RESIT
100	2.61 $\pm$ 1.35	3.89 $\pm$ 1.11	2.31 $\pm$ 1.1
500	2.06 $\pm$ 1.34	3.32 $\pm$ 1.24	2.36 $\pm$ 1.13
1000	1.85 $\pm$ 1.37	3.41 $\pm$ 1.1	2.66 $\pm$ 1.24
1500	1.68 $\pm$ 1.38	3.27 $\pm$ 1.2	2.92 $\pm$ 1.32
2000	1.79 $\pm$ 1.18	3.29 $\pm$ 1.16	3.44 $\pm$ 1.38

Table 4: Results of RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-10,10)$ for Gaussian noise (100 repetitions).

-	Gaussian noise
-	RankG	AbPNL	RESIT
100	1.21 $\pm$ 1.23	3.7 $\pm$ 1.28	2.72 $\pm$ 1.2
500	0.21 $\pm$ 0.56	2.8 $\pm$ 1.01	2.69 $\pm$ 1.12
1000	0.15 $\pm$ 0.59	3.18 $\pm$ 1.08	2.73 $\pm$ 1.08
1500	0.04 $\pm$ 0.24	3.06 $\pm$ 1.08	2.9 $\pm$ 1.06
2000	0.05 $\pm$ 0.3	3.26 $\pm$ 1.17	3.1 $\pm$ 1.29

Table 5: Results of RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-10,10)$ for Gumbel noise (100 repetitions).

-	Gumbel noise
-	RankG	AbPNL	RESIT
100	1.44 $\pm$ 1.26	3.47 $\pm$ 1.13	2.86 $\pm$ 1.26
500	0.42 $\pm$ 0.91	2.61 $\pm$ 1.1	2.62 $\pm$ 1.19
1000	0.13 $\pm$ 0.39	2.88 $\pm$ 1.15	2.75 $\pm$ 1.15
1500	0.15 $\pm$ 0.54	2.76 $\pm$ 1.14	2.98 $\pm$ 1.2
2000	0.08 $\pm$ 0.37	2.83 $\pm$ 1.21	3.29 $\pm$ 1.26

Table 6: Results of RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-10,10)$ for Logistic noise (100 repetitions).

-	Logistic noise
-	RankG	AbPNL	RESIT
100	1.05 $\pm$ 1.3	3.82 $\pm$ 1.15	2.5 $\pm$ 1.24
500	0.39 $\pm$ 0.79	2.84 $\pm$ 1.02	2.46 $\pm$ 1
1000	0.14 $\pm$ 0.47	3.01 $\pm$ 1.14	2.53 $\pm$ 1.11
1500	0 $\pm$ 0	2.95 $\pm$ 1.18	2.98 $\pm$ 1.05
2000	0 $\pm$ 0	2.69 $\pm$ 1.17	3.28 $\pm$ 1.21

Table 7: Results of RankG, RESIT and AbPNL methods on 7 nodes with $\beta\sim U(-100,100)$ for Gaussian noise (100 repetitions).

-	Gaussian noise
-	RankG	AbPNL	RESIT
100	4.65 $\pm$ 2.21	6.72 $\pm$ 2.13	3.89 $\pm$ 1.85
500	3.75 $\pm$ 2.11	5.4 $\pm$ 1.88	4.1 $\pm$ 1.76
1000	3.29 $\pm$ 2.03	5.46 $\pm$ 1.96	4.27 $\pm$ 1.95
1500	2.94 $\pm$ 2.13	5.4 $\pm$ 1.87	4.34 $\pm$ 1.88
2000	3.35 $\pm$ 2.09	5.49 $\pm$ 1.84	5.02 $\pm$ 2.09

Table 8: Results of RankG, RESIT and AbPNL methods on 7 nodes with $\beta\sim U(-100,100)$ for Gumbel noise (100 repetitions).

-	Gumbel noise
-	RankG	AbPNL	RESIT
100	4.74 $\pm$ 2.12	6.33 $\pm$ 2.26	3.95 $\pm$ 2.09
500	3.68 $\pm$ 1.97	5.41 $\pm$ 1.83	3.95 $\pm$ 1.75
1000	3.4 $\pm$ 1.98	5.39 $\pm$ 1.92	3.78 $\pm$ 1.92
1500	3.33 $\pm$ 2.05	5.96 $\pm$ 2.12	4.66 $\pm$ 1.99
2000	3.47 $\pm$ 2.61	5.26 $\pm$ 1.7	5.07 $\pm$ 2.41

Table 9: Results of RankG, RESIT and AbPNL methods on 7 nodes with $\beta\sim U(-100,100)$ for Logistic noise (100 repetitions).

-	Logistic noise
-	RankG	AbPNL	RESIT
100	4.56 $\pm$ 2.1	6.45 $\pm$ 2.08	4.01 $\pm$ 1.96
500	3.65 $\pm$ 2.13	5.52 $\pm$ 1.49	3.86 $\pm$ 1.85
1000	3.88 $\pm$ 2.42	5.35 $\pm$ 1.96	4.18 $\pm$ 1.88
1500	2.93 $\pm$ 1.98	5.37 $\pm$ 1.8	4.42 $\pm$ 2.01
2000	3.39 $\pm$ 1.97	5.1 $\pm$ 1.64	4.96 $\pm$ 2.17

Table 10: Results of RankG, RESIT and AbPNL methods on 7 nodes with $\beta\sim U(-10,10)$ for Gaussian noise (100 repetitions).

-	Gaussian noise
-	RankG	AbPNL	RESIT
100	2.44 $\pm$ 1.69	6.71 $\pm$ 2.47	4.15 $\pm$ 1.75
500	1.11 $\pm$ 1.46	5.06 $\pm$ 1.75	4.4 $\pm$ 1.84
1000	0.28 $\pm$ 0.75	5.08 $\pm$ 1.87	4.47 $\pm$ 1.88
1500	0.19 $\pm$ 0.6	5.18 $\pm$ 2	4.35 $\pm$ 1.79
2000	0.19 $\pm$ 0.69	5.26 $\pm$ 1.55	4.69 $\pm$ 2

Table 11: Results of RankG, RESIT and AbPNL methods on 7 nodes with $\beta\sim U(-10,10)$ for Gumbel noise (100 repetitions).

-	Gumbel noise
-	RankG	AbPNL	RESIT
100	3.13 $\pm$ 2.02	6.11 $\pm$ 2.22	4.17 $\pm$ 1.76
500	0.88 $\pm$ 1.4	4.32 $\pm$ 1.65	4.41 $\pm$ 1.74
1000	0.38 $\pm$ 0.93	4.7 $\pm$ 1.87	4.22 $\pm$ 1.97
1500	0.37 $\pm$ 0.97	4.88 $\pm$ 1.78	4.88 $\pm$ 1.87
2000	0.14 $\pm$ 0.55	4.81 $\pm$ 1.94	5.02 $\pm$ 2.06

Table 12: Results of RankG, RESIT and AbPNL methods on 7 nodes with $\beta\sim U(-10,10)$ for Logistic noise (100 repetitions).

-	Logistic noise
-	RankG	AbPNL	RESIT
100	2.65 $\pm$ 2	6.83 $\pm$ 2.2	4.03 $\pm$ 1.54
500	0.68 $\pm$ 1.09	4.87 $\pm$ 1.82	4.12 $\pm$ 1.77
1000	0.45 $\pm$ 1.01	5.08 $\pm$ 1.79	4.27 $\pm$ 1.8
1500	0.34 $\pm$ 0.98	4.89 $\pm$ 1.87	4.55 $\pm$ 1.76
2000	0.2 $\pm$ 0.68	4.43 $\pm$ 1.69	4.56 $\pm$ 1.81

Table 13: Results of RankS, RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-100,100)$ for Gaussian noise (100 repetitions).

-	Gaussian noise
-	RankS	RankG	AbPNL	RESIT
100	2.23 $\pm$ 1.42	2.51 $\pm$ 1.26	3.94 $\pm$ 1.09	2.4 $\pm$ 1.16
150	1.73 $\pm$ 1.49	2.75 $\pm$ 1.25	2.6 $\pm$ 1.2	2.47 $\pm$ 1.13
200	1.63 $\pm$ 1.34	2.24 $\pm$ 1.32	2.87 $\pm$ 1.19	2.3 $\pm$ 1.02
250	1.56 $\pm$ 1.27	2.3 $\pm$ 1.16	3.07 $\pm$ 1.15	2.4 $\pm$ 1.22
300	1.49 $\pm$ 1.29	2.24 $\pm$ 1.35	2.97 $\pm$ 1.02	2.54 $\pm$ 1.1

Table 14: Results of RankS, RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-100,100)$ for Gumbel noise (100 repetitions).

-	Gumbel noise
-	RankS	RankG	AbPNL	RESIT
100	1.69 $\pm$ 1.36	2.86 $\pm$ 1.36	3.98 $\pm$ 1.36	2.3 $\pm$ 1.18
150	1.76 $\pm$ 1.28	2.56 $\pm$ 1.4	2.88 $\pm$ 1.34	2.37 $\pm$ 1.1
200	1.77 $\pm$ 1.38	2.78 $\pm$ 1.3	2.77 $\pm$ 1.17	2.38 $\pm$ 1.25
250	1.73 $\pm$ 1.27	2.44 $\pm$ 1.18	3.04 $\pm$ 1.13	2.45 $\pm$ 1.16
300	1.84 $\pm$ 1.23	2.4 $\pm$ 1.43	3.11 $\pm$ 1.12	2.35 $\pm$ 1.1

Table 15: Results of RankS, RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-100,100)$ for Logistic noise (100 repetitions).

-	Logistic noise
-	RankS	RankG	AbPNL	RESIT
100	2.02 $\pm$ 1.32	2.61 $\pm$ 1.35	3.89 $\pm$ 1.11	2.31 $\pm$ 1.1
150	1.8 $\pm$ 1.26	2.58 $\pm$ 1.39	2.84 $\pm$ 1.26	2.7 $\pm$ 1.23
200	1.63 $\pm$ 1.28	2.5 $\pm$ 1.4	3.17 $\pm$ 1.25	2.55 $\pm$ 1.25
250	1.45 $\pm$ 1.23	2.21 $\pm$ 1.3	3.32 $\pm$ 1.23	2.3 $\pm$ 1.25
300	1.61 $\pm$ 1.32	2.3 $\pm$ 1.23	3 $\pm$ 1.04	2.4 $\pm$ 1.17

Table 16: Results of RankS, RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-10,10)$ for Gaussian noise (100 repetitions).

-	Gaussian noise
-	RankS	RankG	AbPNL	RESIT
100	1.72 $\pm$ 1.45	1.21 $\pm$ 1.23	3.7 $\pm$ 1.28	2.72 $\pm$ 1.2
150	1.54 $\pm$ 1.36	1.11 $\pm$ 1.48	2.3 $\pm$ 1.41	2.69 $\pm$ 1.2
200	1.37 $\pm$ 1.37	0.71 $\pm$ 1.09	2.52 $\pm$ 1.29	2.72 $\pm$ 1.05
250	1.37 $\pm$ 1.28	0.77 $\pm$ 1.14	2.55 $\pm$ 1.07	2.78 $\pm$ 1.22
300	1.21 $\pm$ 1.27	0.48 $\pm$ 0.89	3.1 $\pm$ 1.28	2.57 $\pm$ 1.17

Table 17: Results of RankS, RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-10,10)$ for Gumbel noise (100 repetitions).

-	Gumbel noise
-	RankS	RankG	AbPNL	RESIT
100	1.73 $\pm$ 1.43	1.44 $\pm$ 1.26	3.47 $\pm$ 1.13	2.86 $\pm$ 1.26
150	1.63 $\pm$ 1.37	1.22 $\pm$ 1.33	2.04 $\pm$ 1.19	2.64 $\pm$ 1.28
200	1.41 $\pm$ 1.31	0.81 $\pm$ 1.02	2.5 $\pm$ 1.08	2.7 $\pm$ 1.32
250	1.48 $\pm$ 1.26	0.62 $\pm$ 1.08	2.38 $\pm$ 1.25	2.58 $\pm$ 1.24
300	1.72 $\pm$ 1.42	0.69 $\pm$ 1.05	2.44 $\pm$ 1.19	2.76 $\pm$ 1.12

Table 18: Results of RankS, RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-10,10)$ for Logistic noise (100 repetitions).

-	Logistic noise
-	RankS	RankG	AbPNL	RESIT
100	1.78 $\pm$ 1.4	1.05 $\pm$ 1.3	3.82 $\pm$ 1.15	2.5 $\pm$ 1.24
150	1.44 $\pm$ 1.34	0.74 $\pm$ 1.02	2.86 $\pm$ 1.25	2.56 $\pm$ 1.16
200	1.55 $\pm$ 1.27	0.57 $\pm$ 1.02	3.18 $\pm$ 1.24	2.37 $\pm$ 1.17
250	1.3 $\pm$ 1.14	0.36 $\pm$ 0.72	3.14 $\pm$ 1.25	2.28 $\pm$ 1.09
300	1.38 $\pm$ 1.43	0.33 $\pm$ 0.75	3.23 $\pm$ 1.15	2.51 $\pm$ 1.11

Table 19: Results of RankS, RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-100,100)$ for Gaussian noise and quartic polynomial $g$ (100 repetitions).

-	Gaussian noise
-	RankS	RankG	AbPNL	RESIT
100	1.24 $\pm$ 0.92	3.33 $\pm$ 1.41	3.7 $\pm$ 1.32	2.75 $\pm$ 1.31
150	1.42 $\pm$ 1.07	2.96 $\pm$ 1.5	3.1 $\pm$ 1.49	2.56 $\pm$ 1.21
200	1.37 $\pm$ 1.1	3.26 $\pm$ 1.54	3.37 $\pm$ 1.34	2.42 $\pm$ 1.32
250	1.78 $\pm$ 1.18	2.8 $\pm$ 1.55	3.35 $\pm$ 1.23	2.33 $\pm$ 1.12
300	1.63 $\pm$ 1	3.32 $\pm$ 1.59	3.52 $\pm$ 1.23	2.64 $\pm$ 1.31

Table 20: Results of RankS, RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-100,100)$ for Gumbel noise and quartic polynomial $g$ (100 repetitions).

-	Gumbel noise
-	RankS	RankG	AbPNL	RESIT
100	1.11 $\pm$ 0.92	3 $\pm$ 1.37	3.33 $\pm$ 1.36	2.59 $\pm$ 1.23
150	1.51 $\pm$ 1.01	3.14 $\pm$ 1.37	2.7 $\pm$ 1.27	2.78 $\pm$ 1.2
200	1.56 $\pm$ 0.96	3.29 $\pm$ 1.41	3.22 $\pm$ 1.22	2.81 $\pm$ 1.26
250	1.63 $\pm$ 0.93	3.52 $\pm$ 1.41	3.34 $\pm$ 1.08	2.94 $\pm$ 1.29
300	1.86 $\pm$ 0.98	3.41 $\pm$ 1.44	3.61 $\pm$ 1.15	2.73 $\pm$ 1.25

Table 21: Results of RankS, RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-100,100)$ for Logistic noise and quartic polynomial $g$ (100 repetitions).

-	Logistic noise
-	RankS	RankG	AbPNL	RESIT
100	2.02 $\pm$ 1.32	2.61 $\pm$ 1.35	3.52 $\pm$ 1.24	2.31 $\pm$ 1.1
150	1.8 $\pm$ 1.26	2.58 $\pm$ 1.39	3.23 $\pm$ 1.18	2.7 $\pm$ 1.23
200	1.63 $\pm$ 1.28	2.5 $\pm$ 1.4	3.47 $\pm$ 1.09	2.55 $\pm$ 1.25
250	1.45 $\pm$ 1.23	2.21 $\pm$ 1.3	3.35 $\pm$ 1.07	2.3 $\pm$ 1.25
300	1.61 $\pm$ 1.32	2.3 $\pm$ 1.23	3.45 $\pm$ 1.41	2.4 $\pm$ 1.17

Table 22: Results of RankS, RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-10,10)$ for Gaussian noise and quartic polynomial $g$ (100 repetitions).

-	Gaussian noise
-	RankS	RankG	AbPNL	RESIT
100	2.01 $\pm$ 1.38	2.92 $\pm$ 1.66	3.64 $\pm$ 1.28	2.63 $\pm$ 1.17
150	2.34 $\pm$ 1.29	2.75 $\pm$ 1.72	2.64 $\pm$ 1.43	2.8 $\pm$ 1.29
200	2.44 $\pm$ 1.42	2.7 $\pm$ 1.62	3.19 $\pm$ 1.35	2.81 $\pm$ 1.32
250	2.53 $\pm$ 1.49	2.42 $\pm$ 1.65	3.57 $\pm$ 1.27	2.63 $\pm$ 1.24
300	2.48 $\pm$ 1.57	2.79 $\pm$ 1.8	3.7 $\pm$ 1.2	2.58 $\pm$ 1.34

Table 23: Results of RankS, RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-10,10)$ for Gumbel noise and quartic polynomial $g$ (100 repetitions).

-	Gumbel noise
-	RankS	RankG	AbPNL	RESIT
100	1.85 $\pm$ 1.17	2.99 $\pm$ 1.49	3.55 $\pm$ 1.13	2.83 $\pm$ 1.23
150	2.15 $\pm$ 1.21	2.68 $\pm$ 1.56	2.54 $\pm$ 1.43	2.84 $\pm$ 1.39
200	2.43 $\pm$ 1.37	2.69 $\pm$ 1.59	2.71 $\pm$ 1.27	2.87 $\pm$ 1.3
250	2.51 $\pm$ 1.33	2.9 $\pm$ 1.61	3.2 $\pm$ 1.26	3.05 $\pm$ 1.45
300	2.32 $\pm$ 1.35	3.03 $\pm$ 1.38	3.1 $\pm$ 1.18	2.94 $\pm$ 1.32

Table 24: Results of RankS, RankG, RESIT and AbPNL methods on 4 nodes with $\beta\sim U(-10,10)$ for Logistic noise and quartic polynomial $g$ (100 repetitions).

-	Logistic noise
-	RankS	RankG	AbPNL	RESIT
100	2.12 $\pm$ 1.17	2.97 $\pm$ 1.45	3.49 $\pm$ 1.36	2.61 $\pm$ 1.38
150	2.29 $\pm$ 1.39	2.93 $\pm$ 1.42	3.09 $\pm$ 1.39	2.63 $\pm$ 1.19
200	2.4 $\pm$ 1.36	2.91 $\pm$ 1.54	3.5 $\pm$ 1.17	2.45 $\pm$ 1.2
250	2.34 $\pm$ 1.34	2.81 $\pm$ 1.5	3.34 $\pm$ 1.23	2.53 $\pm$ 1.15
300	2.48 $\pm$ 1.44	2.76 $\pm$ 1.72	3.29 $\pm$ 1.19	2.59 $\pm$ 1.24

Equations (188)

$$h(Y)=g(X)+\varepsilon,$$

$$h(Y)=X^{T}\beta_{0}+\varepsilon,$$

$$P(Y_{j}>Y_{i}\mid X_{j},X_{i})=P\big(\varepsilon_{i}-\varepsilon_{j}<(X_{j}-X_{i})^{T}\beta_{0}\mid X_{j},X_{i}\big)=\Phi\left(\frac{(X_{j}-X_{i})^{T}\beta_{0}}{\sqrt{2}}\right),$$

$$\ell_{prl}(\beta):=\dots+\mathds{1}\{Y_{j}\leq Y_{i}\}\log\Phi\left(\frac{(X_{i}-X_{j})^{T}\beta}{\sqrt{2}}\right).$$

$$\hat{\beta}_{prl}:=\underset{\beta\in\mathbb{R}^{m}}{\arg\max}\ \ell_{prl}(\beta).$$

$$\hat{\beta}_{prl}-\beta_{0}=o_{P}(1).$$

$$\hat{F}(z):=\frac{1}{n+1}\sum_{i=1}^{n}\mathds{1}\{Z_{i}\leq z\}.$$

$$F_{\beta}(z):=\frac{1}{n}\sum_{i=1}^{n}\Phi(z-X_{i}^{T}\beta).$$

$$\hat{h}_{G}(Y_{i}):=F_{\hat{\beta}_{prl}}^{-1}\big(\hat{F}(h(Y_{i}))\big),\quad i=1,\dots,n.$$

$$S(\beta):=(2n)^{-1}\,\dots\,\Big(\dots+\mathds{1}\{Y_{j}\leq Y_{i}\}\Phi\left(\sqrt{n}(X_{i}-X_{j})^{T}\beta\right)\Big).$$

$$\hat{\theta}:=\underset{\theta\in\mathbb{R}^{m-1}}{\arg\max}\ S(\theta,1),$$

$$\lVert\hat{\theta}-\theta_{0}\rVert_{2}=O_{P}(n^{-\frac{1}{2}})$$

$$Q(z,y,\hat{\beta}):=\frac{1}{n(n-1)}\,\dots\,\Big[\dots\times\Phi\big(n((X_{j}-X_{i})^{T}\hat{\beta}-z)\big)\Big],$$

$$\hat{h}(y):=\underset{z\in\Omega_{h}}{\arg\max}\ Q(z,y,\hat{\beta}),$$

$$\hat{h}(y):=\underset{z\in\mathbb{R}}{\arg\max}\ \{Q(z,y,\hat{\beta})-\lambda z^{2}\}.$$

$$X^{(k)}=f^{(k)}\big(g^{(k)}(X^{(\mathbf{PA}_{k})})+\varepsilon^{(k)}\big),\quad k=1,\dots,m,$$

$$\Pi^{0}:=\{\pi:G^{\pi}\text{ is a super-graph of }G^{0}\}.$$

$$h^{(k)}(X^{(k)})=(X^{(\mathbf{ND}_{k})})^{T}\beta^{(k)}+\varepsilon^{(k)},\quad k=1,\dots,m,$$

$$X^{(-k)}:=(X^{(1)},\dots,X^{(k-1)},X^{(k+1)},\dots,X^{(m)}).$$

$$\hat{\varepsilon}_{j}^{(k)}:=\hat{h}^{(k)}(X_{j}^{(k)})-(X_{j}^{(-k)})^{T}\hat{\beta}^{(k)},\quad j=1,\dots,n.$$

$$t_{k}:=\mathrm{HSIC}\big(\{X_{j}^{(-k)},\hat{\varepsilon}_{j}^{(k)}\}_{j=1}^{n}\big),\quad k=1,\dots,m.$$

$$\hat{\pi}(m):=\underset{k}{\arg\min}\ \{t_{k}\}.$$

$$\hat{\pi}=(\hat{\pi}(1),\dots,\hat{\pi}(m)).$$

$$\mathrm{HSIC}(P^{N,X^{(A)}})>\xi,$$

$$N:=h(X^{(k)})-(X^{(A)})^{T}\beta.$$

$$P(\hat{\pi}\in\Pi^{0})\to 1\text{ as }n\to\infty.$$

$$X^{(k)}=\Big(\sum_{j\in\mathbf{PA}_{k}}\beta_{1j}X^{(j)}+\beta_{2j}(X^{(j)})^{2}+\varepsilon_{k}\Big)^{1/3}$$

$$\#\{(i,j):\hat{\pi}(i)\to\hat{\pi}(j)\in G\text{ and }j<i\}.$$

Full text

Rank-Based Causal Discovery for Post-Nonlinear Models

Grigor Keropyan

David Strieder

Mathias Drton

Technical University of Munich

Munich Center for Machine Learning

Abstract

Learning causal relationships from empirical observations is a central task in scientific research. A common method is to employ structural causal models that postulate noisy functional relations among a set of interacting variables. To ensure unique identifiability of causal directions, researchers consider restricted subclasses of structural causal models. Post-nonlinear (PNL) causal models constitute one of the most flexible options for such restricted subclasses, containing in particular the popular additive noise models as a further subclass. However, learning PNL models is not well studied beyond the bivariate case. The existing methods learn non-linear functional relations by minimizing residual dependencies and subsequently test independence from residuals to determine causal orientations. However, these methods can be prone to overfitting and, thus, difficult to tune appropriately in practice. As an alternative, we propose a new approach for PNL causal discovery that uses rank-based methods to estimate the functional parameters. This new approach exploits natural invariances of PNL models and disentangles the estimation of the non-linear functions from the independence tests used to find causal orientations. We prove consistency of our method and validate our results in numerical experiments.

1 INTRODUCTION

Discovering the causal structure of complex systems is an important question in various disciplines such as biology, economics, clinical medicine, or neuroscience (Opgen-Rhein and Strimmer, 2007; Glymour et al., 2019; Moneta et al., 2013). The gold-standard approach to exploring causal relations is to perform controlled experiments in which researchers externally intervene in the system and observe the resulting changes to variables of interest. However, in many applications controlled experiments are not feasible due to high cost or for ethical reasons. In such cases, causal discovery based only on observational data can be a useful tool (Spirtes and Zhang, 2016).

Structural Equation Models (SEMs) are a common tool for modeling causal relations. In their general form, SEMs postulate noisy functional relationships between a set of interacting variables. In the fully general setting, causal discovery methods such as constraint-based and score-based methods can identify the underlying causal structure only up to Markov equivalence classes (Spirtes et al., 2000). Thus, the literature has also considered many restricted subclasses that enable unique identification (Drton et al., 2011; Hoyer et al., 2008; Peters et al., 2014; Zhang and Hyvärinen, 2009). In this realm, post-nonlinear (PNL) causal models constitute one of the most general approaches. They are identifiable from the joint distribution under mild assumptions (Zhang and Hyvärinen, 2009; Peters et al., 2014) and yet offer a rather flexible framework for modeling complex non-linear causal systems.

Existing methods for bivariate PNL causal discovery estimate the functional relations by minimizing independence criteria (HSIC, mutual information, etc.) between the noise and potential parents in a first step, and perform independence tests to determine the causal structure in a second step (Zhang and Hyvärinen, 2009; Uemura and Shimizu, 2020). However, minimizing dependence to subsequently test for independence leads to potential overfitting and thus limits the PNL approach. Another approach considered by Tu et al. (2022) employs Optimal Transport theory for bivariate post-nonlinear and additive noise causal discovery, but it is not evident how to generalize their method to multivariate models. To our knowledge, the only work that deals with multivariate PNL models is Uemura et al. (2022), where the authors generalize the bivariate method from Uemura and Shimizu (2020) based on minimizing dependence and subsequently testing for independence.

In this article we present a new method for multivariate causal discovery in PNL models that disentangles the two tasks of learning the functional relations and learning the causal structure. Our method continues to learn the latter with the help of independence tests, but it employs rank-based methods to learn the functional relations and, in this way, avoids overfitting issues.

The remainder of the paper is organized as follows. In Section 2 we introduce the PNL rank regression methods that we use to learn the functional relations. We study the special case of linearity in the inner function and show consistency of our proposed rank-based functional parameter estimates. This special case includes general nonlinear functions using basis expansions. In Section 3 we discuss the causal order learning routine using our proposed rank-based estimates in a recursive process that finds sink nodes by independence testing. Furthermore, we show consistency of the causal order estimation for PNL models and present the results of a simulation study in Section 4, where we compare our method to existing causal learning methods. Section 5 concludes the paper.

2 PNL RANK REGRESSION

In this section we introduce the rank-based estimators that we use in the first step of our proposed causal learning algorithm. The goal is to employ these rank-based estimators to infer the functional relations among the variables and to obtain estimates of the stochastic noise terms in the model. In the second step, we use the estimated noise terms to test for independence.

Suppose we observe a sample of $n$ independent copies $(X_{1},Y_{1}),\dots,(X_{n},Y_{n})$ of a random vector $(X,Y)$. We assume the data generating process follows a PNL model, that is, the response variable $Y\in\mathbb{R}$ is given by

$$h(Y)=g(X)+\varepsilon,\qquad(1)$$

where $X\in\mathbb{R}^{m}$ is a continuous random vector and the stochastic error term $\varepsilon$ has mean zero with unknown continuous distribution, independent of $X$. Furthermore, we assume that the function $h:\mathbb{R}\to\mathbb{R}$ is continuous and strictly increasing (thus, invertible) whereas $g:\mathbb{R}^{m}\to\mathbb{R}$ may be an arbitrary function.

Under similar assumptions, Zhang and Hyvärinen (2009) suggested estimating the noise $\varepsilon=h(Y)-g(X)$ by representing $h$ and $g$ with multi-layer perceptrons (MLPs) and minimizing mutual information with $X$ via gradient-based methods. However, the main drawback of their methodology is that this model can fit any data perfectly by learning constant functions $h$ and $g$. The estimated noise is then constant and thus always independent of $X$.

To overcome this problem, Uemura and Shimizu (2020) implemented an additional auto-encoding structure in the minimization problem that enforces invertibility of the function $h$. While this circumvents the problem of constant estimation of the function $h$, there are further challenges that arise from minimizing dependence and subsequent testing for independence. Indeed, the complexity of the function class assumed for $g$ needs to be balanced very carefully with the available sample size. Otherwise, $g$ can be fitted perfectly such that $g(X)=h(Y)$, in which case the functional estimates cancel and the estimated noise is always independent of $X$. Uemura and Shimizu (2020) used a fixed architecture for the function classes of $g$ and $h$. As a result, especially for small sample sizes (compared to the complexity of the function class of $g$), their method is prone to overfitting and canceling the effect of the function $h$. Such overfitting may then entail erroneous results in independence tests for causal structure learning.

To avoid the noted overfitting issues, we propose the following two-stage method to learn the functional relations. In the first stage, we leverage rank statistics to separately estimate the function $g$, without any appeal to measures of dependence between the noise and the predictor $X$. The strictly increasing function $h$ preserves the ranks of $Y$ and thus, using rank-based methods, we can avoid estimating $h$ at this stage. This circumvents the problem of $g$ matching $h$. In the second stage, we estimate the functional relation $h$ at all observed data points to obtain the required estimates of the noise.
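The rank invariance that the first stage relies on can be checked numerically. The following is a minimal sketch, not part of the original method description: the choice $h(y)=y^{3}$, the coefficient 2.0, and the sample size are illustrative assumptions.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
x = rng.normal(size=200)
z = 2.0 * x + rng.normal(size=200)   # latent variable Z = h(Y) = g(X) + eps
y = np.cbrt(z)                       # observed Y = h^{-1}(Z), assuming h(y) = y**3

# A strictly increasing h leaves the ordering of the sample unchanged,
# so the ranks of Y coincide with the ranks of h(Y).
same_ranks = np.array_equal(rankdata(y), rankdata(z))
print(same_ranks)
```

Any statistic that depends on the data only through the ranks of $Y$ therefore takes the same value whether or not $h$ is known.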

To simplify the exposition and the theoretical analysis of our proposed method, we assume in the following that the function $g$ is linear, i.e., $g(X)=X^{T}\beta_{0}$ for $\beta_{0}\in\mathbb{R}^{m}$. This can also be seen as a first-order Taylor approximation of an arbitrary functional relation.

Remark 2.1.

Our framework and the idea of disentangling learning and testing by employing rank-based objective functions can be easily extended to the nonlinear case. By employing basis expansions, MLPs or any parametric function class in combination with the proposed rank-based scores to learn the functional relations, one can trace the steps of the presented linear case. For instance, consider the basis functions $\{b_{l}(\cdot):l=1,\dots,a_{n}\}$, where $a_{n}\to\infty$ sufficiently slowly, similar to Bühlmann et al. (2014). Then we can represent the (nonlinear) function $g$ by $\sum_{l=1}^{a_{n}}\alpha_{l}b_{l}(\cdot)$, where $\alpha_{l}\in\mathbb{R}$ for all $l=1,\dots,a_{n}$, and employ our proposed framework. The simulations in Section 4 include an example.
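As an illustration of the basis-expansion idea in this remark, the sketch below builds a polynomial design matrix so that a nonlinear $g$ becomes linear in the coefficients $\alpha_{l}$. The monomial basis, the helper name `polynomial_basis`, and the coefficient values are assumptions made for this example only.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Return the design matrix [x, x**2, ..., x**degree] for a 1-d input x."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.linspace(-1.0, 1.0, 5)
B = polynomial_basis(x, degree=2)   # one column per basis function b_l
alpha = np.array([0.5, -2.0])       # hypothetical coefficients alpha_l
g_values = B @ alpha                # g(x) = 0.5 * x - 2.0 * x**2, linear in alpha
```

The columns of `B` can then play the role of the covariates $X$ in the linear rank-based estimators.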

We start by studying the special case of model (1) under the assumption of Gaussian noise and derive a computationally efficient algorithm for estimating the functional parameters. Further, in Subsection 2.2 we consider the general case without restricting the noise distribution. The main idea of our approach is to leverage rank likelihoods. However, using the full marginal rank likelihood is not computationally tractable. A common approach to circumvent calculating the full marginal rank likelihood is to employ approximate Monte Carlo methods, e.g., as considered by Doksum (1987). However, we observed that this approach does not work well in practice for values of $\beta_{0}$ larger than one. Thus, in our proposed framework we employ pairwise rank likelihoods to approximate the full marginal rank likelihood (in the Gaussian case) or the rank correlation function (in the general noise case).

2.1 Gaussian Case

We assume that the data generating process follows the model

$$h(Y)=X^{T}\beta_{0}+\varepsilon,\qquad(2)$$

with some unknown $\beta_{0}\in\mathbb{R}^{m}$. Furthermore, we assume that the noise $\varepsilon$ follows a standard normal distribution and propose the following computationally fast algorithm to estimate the functional relations. The idea of this method is based on Yu et al. (2021).

We exploit the fact that $h$ is a strictly increasing function and therefore preserves the ranks of $\{Y_{i}\}_{i=1}^{n}$. The normality assumption yields $\varepsilon_{i}-\varepsilon_{j}\sim\mathcal{N}(0,2)$ and we obtain

$$P(Y_{j}>Y_{i}\mid X_{j},X_{i})=P\big(\varepsilon_{i}-\varepsilon_{j}<(X_{j}-X_{i})^{T}\beta_{0}\mid X_{j},X_{i}\big)=\Phi\left(\frac{(X_{j}-X_{i})^{T}\beta_{0}}{\sqrt{2}}\right),$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution. The normalized log pairwise rank likelihood function is then given by

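$$\ell_{prl}(\beta):=\dots+\mathds{1}\{Y_{j}\leq Y_{i}\}\log\Phi\left(\frac{(X_{i}-X_{j})^{T}\beta}{\sqrt{2}}\right).$$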

We estimate $\beta_{0}$ by maximizing $\ell_{prl}$, that is,

$$\hat{\beta}_{prl}:=\underset{\beta\in\mathbb{R}^{m}}{\arg\max}\ \ell_{prl}(\beta).$$

This defines a concave optimization problem, which leads to a computationally fast estimation routine for $\beta_{0}$ without precise knowledge or estimation of the function $h$.
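The estimation step can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the likelihood is averaged over all ordered pairs $i\neq j$ (the exact normalizing constant is an assumption and does not change the maximizer), the data are simulated with the hypothetical choices $h(y)=y^{3}$ and $\beta_{0}=(1,-0.5)$, and SciPy's BFGS routine stands in for whatever optimizer one prefers.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_pairwise_rank_loglik(beta, X, Y):
    """Negative log pairwise rank likelihood, averaged over ordered pairs i != j."""
    n = len(Y)
    diff = X[None, :, :] - X[:, None, :]   # diff[i, j] = X_j - X_i
    s = diff @ beta / np.sqrt(2.0)         # (X_j - X_i)^T beta / sqrt(2)
    y_gt = Y[None, :] > Y[:, None]         # indicator 1{Y_j > Y_i}
    # log Phi(s) if Y_j > Y_i, and log Phi(-s) otherwise
    ll = np.where(y_gt, norm.logcdf(s), norm.logcdf(-s))
    off_diag = ~np.eye(n, dtype=bool)      # drop the pairs with i == j
    return -ll[off_diag].mean()

# Simulated example (all values below are illustrative assumptions).
rng = np.random.default_rng(1)
n, beta0 = 300, np.array([1.0, -0.5])
X = rng.normal(size=(n, 2))
Z = X @ beta0 + rng.normal(size=n)         # Z = h(Y) with standard normal noise
Y = np.cbrt(Z)                             # h(y) = y**3 is never used by the estimator

res = minimize(neg_pairwise_rank_loglik, x0=np.zeros(2), args=(X, Y), method="BFGS")
beta_hat = res.x
print(beta_hat)
```

Only the ordering of `Y` enters the objective through `y_gt`, so replacing `Y` by any strictly increasing transformation of it yields the same estimate.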

Proposition 2.1.

The log pairwise rank likelihood function $\ell_{prl}(\beta)$ defined in (2.1) is concave. Moreover, if we assume that $n>m$, then $\ell_{prl}(\beta)$ is strictly concave.

The proof can be found in Appendix B.

Furthermore, the proposed estimator is consistent.

Theorem 2.1.

As $n\to\infty$ (and in particular $n>m$), it holds that

$$\hat{\beta}_{prl}-\beta_{0}=o_{P}(1).$$

For a proof we refer the reader to Appendix C.

In order to obtain an estimate of the noise, we estimate the transformation function $h$ in a second step. We employ the following computationally fast and consistent estimation routine proposed by Cuzick (1988). In his proposal, Cuzick (1988) considers non-random covariates. However, the results are applicable to our setup conditional on the observed data $\{X_{i}\}_{i=1}^{n}$. The method exploits the normality assumption as well as knowledge of the ranks of $\{Y_{i}\}_{i=1}^{n}$ and thus of $\{h(Y_{i})\}_{i=1}^{n}$.

Let $\hat{F}(z)$ be the adjusted empirical distribution function of $Z_{i}:=h(Y_{i})$, that is,

$$\hat{F}(z):=\frac{1}{n+1}\sum_{i=1}^{n}\mathds{1}\{Z_{i}\leq z\}.$$

Remark 2.2.

Note that we only require the ranks of $\{h(Y_{i})\}_{i=1}^{n}$ to obtain the estimate $\hat{F}(h(Y_{i}))$. Since $h$ is a strictly increasing function, the ranks of $\{h(Y_{i})\}_{i=1}^{n}$ are given by the ranks of $\{Y_{i}\}_{i=1}^{n}$.

We denote the cumulative distribution function of a randomly chosen $Z_{i}$ by $F_{\beta}(z)$, that is,

$$F_{\beta}(z):=\frac{1}{n}\sum_{i=1}^{n}\Phi(z-X_{i}^{T}\beta).$$

Then an estimator for the functional relation $h$ at the sample points $\{Y_{i}\}_{i=1}^{n}$ is given by

$$\hat{h}_{G}(Y_{i}):=F_{\hat{\beta}_{prl}}^{-1}\big(\hat{F}(h(Y_{i}))\big),\quad i=1,\dots,n.$$

By extending this estimator to a step function on $\mathbb{R}$, and under some additional assumptions, Cuzick (1988) shows that this estimator converges to $h$ almost surely at all continuity points of the function $h$.
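A minimal sketch of this second step is given below. It computes $\hat{F}(h(Y_{i}))$ from the ranks of $Y$ and inverts the mixture distribution function $F_{\beta}$ numerically. The function name `estimate_h_gaussian`, the use of root finding via `scipy.optimize.brentq`, the bracket width, and the simulated data (with the true $\beta_{0}$ plugged in for brevity) are all assumptions of this example.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm, rankdata

def estimate_h_gaussian(X, Y, beta_hat):
    """Estimate h(Y_i) as F_beta^{-1}(rank(Y_i) / (n + 1)) at every sample point."""
    n = len(Y)
    lin = X @ beta_hat
    u = rankdata(Y) / (n + 1.0)                  # F_hat(h(Y_i)), computed from ranks only
    F = lambda z: norm.cdf(z - lin).mean()       # mixture CDF F_beta(z)
    lo, hi = lin.min() - 10.0, lin.max() + 10.0  # bracket with F(lo) near 0 and F(hi) near 1
    return np.array([brentq(lambda z, ui=ui: F(z) - ui, lo, hi) for ui in u])

# Simulated example (illustrative assumptions: h(y) = y**3, beta0 known).
rng = np.random.default_rng(2)
n, beta0 = 400, np.array([1.0, -0.5])
X = rng.normal(size=(n, 2))
Z = X @ beta0 + rng.normal(size=n)               # true values of h(Y_i)
Y = np.cbrt(Z)

h_hat = estimate_h_gaussian(X, Y, beta0)
residuals = h_hat - X @ beta0                    # estimated noise terms
```

In practice `beta0` would be replaced by the estimate $\hat{\beta}_{prl}$ from the first step, and the resulting residuals serve as the noise estimates used for independence testing.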

Remark 2.3.

In our setup, we assume $h$ is continuous. Furthermore, the additional assumptions that ensure consistency in the setting from Cuzick (1988) are mainly smoothness and moment conditions on the distributions of $X$ and $\varepsilon$. In the considered Gaussian setting most of them are already satisfied. For a detailed list of the assumptions, we refer the reader to Appendix A.

2.2 General Case

The problem of estimating the parameter $\beta_{0}$ without additional assumptions on the distribution of the noise in model (2) is extensively studied in the literature, see, e.g., Doksum (1987); Han (1987); Sherman (1993); Abrevaya (1999a); Abrevaya (1999b); Abrevaya (2003); Cavanagh and Sherman (1998); Zhang (2013). Without any restriction on the expectation or variance of the noise, the function $h$ is not unique, since it can be modified by location and scale transformations. Thus, to ensure unique identification, we assume that there exists a known $y_{0}$ such that $h(y_{0})=0$ and we scale the last element of $\beta_{0}$ to $1$, that is, $\beta_{0}=(\theta_{0},1)$.

To simplify the optimization, we employ a method introduced by Lin and Peng (2013) that utilizes the rank-based objective function

[TABLE]

Remark 2.4.

Lin and Peng (2013) used the assumption $\left\lVert\beta_{0}\right\rVert_{2}=1$ to ensure unique identification, which is equivalent to our assumption $\beta_{0}=(\theta_{0},1)$ up to rescaling.

To obtain a sparse solution the authors focused on a penalized version of $S(\beta)$ . However, as we are only interested in estimating the residuals and do not necessarily need sparsity, we adapted their analysis for the following simplified non-penalized estimator $\hat{\beta}:=(\hat{\theta},1)$ , where

[TABLE]

by setting their penalty term to zero.

Under some additional assumptions that mainly ensure smoothness of the distributions of $X$ and $\varepsilon$, Lin and Peng (2013) prove the existence of a local maximizer $\hat{\theta}$ of $S(\theta,1)$ with

[TABLE]

and, thus, $\hat{\beta}$ defines a consistent estimator for $\beta_{0}$ . This estimator uses only the rank information without requiring concrete knowledge of $h$ . A detailed list of the additional assumptions that ensure consistency can be found in Appendix A.

In a second step we estimate the function $h$ in order to subsequently obtain an estimate of the noise. We use the method introduced in Chen (2002), which is based on the rank correlation. To simplify the complex optimization of discrete objective functions, we employ a smoothed version introduced by Zhang (2013).

The smoothed rank correlation objective function is defined by

[TABLE]

where $d_{jy}:=\mathds{1}(Y_{j}\geq y)$ and $d_{iy_{0}}:=\mathds{1}(Y_{i}\geq y_{0})$ . Then we define an estimator of the function $h$ at $y$ via

[TABLE]

where $\Omega_{h}$ is an appropriate compact set.

In Theorem 4.1, Zhang (2013) establishes consistency of the proposed estimator for $h$ under a few assumptions that include $\sqrt{n}$-consistency of the involved estimator $\hat{\beta}$ and strict monotonicity of the function $h$, as well as some additional regularity assumptions. A detailed list of the additional assumptions can be found in Appendix A.

Without restricting the optimization space by $\Omega_{h}$ the problem (6) is ill-posed in the sense that $\hat{h}(y)\to\infty$ for $y=\max\{Y_{i}\}_{i=1}^{n}$ and $\hat{h}(y)\to-\infty$ for $y=\min\{Y_{i}\}_{i=1}^{n}$ . To circumvent the issue of choosing a proper compact set $\Omega_{h}$ , we added an $L_{2}$ regularization term and optimized over $\mathbb{R}$ , that is,

[TABLE]

In the experiments we used the regularization parameter $\lambda=10^{-3}$ , which turned out to be small enough to not affect the estimated values significantly and at the same time bounded the objective function for the observed extremes.

3 LEARNING PNL MODELS

By combining the previously introduced rank-based estimators of the functional relations in PNL models we obtain estimates of the stochastic error terms. Using these rank-based estimated error terms, we propose a routine to learn the underlying causal structure by recursively identifying sink nodes via independence testing. Further, we show that our proposed routine consistently recovers a valid causal ordering under identifiability assumptions.

Suppose we observe data in form of $n$ independent copies $X_{1},\dots,X_{n}$ from a random vector $X:=(X^{(1)},\dots,X^{(m)})$ . We assume that $X$ follows a PNL causal model, that is, the data generating process is defined by the structural equations

$$X^{(k)}=f^{(k)}\big(g^{(k)}(X^{(\textbf{PA}_{k})})+\varepsilon^{(k)}\big),\qquad k=1,\dots,m,$$

where $\textbf{PA}_{k}$, called the parents of $X^{(k)}$, are a subset of $\{1,\dots,m\}\setminus\{k\}$. The causal perspective stems from viewing those equations as making assignments. Each variable on the left-hand side is assigned the value specified on the right-hand side, given by the value of its parents and a stochastic error term. The causal structure inherent in such structural equations is naturally represented by a directed graph $\mathcal{G}^{0}$, where edges indicate which other variables each variable causally depends upon. As in related work, we assume the corresponding directed graph to be acyclic (DAG). The noise variables $\{\varepsilon^{(k)}\}_{k=1}^{m}$ are assumed to be mutually independent and $\varepsilon^{(k)}\perp\!\!\!\perp X^{(\textbf{PA}_{k})}$ for each $k=1,\dots,m$. The main ansatz for inferring the causal structure is to leverage the independence structure of the stochastic noise $\varepsilon^{(k)}$ and a correctly specified parent set, that is, $\varepsilon^{(k)}$ is independent of all $X^{(j)}$ that precede $X^{(k)}$ in at least one true causal ordering of the underlying graph.

We focus on inferring the causal ordering to reduce the computational burden; however, the framework can easily be adapted to infer the specific causal graph structure by pruning redundant edges. The causal ordering of a graph $\mathcal{G}^{0}$ is given by a permutation $\pi$ of $\{1,\dots,m\}$ such that, if there exists a directed edge from node $\pi(i)$ to node $\pi(j)$ in the graph, then $i<j$. We emphasize that the causal ordering for a given graph is not necessarily unique, but each causal ordering $\pi$ corresponds to a unique, fully connected DAG $\mathcal{G}^{\pi}$, where $\mathcal{G}^{\pi}$ has a directed edge from node $\pi(i)$ to node $\pi(j)$ if and only if $i<j$. Thus, similar to Bühlmann et al. (2014), we can define the set of true causal orderings $\Pi^{0}$ for any DAG $\mathcal{G}^{0}$ as the set of all causal orderings $\pi$ that correspond to fully connected DAGs $\mathcal{G}^{\pi}$ which contain $\mathcal{G}^{0}$ as a sub-graph, that is

$$\Pi^{0}:=\{\pi:\ \mathcal{G}^{0}\subseteq\mathcal{G}^{\pi}\}.$$

Remark 3.1.

In general $\Pi^{0}$ contains more than one element and all elements correspond to valid causal orderings of the DAG $\mathcal{G}^{0}$ (e.g. in the extreme case of an empty graph, all permutations are true causal orderings).
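For small graphs the set of true causal orderings can simply be enumerated. The following sketch does so by brute force over all permutations; the function names and the three example graphs are illustrative assumptions.

```python
from itertools import permutations

def is_causal_ordering(pi, edges):
    """pi lists nodes from first to last; every edge (a, b) needs a before b."""
    position = {node: idx for idx, node in enumerate(pi)}
    return all(position[a] < position[b] for a, b in edges)

def true_causal_orderings(nodes, edges):
    # Brute force over all m! permutations, so only sensible for small m.
    return [pi for pi in permutations(nodes) if is_causal_ordering(pi, edges)]

nodes = [1, 2, 3]
empty = []                    # no edges: every permutation is valid
chain = [(1, 2), (2, 3)]      # 1 -> 2 -> 3
fork = [(1, 2), (1, 3)]       # 2 <- 1 -> 3
```

The empty graph on three nodes yields all six permutations, the chain a single ordering, and the fork two orderings.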

In order to apply the previously introduced rank-based regression methods, we assume that all functions $f^{(k)}$ in the data generating PNL causal model are continuous and strictly increasing, and thus, we can define their inverse via $h^{(k)}:=(f^{(k)})^{-1}$. Further, we assume that all $g^{(k)}$ are linear and that the distribution of every stochastic error $\varepsilon^{(k)}$ is continuous. We emphasize again that our method is applicable to nonlinear functional relations by means of basis expansions.

Put together, each structural equation, i.e. each cause-effect relation, corresponds to a PNL regression model (1) as introduced in the previous section. That is, the data generating process follows

[TABLE]

where the non-descendants $\textbf{ND}_{k}$ are given by all nodes that precede node $k$ in at least one true causal ordering and $\varepsilon^{(k)}$ is independent of $X^{(\textbf{ND}_{k})}$. Note that the entries in $\beta^{(k)}$ which do not correspond to parents of node $k$ are simply zero. To leverage the independence $\varepsilon^{(k)}\perp\!\!\!\perp X^{(\textbf{ND}_{k})}$, we define

[TABLE]

For every sink node $k$ in the graph, we have $\textbf{ND}_{k}=\{1,\dots,m\}\setminus\{k\}$, and thus the noise $\varepsilon^{(k)}$ is independent of $X^{(-k)}$. Moreover, if node $k$ is not a sink node in the graph, then the noise $\varepsilon^{(k)}$ is not independent of $X^{(-k)}$, since $X^{(-k)}$ contains at least one child of $k$. Thus, we can recursively identify a sink node using the Hilbert-Schmidt independence criterion (HSIC; Gretton et al., 2005) as a measure of independence between the estimated noise $\hat{\varepsilon}^{(k)}$ and the remaining nodes $X^{(-k)}$.

We propose the following routine to learn one of the valid causal orderings of the graph. First, we utilize the rank-based estimators $\hat{h}^{(k)}$ and $\hat{\beta}^{(k)}$ , introduced in the previous section, and estimate the noise via

$$\hat{\varepsilon}^{(k)}:=\hat{h}^{(k)}(X^{(k)})-\big(X^{(-k)}\big)^{T}\hat{\beta}^{(k)}.$$

We repeat this noise estimation for all remaining nodes $k\in\{1,\dots,m\}$ and subsequently calculate the HSIC test statistic between the estimated noises and the observed values of the remaining nodes $X^{(-k)}$ , that is

[TABLE]

We determine the node which leads to the minimal test statistic as a sink node, that is, our proposed sink node estimator is defined by

$$\hat{\pi}(m):=\operatorname*{arg\,min}_{k\in\{1,\dots,m\}}t_{k}.$$

In the next step we remove $\hat{\pi}(m)$ from the set $\{1,\dots,m\}$ and repeat the sink node identification procedure to estimate $\hat{\pi}(m-1)$ . Thus, recursively we obtain an estimate for the causal ordering

[TABLE]
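The recursion described above can be sketched as a short loop. The sketch below leaves the dependence statistic abstract: `dependence_score` is a hypothetical callable standing in for the HSIC statistic between the estimated noise of a node and the remaining nodes, and the toy score at the end is only there to exercise the loop.

```python
def estimate_causal_order(nodes, dependence_score):
    """Recursively pick the node with the smallest dependence score as sink.

    dependence_score(k, others) is a user-supplied callable (hypothetical
    interface) returning a statistic such as HSIC between the estimated
    noise of node k and the remaining nodes `others`.
    """
    remaining = list(nodes)
    reversed_order = []
    while remaining:
        scores = {k: dependence_score(k, [j for j in remaining if j != k])
                  for k in remaining}
        sink = min(scores, key=scores.get)
        reversed_order.append(sink)
        remaining.remove(sink)
    return reversed_order[::-1]  # first entry is pi_hat(1), last is pi_hat(m)

# Toy score for the chain 1 -> 2 -> 3: a node scores 0 only if none of its
# children is still among the remaining nodes.
children = {1: {2}, 2: {3}, 3: set()}
toy_score = lambda k, others: float(len(children[k] & set(others)))
order = estimate_causal_order([1, 2, 3], toy_score)
```

In an actual implementation the score would refit the rank-based estimators on the remaining nodes in every round, which is where most of the computation time is spent.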

In the following Theorem we prove consistency of our proposed estimation routine for the causal ordering. It is clear that if at any step our method fails to correctly identify a remaining sink node, then it fails to estimate a valid causal ordering. Thus, we must require sink node identifiability from the joint distribution in order to ensure consistency of the estimated causal order. The following assumption (A) formalizes this intuition.

(A) For each $k\in[1,m]$ and $A\subset[1,m]\setminus\{k\}$ that contains at least one child of $X^{(k)}$ as well as for all strictly increasing, continuous functions $h:\mathbb{R}\to\mathbb{R}$ and for all $\beta\in\mathbb{R}^{|A|}$ , there exists a constant $\xi>0$ , such that

[TABLE]

where

[TABLE]

Theorem 3.1.

Under assumption (A) and consistency of the employed estimators $\hat{h}^{(k)}$ and $\hat{\beta}^{(k)}$ we have

[TABLE]

The proof can be found in Appendix D.

Remark 3.2.

Location and scale transformations of the noise variables can be matched by transformations in the functions $h^{(k)}$. However, these transformations do not change the dependence structure, and thus, without loss of generality, we can assume that the location and scale assumptions in Section 2 are satisfied.

4 SIMULATIONS

In this section we present the results of a simulation study that compares the performance of our algorithm with existing causal learning methods on various simulated data sets and validates the consistency results experimentally. If necessary in a specific application, our method can easily be extended to infer the full causal graph structure; however, this vastly increases the computation time, similar to the related existing methods. Since the main differences between the algorithms already come into play during the causal order estimation procedure, we compared the performance on this task alone. Our experiments were designed as follows. We randomly sampled a causal structure with $m$ nodes ($m=4$ and $m=7$) from Erdős–Rényi directed acyclic graphs with edge probability $\frac{2}{m-1}$. Thus, the expected total number of edges in the graph is $m$. We generated data according to the corresponding post-nonlinear model using the following structural relations

[TABLE]

for $k=1,\dots,m,$ with different noise distributions (standard Normal $\mathcal{N}(0,1)$ , standard Gumbel $Gumbel(0,1)$ or standard Logistic $Logistic(0,1)$ ). Further, $\beta_{1j}$ and $\beta_{2j}$ are sampled from a uniform distribution with either a low range from $-10$ to $10$ , representing a weak signal setting (low signal-to-noise ratio (SNR)), or a higher range from $-100$ to $100$ , representing a strong signal setting (high SNR).
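The graph sampling step and the expected edge count can be sketched as follows. The construction used here, a uniformly random node ordering with each forward pair included independently with probability $\frac{2}{m-1}$, is one common way to sample Erdős–Rényi DAGs and is an assumption; the text does not specify the sampler that was used.

```python
import random

def sample_er_dag(m, rng):
    """Sample a DAG on nodes 0..m-1 with edge probability 2 / (m - 1)."""
    p = 2.0 / (m - 1)
    order = list(range(m))
    rng.shuffle(order)  # random causal ordering
    edges = []
    for a in range(m):
        for b in range(a + 1, m):
            if rng.random() < p:
                edges.append((order[a], order[b]))  # edges point forward in the ordering
    return order, edges

def expected_edge_count(m):
    # m * (m - 1) / 2 candidate pairs, each present with probability 2 / (m - 1).
    return (m * (m - 1) / 2.0) * (2.0 / (m - 1))

rng = random.Random(0)
order, edges = sample_er_dag(7, rng)
```

Since all edges point forward in `order`, the sampled graph is acyclic by construction and `order` is one of its causal orderings.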

Remark 4.1.

We highlight that the inner function $g^{(k)}$ is quadratic in the parents of $X^{(k)}$ . In the experiments, we used polynomial basis expansions of order two to linearly model the inner function by specifying not only parents but also their squares.
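The basis expansion mentioned in this remark amounts to appending the squared parents as additional regressors. A minimal sketch, assuming a quadratic inner function without cross terms of the form $\sum_{j}\beta_{1j}x_{j}+\beta_{2j}x_{j}^{2}$ (the function names and numbers are hypothetical):

```python
def quadratic_expansion(parent_values):
    """Map (x_1, ..., x_p) to (x_1, ..., x_p, x_1^2, ..., x_p^2)."""
    return list(parent_values) + [x * x for x in parent_values]

def inner_function(parent_values, beta1, beta2):
    # Quadratic inner function written directly: sum_j beta1_j x_j + beta2_j x_j^2.
    return sum(b1 * x + b2 * x * x for x, b1, b2 in zip(parent_values, beta1, beta2))

x = [0.5, -2.0]
beta1, beta2 = [3.0, -1.0], [4.0, 0.5]
features = quadratic_expansion(x)
# The quadratic function of x is linear in the expanded features.
linear_value = sum(f * b for f, b in zip(features, beta1 + beta2))
```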

Using this process, we generated $100$ independent data sets and estimated the causal ordering. We compared our results with the classical RESIT method for additive noise models (Peters et al., 2014) and the AbPNL method for post-nonlinear models (Uemura et al., 2022), restricted to the respective causal order estimation parts. Since the causal ordering of a given DAG is not necessarily unique, we report as a measure of performance for an estimated causal ordering $\hat{\pi}$ the number of directed edges $\hat{\pi}(i)\to\hat{\pi}(j)$ in the true graph $\mathcal{G}$ with $j<i$, that is

[TABLE]

This measure equals zero when $\hat{\pi}$ is a valid causal ordering and achieves its maximum, the number of edges in $\mathcal{G}$ , when $\hat{\pi}$ is a reversed causal ordering.
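This performance measure is straightforward to compute from an edge list. The sketch below counts the true edges that point backwards with respect to an estimated ordering; the example graph and the function name are hypothetical.

```python
def count_reversed_edges(pi_hat, edges):
    """Number of true edges (a, b) whose head b precedes the tail a in pi_hat."""
    position = {node: idx for idx, node in enumerate(pi_hat)}
    return sum(1 for a, b in edges if position[b] < position[a])

edges = [(1, 2), (1, 3), (2, 4)]  # hypothetical true DAG on four nodes
```

For this graph both `[1, 2, 3, 4]` and `[1, 3, 2, 4]` give zero, whereas the fully reversed ordering `[4, 3, 2, 1]` gives three, the number of edges.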

Figure 1 shows the performance of the Gaussian method introduced in Subsection 2.1, named RankG, compared to RESIT and AbPNL on 4- and 7-dimensional causal graphs in settings with Gaussian, Gumbel (standard extreme value distribution) and Logistic noise. Dashed lines indicate the weak signal setting and solid lines depict the strong signal setting. We plot the mean of our performance measure, the number of wrongly oriented edges in the fully connected DAG corresponding to the estimated causal ordering, over all 100 data sets against the sample size.

Our proposed RankG method outperforms the competition in almost all considered settings, especially in the low SNR setting. It might seem counterintuitive that the RankG method performs better in a low SNR setting than in a high SNR setting; however, the noise drives the identification in the rank-based learning procedure in PNL models. Higher noise relative to the signal strength induces more changes in the ranks that propagate through the graph and thus leads to better performance of the rank-based estimation methods.

Furthermore, in the low SNR setting our computational results support the theoretical consistency results and our method seems to recover a valid causal ordering even in moderate sample sizes.

The results in Figure 1 show that even under noise misspecification, that is, in the Gumbel and Logistic noise cases, the RankG method performs best. This might indicate some robustness of our proposed method for causal order estimation, even though we do not recover the true noise under misspecification (see Figure 6).

We conducted similar experiments to compare the performance of our proposed general method introduced in Subsection 2.2, named RankS, however with lower sample sizes for computational reasons. Figure 2 shows the results of the different competitors RankS, RankG, RESIT and AbPNL for 4-dimensional graphs with Gaussian, Gumbel and Logistic noise. Our proposed method RankS performs best in all considered sample sizes and all noise cases, except for the weak signal settings where RankG performs better. This might stem from the fact that the pairwise rank likelihood used in RankG better approximates the marginal rank likelihood.

In additional experiments, we investigated the behavior of the introduced methods for more complex functional relations $g$ , namely a polynomial of degree 4. Similar to the experiments before, we sampled data from Erdős–Rényi DAGs, however, with the structural relations

[TABLE]

for $k=1,\dots,m,$ with different noise distributions and $\beta_{1j},\beta_{2j},\beta_{3j}$ and $\beta_{4j}$ sampled with low or high SNR. Figure 3 shows the resulting mean performance measures for the different methods RankS, RankG, RESIT and AbPNL over 100 data sets. As expected, the task becomes more challenging by increasing the complexity of the functional parameters, since it is difficult to estimate the functional relations in the first place. However, even in the considered low sample sizes, our proposed RankS method seems to detect some causal structure and outperform the competition.

Further, we analysed the behaviour of the used functional estimators introduced in Section 2 by performing the following experiments. We generated 100 independent data sets of sample size 500 according to model (2) with a 3-dimensional predictor $X$ , standard Gaussian or Gumbel noise, cubic function $h(y)=y^{3}$ and linear parameters $\beta_{0}=(10,5,1)$ . Then we estimated the functional parameter $\beta_{0}$ and the function $h$ with the introduced rank-based methods.
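A data set of this kind can be generated as in the following sketch, which assumes that model (2) has the form $h(Y)=X^{T}\beta_{0}+\varepsilon$ and that the covariates are standard Gaussian. The covariate distribution is not stated in the text and is an assumption, as are the function name and the seed.

```python
import math
import random

def simulate_pnl_regression(n, beta0, rng):
    """Draw (X_i, Y_i, eps_i) with h(Y) = X^T beta0 + eps and h(y) = y^3."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta0]   # covariate law is an assumption
        eps = rng.gauss(0.0, 1.0)                  # standard Gaussian noise
        z = sum(xi * bi for xi, bi in zip(x, beta0)) + eps
        y = math.copysign(abs(z) ** (1.0 / 3.0), z)  # real cube root, i.e. h^{-1}(z)
        data.append((x, y, eps))
    return data

beta0 = [10.0, 5.0, 1.0]
rng = random.Random(1)
sample = simulate_pnl_regression(500, beta0, rng)
# With the true h and beta0 the noise is recovered up to rounding error.
max_error = max(abs(y ** 3 - sum(xi * bi for xi, bi in zip(x, beta0)) - eps)
                for x, y, eps in sample)
```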

Figure 4 shows the estimation of the function $h$ for one representative result across the 100 replications. The red lines indicate the true value of the function $h$, while the black dots indicate the pointwise estimates. We notice that the rank-based method that relies on the normality assumption (used in RankG) fails to correctly estimate the functional relation at the extremes under Gumbel noise. This is due to the misspecified tail probability structure. In contrast, the general estimation method employed in RankS is not influenced by the specific underlying noise distribution. However, in the Gaussian noise case we notice a small estimation bias, which can be regulated with the hyperparameter $\lambda$ in the estimation procedure. Further, from the results across all 100 data sets, we noticed that the variance of the functional estimate across the data sets is higher using the general estimation methods in RankS.

Figure 5 shows the estimation of the first two entries of $\beta_{0}$. Recall that in the general RankS method we fixed the last entry in our estimation of $\beta_{0}$. The box plots show the estimated values of $\beta_{0}$ across the 100 data sets and red dots indicate the true values. We see again that RankG estimates the parameter $\beta_{0}$ with a bias in the misspecified Gumbel noise setting, similar to the estimation of the function $h$. However, in the Gaussian setting the RankG estimate of $\beta_{0}$ has a lower variance than the RankS estimate, and in all other cases the median estimate corresponds to the true value.

Figure 6 shows the estimated noise by combining both estimation results. We plot the estimated noise against the true values for one representative data set (similar to Figure 4). Red lines correspond to a perfect estimation. The RankG method inherits the behaviour from both estimation parts and fails to correctly estimate the extreme noise cases. However, it outperforms the RankS method in the Gaussian setting.

5 CONCLUSION

We proposed a new routine for causal discovery in multivariate post-nonlinear structural equation models. Our method disentangles the two tasks of estimating the functional relations and learning the causal structure by employing rank-based methods for the first task. Thus, our proposed routine is less susceptible to overfitting issues exhibited by the existing methods that rely on minimizing dependence and subsequently testing for independence.

We introduced PNL rank regression methods to learn the functional relations in PNL models and subsequently estimate the residuals. As a special case, we first considered Gaussian noise and used pairwise rank likelihoods in a computationally fast algorithm, whereas, for the general noise case, we employed a smoothed version of rank correlations to obtain estimates of the functional relations. While our presentation focused on linearity in the inner functional relation to simplify the theoretical analysis, the framework includes nonlinear relations by means of basis expansions. Employing the introduced estimators of the functional relations, we proposed a causal learning routine to recursively identify sink nodes based on independence tests with the estimated residuals. Further, we prove consistent causal order recovery of our proposed routine under identifiability assumptions and consistency of the employed functional estimators.

We validated our theoretical findings in a simulation study that showed that our proposed routine outperforms the competition and is able to recover a valid causal ordering even in moderate sample sizes.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 83818), the German Federal Ministry of Education and Research, and the Bavarian State Ministry for Science and the Arts. The authors of this work take full responsibility for its content.

Appendix A ADDITIONAL ASSUMPTIONS

In this section, we list all additional assumptions that are mentioned in the main paper in detail.

Cuzick (1988) employs the following four additional assumptions to ensure the consistency of $\hat{h}_{G}$ defined in (5). Thus, the following assumptions together with the modelling assumptions and assumption (A) in the main paper ensure the consistency of the proposed RankG method.

AG1

Let $G_{n}(x):=\frac{1}{n}\sum_{j=1}^{n}\mathds{1}\{X_{j}\leq x\}$. Assume $Z_{n}$ has distribution $G_{n}$; then $|Z_{n}|^{t}$ is uniformly integrable for some $t$ which is specified in the next assumption (at least $t>4$).

AG2

For $F_{\beta}(z)$ defined in (4), there exists finite $K$ such that

[TABLE]

for all $n$ and $\beta\in B$, where $B$ is a neighborhood of $\beta_{0}$ and $\alpha+t^{-1}<\frac{1}{2}$.

AG3

The function $Z(\beta):=\mathbb{E}[X|X^{T}(\beta-\beta_{0})=0]$ is $L_{2}$ continuous as $\beta\to\beta_{0}$.

AG4

The following inequality holds

[TABLE]

The main assumptions above essentially correspond to moment conditions on the distribution of $X$ .

Following Lin and Peng (2013) (AS1-AS4) and Zhang (2013) (AS5-AS9), we list the additional assumptions that are required to ensure consistency of the estimators in Section 2.2. Thus, in combination with the modelling assumptions and assumption (A) in the main paper, the RankS method is consistent.

AS1

Let $g$ be the density function of $(X_{j}-X_{i})^{T}\beta_{0}$ and $F$ the distribution function of $\varepsilon_{j}-\varepsilon_{i}$. Define the functions $\Gamma(s):=\mathbb{E}[(X_{j}-X_{i})^{T}|(X_{j}-X_{i})^{T}\beta_{0}=s]$ and $\Omega(s):=F(s)\Gamma(s)g(s)$. Then $\Omega^{\prime}(0)$ is nonsingular.

AS2

The density $g$ is positive with a continuous second derivative on its corresponding compact support.

AS3

The function $F$ has a continuous second derivative on its corresponding support.

AS4

The random variable $X$ is bounded with a compact support.

AS5

The true parameter $\theta_{0}$ is an interior point of a compact subset $\Theta\subset\mathbb{R}^{m-1}$.

AS6

The support of $X$ is not contained in a linear subspace of $\mathbb{R}^{m}$. Moreover, conditional on the first $m-1$ components of $X$, the last component of $X$ has a density function with respect to the Lebesgue measure.

AS7

Define

[TABLE]

and let

[TABLE]

There exists a neighborhood $\mathcal{N}$ of $\theta_{0}$ such that for each pair of $(y,x)$ in the support of $(Y,X)$ the following hold

• The second derivatives of $\tau(y,x,\theta)$ with respect to $\theta$ exist in $\mathcal{N}$.

• There exists an integrable function $M(y,x)$ such that for all $\theta$ in $\mathcal{N}$

[TABLE]

• $\mathbb{E}[(|\nabla_{1}|\tau(Y,X,\theta_{0}))^{2}]<\infty$.

• $\mathbb{E}[|\nabla_{2}|\tau(Y,X,\theta_{0})]<\infty$.

• The matrix $\mathbb{E}[\nabla_{2}\tau(Y,X,\theta_{0})]$ is strictly negative definite.

AS8

There exists $\epsilon^{*}>0$ and $y_{1},y_{2}$ in the support of $Y$ such that $[h(y_{1}-\epsilon^{*}),h(y_{2}+\epsilon^{*})]$ is contained in a compact interval.

AS9

For $\omega_{1}=(x^{1},y^{1})$ , $\omega_{2}=(x^{2},y^{2})$ and $W=(X,Y)$ we define

[TABLE]

Further, we define

[TABLE]

then

[TABLE]

is negative for each $y\in[y_{1},y_{2}]$ and uniformly bounded away from 0.

Zhang (2013) shows uniform consistency on the interval $[y_{1},y_{2}]$ defined in the assumptions AS8-AS9.

Appendix B PROOF OF PROPOSITION 2.1

To prove Proposition 2.1 we employ the following two Lemmas.

Lemma B.1.

For all $z\in\mathbb{R}$ we have

$$\phi^{\prime}(z)\Phi(z)-(\phi(z))^{2}<0,$$

where $\Phi$ and $\phi$ denote the CDF and PDF of the standard normal distribution.

Proof.

With $h(z):=\phi^{\prime}(z)\Phi(z)-(\phi(z))^{2}$ , we show that $h(z)<0$ . Substituting the derivative of $\phi$ we have

$$h(z)=-z\phi(z)\Phi(z)-(\phi(z))^{2}=\phi(z)\big(-z\Phi(z)-\phi(z)\big).$$

Since $\phi(z)>0$ for all $z\in\mathbb{R}$ it remains to show that $g(z):=-z\Phi(z)-\phi(z)<0$ . For $z\geq 0$ this is clear. Thus we consider the case $z<0$ . We have for the derivative of $g$

$$g^{\prime}(z)=-\Phi(z)-z\phi(z)-\phi^{\prime}(z)=-\Phi(z)-z\phi(z)+z\phi(z)=-\Phi(z)<0.$$

Therefore, $g$ is a strictly decreasing function. The limit of $g(z)$ for $z\to-\infty$ is given by

$$\lim_{z\to-\infty}g(z)=\lim_{z\to-\infty}\big(-z\Phi(z)-\phi(z)\big)=0,$$

and the claim follows. ∎
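As a sanity check of Lemma B.1, the inequality can be evaluated numerically on a grid. The sketch below uses $\phi^{\prime}(z)=-z\phi(z)$ and computes $\Phi$ via the complementary error function so that the lower tail is accurate; the grid range and helper names are arbitrary choices for illustration.

```python
import math

def phi(z):
    # Standard normal PDF.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    # Standard normal CDF via erfc for accuracy at negative z.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def lemma_b1_lhs(z):
    # phi'(z) = -z * phi(z), so the expression equals phi(z) * (-z * Phi(z) - phi(z)).
    return -z * phi(z) * Phi(z) - phi(z) ** 2

grid = [k / 10.0 for k in range(-80, 81)]
values = [lemma_b1_lhs(z) for z in grid]
```

All grid values are negative, in line with the lemma; far in the tails the expression underflows to zero in floating point, which is why the grid is kept to $[-8,8]$.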

Lemma B.2.

The function

$$f(x):=\log\Phi(c^{T}x),\qquad x\in\mathbb{R}^{m},$$

is concave, where $\Phi$ is the CDF of the standard normal distribution and $c\in\mathbb{R}^{m}$ is a nonzero constant. Moreover, $v^{T}\nabla^{2}f(x)v=0$ for a vector $v\in\mathbb{R}^{m}$ and Hessian matrix $\nabla^{2}f(x)$ if and only if $v^{T}c=0$ .

Proof.

The function $f$ is twice differentiable, and thus for the first part of the Lemma it is enough to show that the Hessian of $f(x)$ is negative semi-definite. The gradient of $f$ is given by

$$\nabla f(x)=\frac{\phi(c^{T}x)}{\Phi(c^{T}x)}\,c,$$

where $\phi$ is the PDF of the standard normal distribution. Thus, the Hessian is

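$$\nabla^{2}f(x)=\frac{\phi^{\prime}(c^{T}x)\Phi(c^{T}x)-(\phi(c^{T}x))^{2}}{(\Phi(c^{T}x))^{2}}\,cc^{T}.$$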

So, for any $v\in\mathbb{R}^{m}$ we have

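$$v^{T}\nabla^{2}f(x)v=\frac{\phi^{\prime}(c^{T}x)\Phi(c^{T}x)-(\phi(c^{T}x))^{2}}{(\Phi(c^{T}x))^{2}}\,(v^{T}c)^{2}\leq 0,$$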

where the last step follows from Lemma B.1 and the fact that $(v^{T}c)^{2}\geq 0$ .

Moreover, $v^{T}\nabla^{2}f(x)v=0$ if and only if $v^{T}c=0$ , which completes the proof. ∎

Employing the Lemmas we can prove Proposition 2.1.

Proof.

From (2.1) we have

[TABLE]

Thus, with Lemma B.2 we know that $\ell_{prl}(\beta)$ is a sum of concave functions. Since sums of concave functions are concave, $\ell_{prl}(\beta)$ is concave.

We show strict concavity by contradiction and thus assume that $\ell_{prl}(\beta)$ is not strictly concave. This implies that there exists a nonzero vector $v$ such that $v^{T}\nabla^{2}\ell_{prl}(\beta)v=0$ for the Hessian matrix $\nabla^{2}\ell_{prl}(\beta)$ of $\ell_{prl}(\beta)$. The Hessian operator is linear; thus, $\nabla^{2}\ell_{prl}(\beta)$ is a sum of Hessians, that is

[TABLE]

Lemma B.2 gives that $v^{T}\nabla^{2}\log\Phi\left(\frac{(X_{j}-X_{i})^{T}\beta}{\sqrt{2}}\right)v=0$ if and only if $v^{T}(X_{j}-X_{i})=0$. Since for each pair $i,j$ exactly one of $\mathds{1}(Y_{j}>Y_{i})$ and $\mathds{1}(Y_{j}\leq Y_{i})$ equals 1 and every summand is concave, $v^{T}\nabla^{2}\ell_{prl}(\beta)v=0$ implies that $v^{T}(X_{j}-X_{i})=0$ for all $i$ and $j$. Therefore, $X_{j}^{T}v=c$ for a constant $c\in\mathbb{R}$ for all $j$. Let $X^{(i)}=(X^{(i)}_{1},\dots,X^{(i)}_{n})$ be the sample vector of the $i$-th component of $X$ and define the matrix $\mathbf{X}:=[\mathbf{1},X^{(1)},\dots,X^{(m)}]\in\mathbb{R}^{n\times(m+1)}$. Then $\mathbf{X}$ does not have full column rank, since the nonzero vector $u=(-c,v^{T})^{T}$ satisfies $\mathbf{X}u=\mathbf{0}$.

However, if we take an arbitrary square sub-matrix of $\mathbf{X}$ and compute the determinant, we obtain a non-zero polynomial in some of the entries of $X^{(1)},\dots,X^{(m)}$. The lemma in Okamoto (1973) states that such a polynomial is zero only on a set of Lebesgue measure zero. Therefore, the rank of $\mathbf{X}$ is almost surely $\min\{n,m+1\}=m+1$, which contradicts the equality $\mathbf{X}u=\mathbf{0}$ and, thus, completes the proof. ∎

Appendix C PROOF OF THEOREM 2.1

To keep the formulas readable, we define $U_{ij}:=\frac{X_{i}-X_{j}}{\sqrt{2}}$ . This gives

[TABLE]

For the proof, we use the Taylor expansion of $\ell_{prl}(\hat{\beta}_{prl})$ around $\beta_{0}$ and use properties of the gradient of $\ell_{prl}(\beta)$ at $\beta_{0}$, which are established in the following Lemmas.

Lemma C.1.

The gradient $\nabla_{\beta}\ell_{prl}(\beta_{0})$ converges to zero almost surely, that is

[TABLE]

Proof.

The gradient

[TABLE]

is a U-statistic with kernel

[TABLE]

Note that $\psi((X_{1},Y_{1}),(X_{2},Y_{2}))$ is symmetric as there are no ties among the $Y_{i}$ ($Y$ has a continuous distribution).

We use the generalization of the classical Strong Law of Large Numbers (SLLN) for U-statistics (e.g., Theorem A in Section 5.4 of Serfling, 1980). The expectation of the kernel $\psi$ is

[TABLE]

where the first equalities follow from the linearity of the expectation and the tower rule of the expectation, i.e., $\mathbb{E}[Q(X,Y)]=\mathbb{E}[\mathbb{E}[Q(X,Y)|X]]$ for any function $Q$. The third equality is a classical property of the indicator function. The fourth equality uses the monotonicity of the function $h$. The fifth step is just a cancellation of equal terms in the fractions. Finally, the sixth equality follows from the fact that $\phi(x)=\phi(-x)$ for the probability density function $\phi$ of the standard normal distribution.

We show that the absolute value of the kernel $\psi$ has a finite expectation, since

[TABLE]

where we used the triangle inequality for the absolute value in the first step and the other steps are similar to the calculation for the expectation of the kernel $\psi$ . The last quantity is finite since the probability density function of the standard normal distribution is bounded.

Thus, the SLLN yields

[TABLE]

which completes the proof of the Lemma. ∎

Lemma C.2.

Let $a>0$ . For all $\beta\in\mathbb{R}^{m}$ , such that $\left\lVert\beta-\beta_{0}\right\rVert_{2}=a$ , and $n>m$ we have

[TABLE]

Proof.

The Taylor expansion of $\ell_{prl}(\beta)$ around $\beta_{0}$ gives

[TABLE]

where $\left\lVert\beta^{*}-\beta_{0}\right\rVert_{2}\leq\left\lVert\beta-\beta_{0}\right\rVert_{2}=a$, i.e., $\beta^{*}$ is in the closed ball around $\beta_{0}$ with radius $a$. Clearly, $\nabla_{\beta}^{2}\ell_{prl}(\beta^{*})$ is continuous with respect to $\beta^{*}$. Moreover, from Proposition 2.1 we know that the maximum eigenvalue of $\nabla_{\beta}^{2}\ell_{prl}$ is negative. Thus, by continuity and compactness, the largest eigenvalue of $\nabla_{\beta}^{2}\ell_{prl}$ attains a maximum $\lambda_{max}<0$ on the closed ball with center $\beta_{0}$ and radius $a$.

Therefore, we have

[TABLE]

Using the previous Lemma C.1 the remaining term $(\beta-\beta_{0})^{T}\nabla_{\beta}\ell_{prl}(\beta_{0})$ asymptotically vanishes in probability and thus the claim follows. ∎

In the following we prove Theorem 2.1, i.e., $\hat{\beta}_{prl}-\beta_{0}=o_{P}(1)$.

Proof.

From Lemma C.2 we know that for any fixed $a>0$, there exists a maximum of $\ell_{prl}(\beta)$ within the closed ball around $\beta_{0}$ with radius $a$ with probability tending to 1. However, $\hat{\beta}_{prl}$ maximizes $\ell_{prl}(\beta)$, and thus

[TABLE]

which completes the proof. ∎

Appendix D PROOF OF THEOREM 3.1

To prove the Theorem we employ the following Lemma.

Lemma D.1.

Under the assumption ( $\mathbf{A}$ ) we have

• If node $k$ is not a sink node, then

[TABLE]

• If node $k$ is a sink node, then

[TABLE]

Proof.

First, assume that $k$ is not a sink node. Denoting $\hat{\varepsilon}:=\hat{h}(X^{(k)})-\left(X^{(-k)}\right)^{T}\hat{\beta}^{(k)}$, assumption ($\mathbf{A}$) gives that $HSIC(\mathbb{P}^{X^{(-k)},\hat{\varepsilon}})>\xi$. On the other hand, Theorem 3 in Gretton et al. (2005) gives that for all $\delta>0$, with probability at least $1-\delta$ we have

[TABLE]

where $\alpha$ and $C$ are constants. Thus,

[TABLE]

Second, assume that $k$ is a sink node. Then $\{X_{j}^{(k)},X_{j}^{(-k)}\}_{j=1}^{n}$ is an i.i.d. sample from the PNL regression model (1). Using the (point-wise) consistency of the estimators, we have

[TABLE]

Thus, we obtain

[TABLE]

From the definition of $t_{k}$ and the decomposition of the HSIC, we have

[TABLE]

where

[TABLE]

and $\hat{Q}_{1}:=\frac{1}{n^{2}}\sum_{i,j=1}^{n}K_{ij}\hat{L}_{ij},\quad\hat{Q}_{2}:=\frac{1}{n^{4}}\sum_{i,j,q,r=1}^{n}K_{ij}\hat{L}_{qr},\quad\hat{Q}_{3}:=\frac{2}{n^{3}}\sum_{i,j,q=1}^{n}K_{ij}\hat{L}_{iq}$ .
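As a numerical sanity check, the three sums can be evaluated directly and compared with the trace form of the biased HSIC estimator. The following is a minimal sketch, assuming that the decomposition reads $\hat{Q}_{1}+\hat{Q}_{2}-\hat{Q}_{3}$ as in Gretton et al., (2005), that $\hat{Q}_{2}$ and $\hat{Q}_{3}$ take their standard forms with summands $K_{ij}\hat{L}_{qr}$ and $K_{ij}\hat{L}_{iq}$ , and that $K$ is a unit-bandwidth Gaussian kernel like $L$ . The function names `gaussian_gram`, `hsic_terms` and `hsic_biased` are illustrative and not taken from the paper.

```python
import numpy as np


def gaussian_gram(z):
    """Gram matrix with entries exp(-||z_i - z_j||^2) (unit bandwidth, assumed)."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    sq_dist = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist)


def hsic_terms(x, eps):
    """Return (Q1_hat, Q2_hat, Q3_hat) for samples x (n x d) and residuals eps (n,)."""
    K = gaussian_gram(x)
    L = gaussian_gram(eps)
    n = K.shape[0]
    q1 = np.sum(K * L) / n**2                                # (1/n^2) sum_ij K_ij L_ij
    q2 = K.sum() * L.sum() / n**4                            # (1/n^4) sum_ijqr K_ij L_qr
    q3 = 2.0 * np.sum(K.sum(axis=1) * L.sum(axis=1)) / n**3  # (2/n^3) sum_ijq K_ij L_iq
    return q1, q2, q3


def hsic_biased(x, eps):
    """Biased HSIC estimate assembled as Q1_hat + Q2_hat - Q3_hat."""
    q1, q2, q3 = hsic_terms(x, eps)
    return q1 + q2 - q3
```

With the centering matrix $H=I-\frac{1}{n}\mathbf{1}\mathbf{1}^{T}$ , the value returned by `hsic_biased` agrees with $\frac{1}{n^{2}}\operatorname{tr}(KHLH)$ up to floating-point error, which is a convenient check on the index pattern of each sum.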

Similar to the proof of Theorem 2 in Teran Hidalgo et al., (2018) we will show that $t_{k}-HSIC(\mathbb{P}^{X^{(-k)},\varepsilon^{k}})=o_{P}(1)$ . From Lemma 1 in Gretton et al., (2005) we have

[TABLE]

where $L_{ij}:=\exp(-(\varepsilon_{i}^{(k)}-\varepsilon_{j}^{(k)})^{2}),\quad Q_{1}:=\mathbb{E}_{X_{1}^{(-k)},\varepsilon_{1}^{(k)},X_{2}^{(-k)},\varepsilon_{2}^{(k)}}[K_{12}L_{12}],\quad Q_{2}:=\mathbb{E}_{X_{1}^{(-k)},X_{2}^{(-k)}}[K_{12}]\mathbb{E}_{\varepsilon_{1}^{(k)},\varepsilon_{2}^{(k)}}[L_{12}]$ and $Q_{3}:=2\mathbb{E}_{X_{1}^{(-k)},\varepsilon_{1}^{(k)}}[\mathbb{E}_{X_{2}^{(-k)}}[K_{12}]\mathbb{E}_{\varepsilon_{2}^{(k)}}[L_{12}]]$ .

We show that $\hat{Q}_{1}-Q_{1}=o_{P}(1)$ . We have

[TABLE]

where we used $K_{jj}=\hat{L}_{jj}=1$ in the second equality. For $\delta>0$ , the Markov inequality gives

[TABLE]

Here, we used that $K_{ij}$ and $\hat{L}_{ij}$ are bounded by 1, and thus their variances are bounded. The number of terms in the quadruple sum that have exactly three different indices is of order $n^{3}$ , but the denominator is of order $n^{4}$ , which leads to the second $O(1)$ term. The last $O(1)$ comes from the fact that the number of terms in the quadruple sum that have exactly four different indices is of order $n^{4}$ .
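The counting argument can be verified by brute force for small $n$ . The sketch below assumes that the quadruple sum ranges over index pairs $(i,j)$ and $(q,r)$ with $i\neq j$ and $q\neq r$ ; this index set is an assumption, and the helper name `count_by_distinct` is illustrative.

```python
from itertools import product


def count_by_distinct(n):
    """Count tuples (i, j, q, r) with i != j and q != r by number of distinct indices."""
    counts = {2: 0, 3: 0, 4: 0}
    for i, j, q, r in product(range(n), repeat=4):
        if i != j and q != r:
            counts[len({i, j, q, r})] += 1
    return counts
```

Under this assumption the counts are $2n(n-1)$ , $4n(n-1)(n-2)$ and $n(n-1)(n-2)(n-3)$ for exactly two, three and four different indices, which are of order $n^{2}$ , $n^{3}$ and $n^{4}$ and sum to $n^{2}(n-1)^{2}$ .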

Using (7) and the continuous mapping theorem we obtain

[TABLE]

and since $\hat{L}_{12}$ is bounded, it is uniformly integrable and we obtain

[TABLE]

In a similar way, we obtain

[TABLE]

Using the fact that $K_{12}L_{12}$ is independent from $K_{34}L_{34}$ and employing the two equalities above, we have

[TABLE]

Thus, all terms in (9) are $o_{P}(1)$ . Moreover, from the above arguments we have $\mathbb{E}[K_{12}\hat{L}_{12}]-\mathbb{E}[K_{12}L_{12}]=o(1)$ and $\frac{1}{n^{2}(n-1)}\sum_{i\neq j}^{n}K_{ij}\hat{L}_{ij}=o(1)$ as $K_{ij}\hat{L}_{ij}$ is bounded and the denominator is of order $n^{3}$ . So, in (8) all the terms are $o_{P}(1)$ , which gives

[TABLE]

In the same fashion, one can show that $\hat{Q}_{2}-Q_{2}=o_{P}(1)$ and $\hat{Q}_{3}-Q_{3}=o_{P}(1)$ , which together prove that $t_{k}-HSIC(\mathbb{P}^{X^{(-k)},\varepsilon^{k}})=o_{P}(1)$ . Moreover, $HSIC(\mathbb{P}^{X^{(-k)},\varepsilon^{k}})=0$ , since $X^{(k)}$ is a sink node and thus the noise $\varepsilon^{k}$ is independent of the remaining nodes $X^{(-k)}$ . Thus,

[TABLE]

which completes the proof of the Lemma. ∎

Now we can prove Theorem 3.1.

For any set $A\subseteq\{1,2,\dots,m\}$ we denote the sub-graph of $\mathcal{G}^{0}$ over the nodes $A$ as $\mathcal{G}^{0}_{A}$ . Then

[TABLE]

The proof of Theorem 3.1 is completed if we show that

[TABLE]

since this implies

[TABLE]

Using assumption ( $\mathbf{A}$ ) and the recursive construction of $\hat{\pi}$ it is enough to prove (10) for $k=m$ , that is

[TABLE]

By Lemma D.1 we know that $t_{k}$ goes to zero in probability for sink nodes and is at least $\xi$ for other nodes. Thus,

[TABLE]

This completes the proof of the Theorem.
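The recursive construction of $\hat{\pi}$ used in the proof can be sketched as follows: among the remaining nodes, the one with the smallest statistic $t_{k}$ is declared the sink, removed, and the procedure is repeated on the rest. This is a minimal sketch and not the authors' implementation. The callback `score`, which stands in for computing $t_{k}$ on the current sub-graph, and the function name `recursive_sink_order` are hypothetical. The rank regression and the HSIC computation behind $t_{k}$ are not reproduced here.

```python
from typing import Callable, Hashable, List, Sequence


def recursive_sink_order(
    nodes: Sequence[Hashable],
    score: Callable[[Hashable, List[Hashable]], float],
) -> List[Hashable]:
    """Return an estimated causal order (sources first, sinks last).

    score(k, others) is a hypothetical stand-in for the statistic t_k
    computed for candidate sink k against the other remaining nodes.
    """
    remaining = list(nodes)
    reversed_order = []
    while len(remaining) > 1:
        scores = {k: score(k, [j for j in remaining if j != k]) for k in remaining}
        sink = min(remaining, key=scores.__getitem__)  # smallest t_k is taken as the sink
        reversed_order.append(sink)
        remaining.remove(sink)
    reversed_order.extend(remaining)  # the last remaining node is a source
    return reversed_order[::-1]
```

Selecting the minimizer matches the dichotomy in Lemma D.1: for large $n$ the statistic of a sink node is close to zero, while that of any other node stays above $\xi$ , so the minimizer is a sink with probability tending to 1.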

Appendix E NUMERICAL RESULTS

In this section, we report the simulation results in tabular form, together with their empirical standard deviations.

Tables 1, 2, 3 and 4, 5, 6 show the results of RankG, AbPNL and RESIT methods for 4-dimensional graphs with strong and weak signal settings, for Gaussian, Gumbel and Logistic noises, respectively. Tables 7, 8, 9 and 10, 11, 12 show the results for 7-dimensional graphs. In both cases the sample sizes are 100, 500, 1000, 1500, and 2000.

Moreover, Tables 13, 14, 15 and 16, 17, 18 show the results of RankS, RankG, AbPNL and RESIT methods for 4-dimensional graphs with strong and weak signal settings, respectively. Tables 19, 20, 21 and 22, 23, 24 show the results of RankS, RankG, AbPNL and RESIT methods for 4-dimensional graphs where the function $g$ is polynomial of degree 4 with strong and weak signal settings, respectively. In both cases the sample sizes are 100, 150, 200, 250, and 300.

Bibliography


Abrevaya, J. (1999a). Computation of the maximum rank correlation estimator. Economics Letters, 62(3):279–285.
Abrevaya, J. (1999b). Leapfrog estimation of a fixed-effects model with unknown transformation of the dependent variable. Journal of Econometrics, 93(2):203–228.
Abrevaya, J. (2003). Pairwise-difference rank estimation of the transformation model. Journal of Business & Economic Statistics, 21(3):437–447.
Bühlmann, P., Peters, J., and Ernest, J. (2014). CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556.
Cavanagh, C. L. and Sherman, R. P. (1998). Rank estimators for monotonic index models. Journal of Econometrics, 84:351–381.
Chen, S. (2002). Rank estimation of transformation models. Econometrica, 70(4):1683–1697.
Cuzick, J. (1988). Rank regression. The Annals of Statistics, 16(4):1369–1389.
Doksum, K. A. (1987). An Extension of Partial Likelihood Methods for Proportional Hazard Models to General Transformation Models. The Annals of Statistics, 15(1):325–345.
