Interpretable and intervenable ultrasonography-based machine learning models for pediatric appendicitis

Ričards Marcinkevičs; Patricia Reis Wolfertstetter; Ugne Klimiene; Kieran Chin-Cheong; Alyssia Paschke; Julia Zerres; Markus Denzinger; David Niederberger; Sven Wellmann; Ece Ozkan; Christian Knorr; Julia E. Vogt

arXiv:2302.14460·cs.LG·November 27, 2023

TL;DR

This paper develops interpretable machine learning models that use ultrasound images to predict the diagnosis, management, and severity of suspected pediatric appendicitis, letting clinicians inspect and intervene on high-level concepts while performing comparably to black-box neural networks.

Contribution

It extends concept bottleneck models (CBMs) to prediction problems with multiple views and incomplete concept sets, tailored to ultrasound-based appendicitis diagnosis.

Findings

01

The best models (SSMVCBM, Table 6) reached a test-set AUROC of 0.80 and AUPR of 0.92 for diagnosis prediction.

02

Models are interpretable and do not require extensive image annotation.

03

Performance is comparable to black-box neural networks.

Abstract

Appendicitis is among the most frequent reasons for pediatric abdominal surgeries. Previous decision support systems for appendicitis have focused on clinical, laboratory, scoring, and computed tomography data and have ignored abdominal ultrasound, despite its noninvasive nature and widespread availability. In this work, we present interpretable machine learning models for predicting the diagnosis, management and severity of suspected appendicitis using ultrasound images. Our approach utilizes concept bottleneck models (CBM) that facilitate interpretation and interaction with high-level concepts understandable to clinicians. Furthermore, we extend CBMs to prediction problems with multiple views and incomplete concept sets. Our models were trained on a dataset comprising 579 pediatric patients with 1709 ultrasound images accompanied by clinical and laboratory data. Results show that our…
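The interaction with concepts described in the abstract can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: the logistic target model, its weights, and the helper names `target_model` and `intervene` are all hypothetical, chosen only to show how replacing predicted concept values with clinician-provided ones changes the downstream prediction.

```python
import math

def target_model(concepts, weights, bias):
    """Toy target model: logistic regression on concept values (hypothetical weights)."""
    z = bias + sum(w * c for w, c in zip(weights, concepts))
    return 1.0 / (1.0 + math.exp(-z))

def intervene(predicted, corrections):
    """Return a copy of the predicted concepts with some entries overridden.

    `corrections` maps a concept index to the value supplied by a clinician.
    """
    return [corrections.get(k, value) for k, value in enumerate(predicted)]

# Hypothetical numbers, for illustration only.
weights, bias = [1.5, -0.5, 2.0], -1.0
c_hat = [0.2, 0.6, 0.3]                      # concept probabilities predicted from images
corrected = intervene(c_hat, {0: 1.0, 2: 1.0})  # clinician confirms concepts 0 and 2

before = target_model(c_hat, weights, bias)
after = target_model(corrected, weights, bias)
print(before, after)
```

Because the target prediction depends on the input only through the concept values, editing those values is enough to change the output; no retraining is involved in this sketch.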

Tables (16)

Table 1: Contingency table of the management (M) by severity (S), stratified by the diagnosis (D), for the pediatric appendicitis dataset.

	D: appendicitis
	complicated	uncomplicated	Total
surgical	97	135	232
conservative	0	151	151
Total	97	286	383
	D: no appendicitis
	complicated	uncomplicated	Total
surgical	0	2	2
conservative	0	194	194
Total	0	196	196
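The marginal totals implied by Table 1 can be recomputed from its cells. The sketch below copies the eight cell counts from the table and sums them along each variable; the dictionary layout and the helper name `marginal` are illustrative choices. The resulting proportions describe all patients in the table, so they need not coincide with the test-set random baselines reported in the performance tables.

```python
# Cell counts copied from Table 1: (diagnosis, management, severity) -> patients
counts = {
    ("appendicitis", "surgical", "complicated"): 97,
    ("appendicitis", "surgical", "uncomplicated"): 135,
    ("appendicitis", "conservative", "complicated"): 0,
    ("appendicitis", "conservative", "uncomplicated"): 151,
    ("no appendicitis", "surgical", "complicated"): 0,
    ("no appendicitis", "surgical", "uncomplicated"): 2,
    ("no appendicitis", "conservative", "complicated"): 0,
    ("no appendicitis", "conservative", "uncomplicated"): 194,
}

def marginal(axis, value):
    """Sum all cells whose label on the given axis equals `value`."""
    return sum(n for key, n in counts.items() if key[axis] == value)

total = sum(counts.values())
n_appendicitis = marginal(0, "appendicitis")
n_surgical = marginal(1, "surgical")
n_complicated = marginal(2, "complicated")

print("patients:", total)
print("appendicitis:", n_appendicitis, round(n_appendicitis / total, 3))
print("surgical:", n_surgical, round(n_surgical / total, 3))
print("complicated:", n_complicated, round(n_complicated / total, 3))
```

The total agrees with the 579 patients stated in the abstract.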

Table 2: Explanation and descriptive statistics for the concept variables chosen for the pediatric appendicitis dataset. All concept variables are binary. The right-most column reports the percentage of positive values.

	Name	Description	Pos., %
$c_{1}$	Visibility of the appendix	visibility of the vermiform appendix during the examination	76
$c_{2}$	Free intraperitoneal fluid	free fluids in the abdomen	43
$c_{3}$	Appendix layer structure	characterization of the appendix layers, e.g. irregular in case of an increasing inflammation	14
$c_{4}$	Target sign	axial image of the appendix with the fluid-filled center surrounded by echogenic mucosa and submucosa and hypoechoic muscularis	13
$c_{5}$	Surrounding tissue reaction	inflammation signs in tissue surrounding the appendix	33
$c_{6}$	Pathological lymph nodes	enlarged and inflamed intra-abdominal lymph nodes	21
$c_{7}$	Thickening of the bowel wall	edema of the intestinal wall, $>$ 2–3 mm	8
$c_{8}$	Coprostasis	fecal impaction in the colon	6
$c_{9}$	Meteorism	accumulation of gas in the intestine	15

Table 3: Models’ test-set performance at concept prediction on the pediatric appendicitis dataset with the diagnosis as the target variable. Test-set AUROCs and AUPRs are reported as averages and standard deviations across ten independent initializations. Herein, “seq” and “joint” denote sequential and joint optimization, respectively, whereas “avg” and “LSTM” stand for the averaging- and LSTM-based fusion. AUROCs and AUPRs that are significantly greater than the expected performance of a fair coin flip (random) are marked by “*”. Bold indicates the best result; italics indicates the second best. The meaning of the concept variables: $c_{1}$, visibility of the appendix; $c_{2}$, free intraperitoneal fluid; $c_{3}$, appendix layer structure; $c_{4}$, target sign; $c_{5}$, surrounding tissue reaction; $c_{6}$, pathological lymph nodes; $c_{7}$, thickening of the bowel wall; $c_{8}$, coprostasis; $c_{9}$, meteorism.

Metric	Model	Concept
Metric	Model	$c_{1}$	$c_{2}$	$c_{3}$	$c_{4}$	$c_{5}$	$c_{6}$	$c_{7}$	$c_{8}$	$c_{9}$
AUROC	Random	0.50	0.50	0.50	0.50	0.50	0.50	0.50	0.50	0.50
	CBM-seq	0.52 $\pm$ 0.04	0.47 $\pm$ 0.04	0.60 $\pm$ 0.07^*	0.56 $\pm$ 0.08	0.63 $\pm$ 0.05^*	0.57 $\pm$ 0.05^*	0.45 $\pm$ 0.08	0.48 $\pm$ 0.08	0.39 $\pm$ 0.07
	CBM-joint	0.50 $\pm$ 0.05	0.47 $\pm$ 0.03	0.57 $\pm$ 0.05^*	0.54 $\pm$ 0.06	0.64 $\pm$ 0.04^*	0.59 $\pm$ 0.05^*	0.39 $\pm$ 0.06	0.57 $\pm$ 0.12	0.38 $\pm$ 0.09
	MVCBM-seq-avg	0.61 $\pm$ 0.05^*	0.49 $\pm$ 0.05	0.66 $\pm$ 0.08^*	0.60 $\pm$ 0.08^*	0.51 $\pm$ 0.08	0.66 $\pm$ 0.08^*	0.50 $\pm$ 0.04	0.47 $\pm$ 0.12	0.55 $\pm$ 0.07
	MVCBM-seq-LSTM	0.83 $\pm$ 0.03^*	0.59 $\pm$ 0.03^*	0.62 $\pm$ 0.04^*	0.71 $\pm$ 0.04^*	0.65 $\pm$ 0.04^*	0.67 $\pm$ 0.07^*	0.49 $\pm$ 0.07	0.68 $\pm$ 0.10^*	0.73 $\pm$ 0.06^*
	MVCBM-joint-avg	0.55 $\pm$ 0.10	0.47 $\pm$ 0.07	0.73 $\pm$ 0.07^*	0.63 $\pm$ 0.07^*	0.61 $\pm$ 0.06^*	0.63 $\pm$ 0.07^*	0.48 $\pm$ 0.06	0.45 $\pm$ 0.13	0.54 $\pm$ 0.11
	MVCBM-joint-LSTM	0.85 $\pm$ 0.03^*	0.55 $\pm$ 0.04^*	0.58 $\pm$ 0.04^*	0.70 $\pm$ 0.03^*	0.75 $\pm$ 0.02^*	0.55 $\pm$ 0.09	0.45 $\pm$ 0.12	0.68 $\pm$ 0.17	0.77 $\pm$ 0.03^*
	SSMVCBM-avg	0.62 $\pm$ 0.05^*	0.60 $\pm$ 0.05^*	0.72 $\pm$ 0.05^*	0.67 $\pm$ 0.05^*	0.54 $\pm$ 0.05	0.68 $\pm$ 0.08^*	0.53 $\pm$ 0.11	0.43 $\pm$ 0.08	0.47 $\pm$ 0.07
	SSMVCBM-LSTM	0.85 $\pm$ 0.04^*	0.58 $\pm$ 0.06^*	0.66 $\pm$ 0.05^*	0.71 $\pm$ 0.06^*	0.67 $\pm$ 0.04^*	0.69 $\pm$ 0.06^*	0.45 $\pm$ 0.09	0.66 $\pm$ 0.11^*	0.73 $\pm$ 0.05^*
AUPR	Random	0.72	0.49	0.19	0.23	0.51	0.26	0.16	0.13	0.14
	CBM-seq	0.71 $\pm$ 0.03	0.53 $\pm$ 0.03^*	0.29 $\pm$ 0.06^*	0.26 $\pm$ 0.05	0.64 $\pm$ 0.05^*	0.38 $\pm$ 0.06^*	0.15 $\pm$ 0.03	0.12 $\pm$ 0.02	0.11 $\pm$ 0.02
	CBM-joint	0.73 $\pm$ 0.05	0.49 $\pm$ 0.04	0.30 $\pm$ 0.06^*	0.30 $\pm$ 0.08	0.64 $\pm$ 0.05^*	0.38 $\pm$ 0.09^*	0.15 $\pm$ 0.05	0.19 $\pm$ 0.08	0.11 $\pm$ 0.02
	MVCBM-seq-avg	0.79 $\pm$ 0.04^*	0.53 $\pm$ 0.06	0.34 $\pm$ 0.10^*	0.35 $\pm$ 0.10^*	0.53 $\pm$ 0.07	0.41 $\pm$ 0.07^*	0.17 $\pm$ 0.04	0.14 $\pm$ 0.04	0.25 $\pm$ 0.12
	MVCBM-seq-LSTM	0.92 $\pm$ 0.02^*	0.59 $\pm$ 0.04^*	0.32 $\pm$ 0.05	0.38 $\pm$ 0.04^*	0.67 $\pm$ 0.04^*	0.42 $\pm$ 0.10^*	0.15 $\pm$ 0.02	0.21 $\pm$ 0.08	0.40 $\pm$ 0.11^*
	MVCBM-joint-avg	0.75 $\pm$ 0.08	0.48 $\pm$ 0.06	0.38 $\pm$ 0.09^*	0.30 $\pm$ 0.06	0.58 $\pm$ 0.05^*	0.39 $\pm$ 0.08^*	0.21 $\pm$ 0.08	0.15 $\pm$ 0.08	0.16 $\pm$ 0.05
	MVCBM-joint-LSTM	0.94 $\pm$ 0.01^*	0.50 $\pm$ 0.05	0.26 $\pm$ 0.08	0.37 $\pm$ 0.07^*	0.74 $\pm$ 0.04^*	0.32 $\pm$ 0.09	0.16 $\pm$ 0.08	0.31 $\pm$ 0.20	0.28 $\pm$ 0.07^*
	SSMVCBM-avg	0.79 $\pm$ 0.04^*	0.58 $\pm$ 0.03^*	0.38 $\pm$ 0.05^*	0.34 $\pm$ 0.04^*	0.54 $\pm$ 0.06	0.42 $\pm$ 0.08^*	0.20 $\pm$ 0.06	0.12 $\pm$ 0.04	0.17 $\pm$ 0.07
	SSMVCBM-LSTM	0.93 $\pm$ 0.03^*	0.60 $\pm$ 0.06^*	0.31 $\pm$ 0.06^*	0.38 $\pm$ 0.06^*	0.67 $\pm$ 0.04^*	0.39 $\pm$ 0.06^*	0.19 $\pm$ 0.06	0.19 $\pm$ 0.07	0.30 $\pm$ 0.09^*
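To make the AUROC columns and their 0.50 "Random" baseline concrete, the following sketch computes the AUROC as the fraction of positive-negative pairs that are ranked correctly, with ties counted as one half. The function name `auroc` and the toy labels and scores are illustrative; the paper's own evaluation code is not shown in this excerpt.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative contribute 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0, 1]
perfect = auroc(labels, [0.9, 0.1, 0.8, 0.3, 0.7])   # every positive outranks every negative
constant = auroc(labels, [0.5] * len(labels))         # an uninformative constant score
print(perfect, constant)
```

An uninformative score lands on 0.5 regardless of class balance, which is why the AUROC baseline is identical for all nine concepts while the AUPR baseline varies with each concept's prevalence.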

Table 4: Models’ test-set performance at concept prediction on the appendicitis dataset with the management as the target variable. Test-set AUROCs and AUPRs are reported as averages and standard deviations across ten independent initializations. Herein, “seq” and “joint” denote sequential and joint optimization, respectively, whereas “avg” and “LSTM” stand for the averaging- and LSTM-based fusion. AUROCs and AUPRs that are significantly greater than the expected performance of a fair coin flip (random) are marked by “*”. Bold indicates the best result; italics indicates the second best. The meaning of the concept variables: $c_{1}$, visibility of the appendix; $c_{2}$, free intraperitoneal fluid; $c_{3}$, appendix layer structure; $c_{4}$, target sign; $c_{5}$, surrounding tissue reaction; $c_{6}$, pathological lymph nodes; $c_{7}$, thickening of the bowel wall; $c_{8}$, coprostasis; $c_{9}$, meteorism.

Metric	Model	Concept
Metric	Model	$c_{1}$	$c_{2}$	$c_{3}$	$c_{4}$	$c_{5}$	$c_{6}$	$c_{7}$	$c_{8}$	$c_{9}$
AUROC	Random	0.50	0.50	0.50	0.50	0.50	0.50	0.50	0.50	0.50
	CBM-seq	0.51 $\pm$ 0.05	0.54 $\pm$ 0.07	0.63 $\pm$ 0.05^*	0.49 $\pm$ 0.07	0.65 $\pm$ 0.07^*	0.56 $\pm$ 0.06	0.47 $\pm$ 0.10	0.60 $\pm$ 0.10	0.54 $\pm$ 0.07
	CBM-joint	0.54 $\pm$ 0.08	0.51 $\pm$ 0.08	0.64 $\pm$ 0.06^*	0.49 $\pm$ 0.06	0.67 $\pm$ 0.03^*	0.54 $\pm$ 0.07	0.49 $\pm$ 0.07	0.56 $\pm$ 0.10	0.47 $\pm$ 0.09
	MVCBM-seq-avg	0.62 $\pm$ 0.06^*	0.48 $\pm$ 0.07	0.69 $\pm$ 0.03^*	0.54 $\pm$ 0.12	0.49 $\pm$ 0.08	0.60 $\pm$ 0.07^*	0.48 $\pm$ 0.09	0.47 $\pm$ 0.13	0.57 $\pm$ 0.09
	MVCBM-seq-LSTM	0.86 $\pm$ 0.05^*	0.55 $\pm$ 0.05	0.62 $\pm$ 0.05^*	0.69 $\pm$ 0.03^*	0.66 $\pm$ 0.04^*	0.65 $\pm$ 0.06^*	0.50 $\pm$ 0.07	0.75 $\pm$ 0.09^*	0.74 $\pm$ 0.06^*
	MVCBM-joint-avg	0.52 $\pm$ 0.07	0.53 $\pm$ 0.06	0.71 $\pm$ 0.07^*	0.59 $\pm$ 0.05^*	0.64 $\pm$ 0.07^*	0.65 $\pm$ 0.04^*	0.48 $\pm$ 0.10	0.54 $\pm$ 0.07	0.52 $\pm$ 0.15
	MVCBM-joint-LSTM	0.80 $\pm$ 0.05^*	0.41 $\pm$ 0.08	0.66 $\pm$ 0.07^*	0.61 $\pm$ 0.04^*	0.66 $\pm$ 0.03^*	0.62 $\pm$ 0.07^*	0.51 $\pm$ 0.07	0.62 $\pm$ 0.11	0.63 $\pm$ 0.08^*
	SSMVCBM-avg	0.62 $\pm$ 0.07^*	0.57 $\pm$ 0.08	0.73 $\pm$ 0.04^*	0.63 $\pm$ 0.05^*	0.55 $\pm$ 0.04	0.65 $\pm$ 0.07^*	0.50 $\pm$ 0.08	0.49 $\pm$ 0.08	0.52 $\pm$ 0.05
	SSMVCBM-LSTM	0.84 $\pm$ 0.02^*	0.54 $\pm$ 0.05	0.70 $\pm$ 0.05^*	0.70 $\pm$ 0.03^*	0.68 $\pm$ 0.05^*	0.62 $\pm$ 0.07^*	0.50 $\pm$ 0.10	0.72 $\pm$ 0.05^*	0.72 $\pm$ 0.10^*
AUPR	Random	0.72	0.49	0.19	0.23	0.51	0.26	0.16	0.13	0.14
	CBM-seq	0.76 $\pm$ 0.03	0.55 $\pm$ 0.07	0.37 $\pm$ 0.09^*	0.23 $\pm$ 0.03	0.66 $\pm$ 0.07^*	0.35 $\pm$ 0.10	0.19 $\pm$ 0.06	0.20 $\pm$ 0.13	0.17 $\pm$ 0.03
	CBM-joint	0.77 $\pm$ 0.04^*	0.51 $\pm$ 0.06	0.45 $\pm$ 0.08^*	0.24 $\pm$ 0.07	0.64 $\pm$ 0.04^*	0.29 $\pm$ 0.04	0.19 $\pm$ 0.05	0.17 $\pm$ 0.09	0.15 $\pm$ 0.06
	MVCBM-seq-avg	0.79 $\pm$ 0.04^*	0.52 $\pm$ 0.08	0.35 $\pm$ 0.04^*	0.31 $\pm$ 0.14	0.51 $\pm$ 0.06	0.37 $\pm$ 0.08^*	0.17 $\pm$ 0.04	0.12 $\pm$ 0.04	0.18 $\pm$ 0.05
	MVCBM-seq-LSTM	0.95 $\pm$ 0.02^*	0.55 $\pm$ 0.03^*	0.32 $\pm$ 0.08^*	0.38 $\pm$ 0.04^*	0.66 $\pm$ 0.03^*	0.38 $\pm$ 0.09^*	0.16 $\pm$ 0.02	0.30 $\pm$ 0.16	0.30 $\pm$ 0.06^*
	MVCBM-joint-avg	0.71 $\pm$ 0.04	0.53 $\pm$ 0.05	0.36 $\pm$ 0.10^*	0.28 $\pm$ 0.03^*	0.60 $\pm$ 0.07^*	0.39 $\pm$ 0.06^*	0.17 $\pm$ 0.05	0.20 $\pm$ 0.07	0.21 $\pm$ 0.10
	MVCBM-joint-LSTM	0.91 $\pm$ 0.03^*	0.44 $\pm$ 0.05	0.31 $\pm$ 0.06^*	0.33 $\pm$ 0.06^*	0.64 $\pm$ 0.03^*	0.38 $\pm$ 0.06^*	0.19 $\pm$ 0.04	0.19 $\pm$ 0.11	0.28 $\pm$ 0.14
	SSMVCBM-avg	0.78 $\pm$ 0.06	0.60 $\pm$ 0.07^*	0.41 $\pm$ 0.08^*	0.33 $\pm$ 0.08^*	0.55 $\pm$ 0.05	0.39 $\pm$ 0.07^*	0.22 $\pm$ 0.06	0.12 $\pm$ 0.02	0.23 $\pm$ 0.08
	SSMVCBM-LSTM	0.93 $\pm$ 0.01^*	0.55 $\pm$ 0.06	0.38 $\pm$ 0.09^*	0.37 $\pm$ 0.06^*	0.67 $\pm$ 0.06^*	0.35 $\pm$ 0.06^*	0.17 $\pm$ 0.05	0.24 $\pm$ 0.05^*	0.27 $\pm$ 0.08^*

Table 5: Models’ test-set performance at concept prediction on the appendicitis dataset with the severity as the target variable. Test-set AUROCs and AUPRs are reported as averages and standard deviations across ten independent initializations. Herein, “seq” and “joint” denote sequential and joint optimization, respectively, whereas “avg” and “LSTM” stand for the averaging- and LSTM-based fusion. AUROCs and AUPRs that are significantly greater than the expected performance of a fair coin flip (random) are marked by “*”. Bold indicates the best result; italics indicates the second best. The meaning of the concept variables: $c_{1}$, visibility of the appendix; $c_{2}$, free intraperitoneal fluid; $c_{3}$, appendix layer structure; $c_{4}$, target sign; $c_{5}$, surrounding tissue reaction; $c_{6}$, pathological lymph nodes; $c_{7}$, thickening of the bowel wall; $c_{8}$, coprostasis; $c_{9}$, meteorism.

Metric	Model	Concept
Metric	Model	$c_{1}$	$c_{2}$	$c_{3}$	$c_{4}$	$c_{5}$	$c_{6}$	$c_{7}$	$c_{8}$	$c_{9}$
AUROC	Random	0.50	0.50	0.50	0.50	0.50	0.50	0.50	0.50	0.50
	CBM-seq	0.51 $\pm$ 0.04	0.58 $\pm$ 0.06^*	0.61 $\pm$ 0.08^*	0.52 $\pm$ 0.09	0.62 $\pm$ 0.04^*	0.62 $\pm$ 0.05^*	0.47 $\pm$ 0.09	0.57 $\pm$ 0.11	0.50 $\pm$ 0.08
	CBM-joint	0.55 $\pm$ 0.06	0.46 $\pm$ 0.06	0.66 $\pm$ 0.06^*	0.47 $\pm$ 0.06	0.64 $\pm$ 0.04^*	0.53 $\pm$ 0.07	0.50 $\pm$ 0.07	0.58 $\pm$ 0.10^*	0.49 $\pm$ 0.04
	MVCBM-seq-avg	0.54 $\pm$ 0.08	0.55 $\pm$ 0.04	0.72 $\pm$ 0.07^*	0.62 $\pm$ 0.04^*	0.50 $\pm$ 0.05	0.64 $\pm$ 0.06^*	0.51 $\pm$ 0.10	0.47 $\pm$ 0.11	0.54 $\pm$ 0.10
	MVCBM-seq-LSTM	0.82 $\pm$ 0.04^*	0.53 $\pm$ 0.04	0.62 $\pm$ 0.04^*	0.69 $\pm$ 0.04^*	0.62 $\pm$ 0.05^*	0.72 $\pm$ 0.05^*	0.64 $\pm$ 0.06^*	0.78 $\pm$ 0.03^*	0.70 $\pm$ 0.06^*
	MVCBM-joint-avg	0.54 $\pm$ 0.09	0.51 $\pm$ 0.06	0.70 $\pm$ 0.06^*	0.59 $\pm$ 0.08^*	0.61 $\pm$ 0.06^*	0.62 $\pm$ 0.05^*	0.54 $\pm$ 0.15	0.48 $\pm$ 0.14	0.55 $\pm$ 0.12
	MVCBM-joint-LSTM	0.82 $\pm$ 0.03^*	0.48 $\pm$ 0.06	0.66 $\pm$ 0.07^*	0.64 $\pm$ 0.06^*	0.65 $\pm$ 0.05^*	0.64 $\pm$ 0.09^*	0.47 $\pm$ 0.09	0.61 $\pm$ 0.14	0.65 $\pm$ 0.05^*
	SSMVCBM-avg	0.53 $\pm$ 0.06^*	0.56 $\pm$ 0.08^*	0.71 $\pm$ 0.05^*	0.60 $\pm$ 0.06^*	0.51 $\pm$ 0.05	0.64 $\pm$ 0.09^*	0.46 $\pm$ 0.08	0.48 $\pm$ 0.09	0.53 $\pm$ 0.03
	SSMVCBM-LSTM	0.77 $\pm$ 0.10^*	0.59 $\pm$ 0.08	0.70 $\pm$ 0.06^*	0.67 $\pm$ 0.07^*	0.65 $\pm$ 0.07^*	0.67 $\pm$ 0.05^*	0.62 $\pm$ 0.08^*	0.74 $\pm$ 0.15^*	0.64 $\pm$ 0.11^*
AUPR	Random	0.72	0.49	0.19	0.23	0.51	0.26	0.16	0.13	0.14
	CBM-seq	0.75 $\pm$ 0.03	0.58 $\pm$ 0.05^*	0.34 $\pm$ 0.09^*	0.24 $\pm$ 0.05	0.64 $\pm$ 0.04^*	0.35 $\pm$ 0.06^*	0.18 $\pm$ 0.05	0.19 $\pm$ 0.07	0.15 $\pm$ 0.03
	CBM-joint	0.77 $\pm$ 0.05	0.47 $\pm$ 0.04	0.37 $\pm$ 0.09^*	0.25 $\pm$ 0.06	0.64 $\pm$ 0.05^*	0.30 $\pm$ 0.07	0.17 $\pm$ 0.04	0.18 $\pm$ 0.06	0.18 $\pm$ 0.08
	MVCBM-seq-avg	0.75 $\pm$ 0.05	0.58 $\pm$ 0.06^*	0.42 $\pm$ 0.07^*	0.33 $\pm$ 0.06^*	0.53 $\pm$ 0.05	0.41 $\pm$ 0.08^*	0.21 $\pm$ 0.05	0.13 $\pm$ 0.05	0.24 $\pm$ 0.12
	MVCBM-seq-LSTM	0.91 $\pm$ 0.04^*	0.55 $\pm$ 0.04^*	0.33 $\pm$ 0.08^*	0.40 $\pm$ 0.06^*	0.65 $\pm$ 0.03^*	0.50 $\pm$ 0.11^*	0.23 $\pm$ 0.05^*	0.27 $\pm$ 0.05^*	0.26 $\pm$ 0.07^*
	MVCBM-joint-avg	0.74 $\pm$ 0.06	0.51 $\pm$ 0.07	0.42 $\pm$ 0.09^*	0.28 $\pm$ 0.07	0.59 $\pm$ 0.06^*	0.35 $\pm$ 0.05^*	0.22 $\pm$ 0.06	0.22 $\pm$ 0.13	0.21 $\pm$ 0.08
	MVCBM-joint-LSTM	0.92 $\pm$ 0.02^*	0.49 $\pm$ 0.05	0.37 $\pm$ 0.11^*	0.32 $\pm$ 0.07^*	0.65 $\pm$ 0.06^*	0.39 $\pm$ 0.07^*	0.20 $\pm$ 0.06	0.17 $\pm$ 0.07	0.21 $\pm$ 0.06^*
	SSMVCBM-avg	0.73 $\pm$ 0.05	0.58 $\pm$ 0.07^*	0.36 $\pm$ 0.05^*	0.28 $\pm$ 0.04^*	0.53 $\pm$ 0.05	0.37 $\pm$ 0.09^*	0.20 $\pm$ 0.06	0.13 $\pm$ 0.02	0.24 $\pm$ 0.06^*
	SSMVCBM-LSTM	0.88 $\pm$ 0.06^*	0.60 $\pm$ 0.06^*	0.42 $\pm$ 0.06^*	0.39 $\pm$ 0.09^*	0.67 $\pm$ 0.07^*	0.43 $\pm$ 0.10^*	0.24 $\pm$ 0.08	0.30 $\pm$ 0.13^*	0.20 $\pm$ 0.05^*

Table 6: Models’ test-set performance at predicting diagnosis, management, and severity. Test-set AUROCs, AUPRs, and Brier scores are reported as averages and standard deviations across ten independent initializations. Bold indicates the best result; italics indicates the second best.

Model	Diagnosis			Management			Severity
Model	AUROC	AUPR	Brier	AUROC	AUPR	Brier	AUROC	AUPR	Brier
Random	0.50	0.75	0.25	0.50	0.47	0.25	0.50	0.23	0.25
Radiomics + RF	0.64 $\pm$ 0.02	0.82 $\pm$ 0.01	0.22 $\pm$ 0.00	0.65 $\pm$ 0.01	0.60 $\pm$ 0.02	0.24 $\pm$ 0.00	0.77 $\pm$ 0.02	0.58 $\pm$ 0.04	0.15 $\pm$ 0.00
ResNet-18	0.70 $\pm$ 0.07	0.88 $\pm$ 0.04	0.25 $\pm$ 0.08	0.69 $\pm$ 0.07	0.71 $\pm$ 0.08	0.27 $\pm$ 0.05	0.73 $\pm$ 0.10	0.52 $\pm$ 0.10	0.18 $\pm$ 0.04
CBM-seq	0.64 $\pm$ 0.06	0.84 $\pm$ 0.04	0.22 $\pm$ 0.02	0.68 $\pm$ 0.05	0.68 $\pm$ 0.05	0.23 $\pm$ 0.02	0.66 $\pm$ 0.06	0.41 $\pm$ 0.08	0.23 $\pm$ 0.04
CBM-joint	0.62 $\pm$ 0.04	0.83 $\pm$ 0.04	0.24 $\pm$ 0.02	0.66 $\pm$ 0.06	0.68 $\pm$ 0.04	0.23 $\pm$ 0.02	0.68 $\pm$ 0.06	0.44 $\pm$ 0.08	0.23 $\pm$ 0.02
MVBM-avg	0.76 $\pm$ 0.05	0.89 $\pm$ 0.04	0.22 $\pm$ 0.03	0.71 $\pm$ 0.04	0.69 $\pm$ 0.04	0.24 $\pm$ 0.02	0.71 $\pm$ 0.12	0.59 $\pm$ 0.11	0.20 $\pm$ 0.05
MVBM-LSTM	0.76 $\pm$ 0.04	0.91 $\pm$ 0.02	0.23 $\pm$ 0.02	0.67 $\pm$ 0.04	0.61 $\pm$ 0.04	0.23 $\pm$ 0.02	0.74 $\pm$ 0.13	0.58 $\pm$ 0.12	0.22 $\pm$ 0.07
MVCBM-seq-avg	0.67 $\pm$ 0.05	0.85 $\pm$ 0.05	0.23 $\pm$ 0.02	0.58 $\pm$ 0.05	0.62 $\pm$ 0.06	0.26 $\pm$ 0.02	0.75 $\pm$ 0.07	0.56 $\pm$ 0.12	0.23 $\pm$ 0.04
MVCBM-seq-LSTM	0.73 $\pm$ 0.03	0.89 $\pm$ 0.01	0.24 $\pm$ 0.04	0.57 $\pm$ 0.03	0.53 $\pm$ 0.04	0.26 $\pm$ 0.01	0.70 $\pm$ 0.11	0.48 $\pm$ 0.16	0.21 $\pm$ 0.03
MVCBM-joint-avg	0.66 $\pm$ 0.09	0.84 $\pm$ 0.06	0.24 $\pm$ 0.06	0.69 $\pm$ 0.06	0.66 $\pm$ 0.11	0.23 $\pm$ 0.02	0.70 $\pm$ 0.06	0.53 $\pm$ 0.11	0.24 $\pm$ 0.02
MVCBM-joint-LSTM	0.72 $\pm$ 0.02	0.88 $\pm$ 0.02	0.22 $\pm$ 0.01	0.57 $\pm$ 0.05	0.50 $\pm$ 0.04	0.26 $\pm$ 0.01	0.65 $\pm$ 0.07	0.37 $\pm$ 0.10	0.24 $\pm$ 0.02
SSMVCBM-avg	0.80 $\pm$ 0.03	0.92 $\pm$ 0.02	0.20 $\pm$ 0.03	0.72 $\pm$ 0.05	0.72 $\pm$ 0.04	0.27 $\pm$ 0.05	0.73 $\pm$ 0.07	0.57 $\pm$ 0.09	0.17 $\pm$ 0.02
SSMVCBM-LSTM	0.80 $\pm$ 0.06	0.92 $\pm$ 0.04	0.19 $\pm$ 0.04	0.70 $\pm$ 0.03	0.67 $\pm$ 0.06	0.27 $\pm$ 0.04	0.78 $\pm$ 0.05	0.58 $\pm$ 0.10	0.21 $\pm$ 0.10

Table 7 (Table E.1): Summary of the MVCBM architectures used for the (a) synthetic and (b) MVAwA and pediatric appendicitis datasets. Here, $B$ denotes the batch size, $V$ the maximum number of views, $w$ and $h$ the width and height of the input image, $K$ the number of concepts, $H$ the number of units in the hidden layer of $f_{\boldsymbol{\theta}}(\cdot)$, and $N_{o}$ the number of output units.

Module	Layers	Input Dimensions	Output Dimensions
$𝒉_{𝝍} (\cdot)$	Linear	( $B$ , $V$ , 500)	( $B$ , $V$ , 256)
	Dropout(0.05)	( $B$ , $V$ , 256)	( $B$ , $V$ , 256)
	BatchNorm1d	( $B$ , $V$ , 256)	( $B$ , $V$ , 256)
	Linear	( $B$ , $V$ , 256)	( $B$ , $V$ , 256)
	Dropout(0.05)	( $B$ , $V$ , 256)	( $B$ , $V$ , 256)
	BatchNorm1d	( $B$ , $V$ , 256)	( $B$ , $V$ , 256)
	Linear	( $B$ , $V$ , 256)	( $B$ , $V$ , 256)
	Dropout(0.05)	( $B$ , $V$ , 256)	( $B$ , $V$ , 256)
	BatchNorm1d	( $B$ , $V$ , 256)	( $B$ , $V$ , 256)
	Linear	( $B$ , $V$ , 256)	( $B$ , $V$ , 128)
$𝒓_{𝝃} (\cdot)$	LSTM/mean	( $B$ , $V$ , 128)	( $B$ , 128)
$𝒔_{𝜻} (\cdot)$	Linear	( $B$ , 128)	( $B$ , 256)
	ReLU	( $B$ , 256)	( $B$ , 256)
	Linear	( $B$ , 256)	( $B$ , 64)
	ReLU	( $B$ , 64)	( $B$ , 64)
	Linear	( $B$ , 64)	( $B$ , $K$ )
	Sigmoid	( $B$ , $K$ )	( $B$ , $K$ )
$f_{𝜽} (\cdot)$	Linear	( $B$ , $K$ )	( $B$ , $H$ )
	ReLU	( $B$ , $H$ )	( $B$ , $H$ )
	Linear	( $B$ , $H$ )	( $B$ , 1)
	Sigmoid	( $B$ , 1)	( $B$ , 1)
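As a quick consistency check on the layer table above, the sketch below chains the listed Linear widths, counts their parameters, and records the tensor shape after each module. It rests on assumptions that the table does not state: every Linear layer is taken to have a bias term, and the BatchNorm and LSTM parameters are left out. The values of `B`, `V`, and `K` in the usage lines are arbitrary placeholders.

```python
# Linear-layer widths read off the table; K and H stay as arguments.
ENCODER_WIDTHS = [500, 256, 256, 256, 128]   # per-view encoder h_psi

def concept_head_widths(K):
    """Widths of the concept head s_zeta, which ends in a sigmoid over K concepts."""
    return [128, 256, 64, K]

def target_head_widths(K, H):
    """Widths of the target model f_theta, which ends in a single sigmoid output."""
    return [K, H, 1]

def linear_params(widths, bias=True):
    """Count weights (and, by assumption, biases) in a chain of Linear layers."""
    return sum(i * o + (o if bias else 0) for i, o in zip(widths, widths[1:]))

def output_shapes(B, V, K):
    """Shape after each module; the fusion step removes the view axis."""
    return {
        "h_psi": (B, V, ENCODER_WIDTHS[-1]),
        "r_xi": (B, ENCODER_WIDTHS[-1]),
        "s_zeta": (B, K),
        "f_theta": (B, 1),
    }

encoder_params = linear_params(ENCODER_WIDTHS)
shapes = output_shapes(B=64, V=3, K=9)   # placeholder sizes for illustration
print(encoder_params)
print(shapes)
```

The encoder's last width (128) matches the fusion module's input, and the fusion output matches the first width of the concept head, so the dimensions in the table chain without gaps.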

Table 8 (Table E.2): Final hyperparameter values used to train models on the synthetic nonlinear data. The meaning of the hyperparameters: $E_{\boldsymbol{c}}$, the number of training epochs for the concept model; $C$, the number of iterations in the adversarial training procedure for the SSMVCBM; $E_{\boldsymbol{z}}$, the number of training epochs for the representation learning module; $E_{a}$, the number of training epochs for the adversary; $E_{y}$, the number of training epochs for the target model or the full model; $\eta_{\boldsymbol{c}}$, the learning rate (LR) for the concept model; $\eta_{\boldsymbol{z}}$, the LR for the representation learning module; $\eta_{a}$, the LR for the adversary; $\eta_{y}$, the LR for the target or the full model; $B$, the mini-batch size; $\alpha$, a parameter controlling the trade-off between target and concept prediction in the joint optimization; $\lambda$, the weight of the adversarial regularizer in the loss function of the SSMVCBM.

Model	Hyperparameter
Model	$E_{𝒄}$	$C$	$E_{𝒛}$	$E_{a}$	$E_{y}$	$η_{𝒄}$	$η_{𝒛}$	$η_{a}$	$η_{y}$	$B$	$α$	$λ$
MLP	—	—	—	—	150	—	—	—	1.0e-3	64	—	—
CBM-seq	100	—	—	—	50	1.0e-3	—	—	1.0e-3	64	—	—
CBM-joint	—	—	—	—	120	—	—	—	1.0e-4	64	1.0	—
MVBM-avg	—	—	—	—	150	—	—	—	1.0e-3	64	—	—
MVBM-LSTM	—	—	—	—	150	—	—	—	1.0e-3	64	—	—
MVCBM-seq-avg	100	—	—	—	50	1.0e-3	—	—	1.0e-3	64	—	—
MVCBM-seq-LSTM	100	—	—	—	50	1.0e-3	—	—	1.0e-3	64	—	—
MVCBM-joint-avg	—	—	—	—	120	—	—	—	1.0e-4	64	1.0	—
MVCBM-joint-LSTM	—	—	—	—	120	—	—	—	1.0e-4	64	1.0	—
SSMVCBM-avg	100	7	30	30	50	1.0e-3	1.0e-3	1.0e-3	1.0e-3	64	—	1.0e-2
SSMVCBM-LSTM	100	7	30	30	50	1.0e-3	1.0e-3	1.0e-3	1.0e-3	64	—	1.0e-2

Table 9 (Table E.3): Final hyperparameter values used to train models on the multiview animals with attributes (MVAwA) data.

Model	Hyperparameter
Model	$E_{𝒄}$	$C$	$E_{𝒛}$	$E_{a}$	$E_{y}$	$η_{𝒄}$	$η_{𝒛}$	$η_{a}$	$η_{y}$	$B$	$α$	$λ$
ResNet-18	—	—	—	—	120	—	—	—	1.0e-4	64	—	—
CBM-seq	25	—	—	—	20	1.0e-4	—	—	1.0e-2	64	—	—
CBM-joint	—	—	—	—	120	—	—	—	1.0e-4	64	1.0	—
MVBM-avg	—	—	—	—	120	—	—	—	1.0e-4	64	—	—
MVBM-LSTM	—	—	—	—	120	—	—	—	1.0e-4	64	—	—
MVCBM-seq-avg	25	—	—	—	20	1.0e-4	—	—	1.0e-2	64	—	—
MVCBM-seq-LSTM	25	—	—	—	20	1.0e-4	—	—	1.0e-2	64	—	—
MVCBM-joint-avg	—	—	—	—	120	—	—	—	1.0e-4	64	1.0	—
MVCBM-joint-LSTM	—	—	—	—	120	—	—	—	1.0e-4	64	1.0	—
SSMVCBM-avg	25	7	15	10	20	1.0e-4	1.0e-4	1.0e-2	1.0e-2	64	—	1.0e-2
SSMVCBM-LSTM	25	7	15	10	20	1.0e-4	1.0e-4	1.0e-2	1.0e-2	64	—	1.0e-2

Table 10 (Table E.4): Final hyperparameter values used to train models on the appendicitis data with the diagnosis as the target.

Model	Hyperparameter
Model	$E_{𝒄}$	$C$	$E_{𝒛}$	$E_{a}$	$E_{y}$	$η_{𝒄}$	$η_{𝒛}$	$η_{a}$	$η_{y}$	$B$	$α$	$λ$
ResNet-18	—	—	—	—	120	—	—	—	1.0e-4	4	—	—
CBM-seq	25	—	—	—	20	1.0e-4	—	—	1.0e-2	4	—	—
CBM-joint	—	—	—	—	120	—	—	—	1.0e-4	4	1.0	—
MVBM-avg	—	—	—	—	120	—	—	—	1.0e-4	4	—	—
MVBM-LSTM	—	—	—	—	50	—	—	—	1.0e-4	4	—	—
MVCBM-seq-avg	20	—	—	—	20	1.0e-4	—	—	1.0e-2	4	—	—
MVCBM-seq-LSTM	20	—	—	—	20	1.0e-4	—	—	1.0e-2	4	—	—
MVCBM-joint-avg	—	—	—	—	120	—	—	—	1.0e-4	4	1.0	—
MVCBM-joint-LSTM	—	—	—	—	40	—	—	—	1.0e-3	4	1.0	—
SSMVCBM-avg	20	7	15	10	20	1.0e-4	1.0e-4	1.0e-2	1.0e-2	8	—	1.0e-2
SSMVCBM-LSTM	20	7	15	10	20	1.0e-4	1.0e-4	1.0e-2	1.0e-2	8	—	1.0e-2

Table 11 (Table E.5): Final hyperparameter values used to train models on the appendicitis data with the management as the target.

Model	Hyperparameter
Model	$E_{𝒄}$	$C$	$E_{𝒛}$	$E_{a}$	$E_{y}$	$η_{𝒄}$	$η_{𝒛}$	$η_{a}$	$η_{y}$	$B$	$α$	$λ$
ResNet-18	—	—	—	—	120	—	—	—	1.0e-4	4	—	—
CBM-seq	25	—	—	—	20	1.0e-4	—	—	1.0e-2	4	—	—
CBM-joint	—	—	—	—	120	—	—	—	1.0e-4	4	1.0	—
MVBM-avg	—	—	—	—	120	—	—	—	1.0e-4	4	—	—
MVBM-LSTM	—	—	—	—	50	—	—	—	1.0e-4	4	—	—
MVCBM-seq-avg	20	—	—	—	20	1.0e-4	—	—	1.0e-2	4	—	—
MVCBM-seq-LSTM	20	—	—	—	20	1.0e-4	—	—	1.0e-2	4	—	—
MVCBM-joint-avg	—	—	—	—	120	—	—	—	1.0e-4	4	1.0	—
MVCBM-joint-LSTM	—	—	—	—	120	—	—	—	1.0e-4	4	1.0	—
SSMVCBM-avg	20	7	15	10	20	1.0e-4	1.0e-4	1.0e-2	1.0e-2	8	—	1.0e-2
SSMVCBM-LSTM	20	7	15	10	20	1.0e-4	1.0e-4	1.0e-2	1.0e-2	8	—	1.0e-2

Table 12 (Table E.6): Final hyperparameter values used to train models on the appendicitis data with the severity as the target.

Model	Hyperparameter
Model	$E_{𝒄}$	$C$	$E_{𝒛}$	$E_{a}$	$E_{y}$	$η_{𝒄}$	$η_{𝒛}$	$η_{a}$	$η_{y}$	$B$	$α$	$λ$
ResNet-18	—	—	—	—	120	—	—	—	1.0e-4	4	—	—
CBM-seq	25	—	—	—	20	1.0e-4	—	—	1.0e-2	4	—	—
CBM-joint	—	—	—	—	120	—	—	—	1.0e-4	4	1.0	—
MVBM-avg	—	—	—	—	120	—	—	—	1.0e-4	4	—	—
MVBM-LSTM	—	—	—	—	70	—	—	—	1.0e-4	4	—	—
MVCBM-seq-avg	30	—	—	—	40	1.0e-4	—	—	1.0e-3	4	—	—
MVCBM-seq-LSTM	30	—	—	—	40	1.0e-4	—	—	1.0e-3	4	—	—
MVCBM-joint-avg	—	—	—	—	100	—	—	—	1.0e-4	4	1.0	—
MVCBM-joint-LSTM	—	—	—	—	100	—	—	—	1.0e-4	4	1.0	—
SSMVCBM-avg	20	7	15	10	20	1.0e-4	1.0e-4	1.0e-2	1.0e-2	8	—	1.0e-2
SSMVCBM-LSTM	20	7	15	10	20	1.0e-4	1.0e-4	1.0e-2	1.0e-2	8	—	1.0e-2

Table 13 (Table F.1): Target and concept prediction results for MVCBM and SSMVCBM models and several baselines under different optimization procedures and fusion functions on the MVAwA data with the full concept set. The performance is reported as averages and standard deviations of the AUROC across ten independent simulations. Herein, “seq” and “joint” denote sequential and joint optimization, respectively, whereas “avg” and “LSTM” stand for the averaging- and LSTM-based fusion. Bold indicates the best result; italics indicates the second best.

Model	AUROC
Model	Target	Concepts
Random	0.50	0.50
ResNet-18	0.85 $\pm$ 0.01	—
CBM-seq	0.81 $\pm$ 0.01	0.86 $\pm$ 0.01
CBM-joint	0.84 $\pm$ 0.01	0.85 $\pm$ 0.01
MVBM-avg	0.96 $\pm$ 0.00	—
MVBM-LSTM	0.95 $\pm$ 0.00	—
MVCBM-seq-avg	0.95 $\pm$ 0.01	0.97 $\pm$ 0.00
MVCBM-seq-LSTM	0.92 $\pm$ 0.01	0.95 $\pm$ 0.00
MVCBM-joint-avg	0.94 $\pm$ 0.01	0.96 $\pm$ 0.00
MVCBM-joint-LSTM	0.95 $\pm$ 0.00	0.96 $\pm$ 0.01
SSMVCBM-avg	0.94 $\pm$ 0.00	0.97 $\pm$ 0.00
SSMVCBM-LSTM	0.92 $\pm$ 0.01	0.95 $\pm$ 0.00

Table 14 (Table F.2): Test-set performance of the multiview concept bottleneck models using the LSTM-based fusion. Models were evaluated on the test set with the views ordered chronologically, as in the training set, and after shuffling the views. AUROCs and AUPRs are reported as averages and standard deviations across ten independent initializations. Bold indicates the best result; italics indicates the second best.

Model	Shuffled?	Diagnosis		Management		Severity
Model	Shuffled?	AUROC	AUPR	AUROC	AUPR	AUROC	AUPR
Random	—	0.50	0.75	0.50	0.47	0.50	0.23
MVCBM-seq-LSTM	no	0.73 $\pm$ 0.03	0.89 $\pm$ 0.01	0.57 $\pm$ 0.03	0.53 $\pm$ 0.04	0.70 $\pm$ 0.11	0.48 $\pm$ 0.16
MVCBM-seq-LSTM	yes	0.69 $\pm$ 0.04	0.88 $\pm$ 0.02	0.56 $\pm$ 0.05	0.57 $\pm$ 0.08	0.55 $\pm$ 0.14	0.29 $\pm$ 0.11
MVCBM-joint-LSTM	no	0.72 $\pm$ 0.02	0.88 $\pm$ 0.02	0.57 $\pm$ 0.05	0.50 $\pm$ 0.04	0.65 $\pm$ 0.07	0.37 $\pm$ 0.10
MVCBM-joint-LSTM	yes	0.73 $\pm$ 0.01	0.89 $\pm$ 0.01	0.52 $\pm$ 0.04	0.48 $\pm$ 0.05	0.47 $\pm$ 0.07	0.22 $\pm$ 0.07
SSMVCBM-LSTM	no	0.80 $\pm$ 0.06	0.92 $\pm$ 0.04	0.70 $\pm$ 0.03	0.67 $\pm$ 0.06	0.78 $\pm$ 0.05	0.58 $\pm$ 0.10
SSMVCBM-LSTM	yes	0.72 $\pm$ 0.05	0.88 $\pm$ 0.04	0.55 $\pm$ 0.05	0.51 $\pm$ 0.07	0.62 $\pm$ 0.10	0.36 $\pm$ 0.11
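The shuffling experiment above probes whether a fusion function depends on view order. The sketch below contrasts an average, which cannot change under reordering, with a toy recurrence whose final state does. The recurrence is a deliberately simplified stand-in for an order-aware fusion, not an LSTM, and its coefficients `w_h` and `w_x` as well as the scalar "views" are hypothetical.

```python
import math

def mean_fusion(views):
    """Average scalar view features; the result ignores view order."""
    if not views:
        raise ValueError("need at least one view")
    return sum(views) / len(views)

def recurrent_fusion(views, w_h=0.5, w_x=1.0):
    """Toy recurrence h <- tanh(w_h * h + w_x * x); the final state depends on order."""
    h = 0.0
    for x in views:
        h = math.tanh(w_h * h + w_x * x)
    return h

ordered = [0.25, 0.5, 1.0]     # views in their original order
shuffled = [1.0, 0.25, 0.5]    # the same views, permuted

print(mean_fusion(ordered), mean_fusion(shuffled))
print(recurrent_fusion(ordered), recurrent_fusion(shuffled))
```

A model trained with an order-aware fusion on chronologically ordered views may therefore behave differently once the order is permuted at test time, which is the effect the table measures.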

Table 15 (Table F.3): Test-set false positive rates (FPR) at varying percentages of the true positive rate (TPR) on the pediatric appendicitis dataset with the diagnosis as the target variable. Results are reported as averages and standard deviations across ten independent initializations. Bold indicates the best result; italics indicates the second best. Lower FPRs are better.

Model	FPR at % TPR
Model	%	75	80	90	95	99
Random		0.75	0.80	0.90	0.95	0.99
Radiomics + RF		0.55 $\pm$ 0.05	0.63 $\pm$ 0.07	0.85 $\pm$ 0.05	0.94 $\pm$ 0.02	0.97 $\pm$ 0.03
ResNet-18		0.55 $\pm$ 0.18	0.70 $\pm$ 0.11	0.85 $\pm$ 0.13	0.91 $\pm$ 0.11	0.95 $\pm$ 0.08
CBM-seq		0.61 $\pm$ 0.08	0.69 $\pm$ 0.11	0.80 $\pm$ 0.08	0.87 $\pm$ 0.10	0.97 $\pm$ 0.05
CBM-joint		0.67 $\pm$ 0.06	0.73 $\pm$ 0.08	0.84 $\pm$ 0.09	0.94 $\pm$ 0.06	0.95 $\pm$ 0.06
MVBM-avg		0.40 $\pm$ 0.14	0.51 $\pm$ 0.16	0.74 $\pm$ 0.21	0.86 $\pm$ 0.12	0.95 $\pm$ 0.07
MVBM-LSTM		0.45 $\pm$ 0.09	0.53 $\pm$ 0.12	0.74 $\pm$ 0.14	0.85 $\pm$ 0.09	0.94 $\pm$ 0.09
MVCBM-seq-avg		0.58 $\pm$ 0.14	0.61 $\pm$ 0.10	0.77 $\pm$ 0.09	0.89 $\pm$ 0.09	0.93 $\pm$ 0.07
MVCBM-seq-LSTM		0.52 $\pm$ 0.04	0.55 $\pm$ 0.06	0.67 $\pm$ 0.11	0.75 $\pm$ 0.12	0.93 $\pm$ 0.09
MVCBM-joint-avg		0.57 $\pm$ 0.15	0.64 $\pm$ 0.10	0.77 $\pm$ 0.11	0.93 $\pm$ 0.09	0.95 $\pm$ 0.08
MVCBM-joint-LSTM		0.59 $\pm$ 0.10	0.65 $\pm$ 0.09	0.75 $\pm$ 0.11	0.84 $\pm$ 0.08	0.93 $\pm$ 0.08
SSMVCBM-avg		0.31 $\pm$ 0.11	0.40 $\pm$ 0.11	0.65 $\pm$ 0.14	0.78 $\pm$ 0.15	0.88 $\pm$ 0.14
SSMVCBM-LSTM		0.31 $\pm$ 0.16	0.38 $\pm$ 0.15	0.68 $\pm$ 0.11	0.84 $\pm$ 0.10	0.93 $\pm$ 0.07
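One way to read the metric in the table above: pick the strictest decision threshold that still reaches the required TPR and report the FPR there. The sketch below implements that reading by scanning every observed score as a candidate threshold. It is an assumption about the procedure (the excerpt does not say how thresholds were chosen or interpolated), and the function name `fpr_at_tpr` and the toy data are illustrative.

```python
def fpr_at_tpr(labels, scores, target_tpr):
    """Smallest FPR among thresholds whose TPR reaches `target_tpr`.

    An example is predicted positive when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    best = 1.0  # the lowest threshold always gives TPR = FPR = 1
    for threshold in set(scores):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        if tp / n_pos >= target_tpr:
            best = min(best, fp / n_neg)
    return best

labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
fpr_75 = fpr_at_tpr(labels, scores, 0.75)
fpr_100 = fpr_at_tpr(labels, scores, 1.0)
print(fpr_75, fpr_100)
```

In the toy data, catching the last positive (score 0.3) requires a threshold low enough to admit two of the four negatives, which illustrates why the FPR climbs steeply at high TPR levels.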

Table 16 (Table F.4): Models’ Brier scores for concept prediction on the appendicitis dataset across the three target variables. Test-set results are reported as averages and standard deviations across ten independent initializations. Herein, “seq” and “joint” denote sequential and joint optimization, respectively, whereas “avg” and “LSTM” stand for the averaging- and LSTM-based fusion. Results that are significantly lower than the score of the constant prediction of 0.5 (random) are marked by “*”. Bold indicates the best result; italics indicates the second best. The meaning of the concept variables: $c_{1}$, visibility of the appendix; $c_{2}$, free intraperitoneal fluid; $c_{3}$, appendix layer structure; $c_{4}$, target sign; $c_{5}$, surrounding tissue reaction; $c_{6}$, pathological lymph nodes; $c_{7}$, thickening of the bowel wall; $c_{8}$, coprostasis; $c_{9}$, meteorism.

Target	Model	Concept
Target	Model	$c_{1}$	$c_{2}$	$c_{3}$	$c_{4}$	$c_{5}$	$c_{6}$	$c_{7}$	$c_{8}$	$c_{9}$
Diagnosis	Random	0.25	0.25	0.25	0.25	0.25	0.25	0.25	0.25	0.25
	CBM-seq	0.26 $\pm$ 0.02	0.31 $\pm$ 0.02	0.20 $\pm$ 0.02^*	0.21 $\pm$ 0.01^*	0.26 $\pm$ 0.02	0.23 $\pm$ 0.02^*	0.17 $\pm$ 0.02^*	0.14 $\pm$ 0.01^*	0.18 $\pm$ 0.02^*
	CBM-joint	0.30 $\pm$ 0.03	0.42 $\pm$ 0.02	0.18 $\pm$ 0.01^*	0.21 $\pm$ 0.02^*	0.32 $\pm$ 0.04	0.24 $\pm$ 0.03	0.16 $\pm$ 0.01^*	0.12 $\pm$ 0.01^*	0.15 $\pm$ 0.02^*
	MVCBM-seq-avg	0.26 $\pm$ 0.04	0.26 $\pm$ 0.01	0.21 $\pm$ 0.03^*	0.22 $\pm$ 0.02^*	0.28 $\pm$ 0.02	0.24 $\pm$ 0.03	0.23 $\pm$ 0.03	0.28 $\pm$ 0.03	0.26 $\pm$ 0.03
	MVCBM-seq-LSTM	0.17 $\pm$ 0.02^*	0.25 $\pm$ 0.01	0.19 $\pm$ 0.02^*	0.19 $\pm$ 0.01^*	0.25 $\pm$ 0.01	0.26 $\pm$ 0.01	0.21 $\pm$ 0.02^*	0.27 $\pm$ 0.02	0.25 $\pm$ 0.02
	MVCBM-joint-avg	0.29 $\pm$ 0.07	0.32 $\pm$ 0.04	0.22 $\pm$ 0.07	0.25 $\pm$ 0.06	0.32 $\pm$ 0.07	0.22 $\pm$ 0.03	0.18 $\pm$ 0.04^*	0.22 $\pm$ 0.16	0.23 $\pm$ 0.10
	MVCBM-joint-LSTM	0.15 $\pm$ 0.01^*	0.24 $\pm$ 0.00^*	0.22 $\pm$ 0.01^*	0.21 $\pm$ 0.01^*	0.22 $\pm$ 0.01^*	0.26 $\pm$ 0.00	0.22 $\pm$ 0.02^*	0.26 $\pm$ 0.01	0.24 $\pm$ 0.02
	SSMVCBM-avg	0.24 $\pm$ 0.04	0.26 $\pm$ 0.01	0.24 $\pm$ 0.09	0.22 $\pm$ 0.05	0.29 $\pm$ 0.02	0.23 $\pm$ 0.05	0.22 $\pm$ 0.06	0.34 $\pm$ 0.13	0.23 $\pm$ 0.04
	SSMVCBM-LSTM	0.16 $\pm$ 0.03^*	0.25 $\pm$ 0.01	0.22 $\pm$ 0.07	0.20 $\pm$ 0.03^*	0.25 $\pm$ 0.03	0.23 $\pm$ 0.04	0.22 $\pm$ 0.06	0.28 $\pm$ 0.06	0.22 $\pm$ 0.04
Management	Random	0.25	0.25	0.25	0.25	0.25	0.25	0.25	0.25	0.25
	CBM-seq	0.28 $\pm$ 0.03	0.27 $\pm$ 0.02	0.20 $\pm$ 0.03^*	0.26 $\pm$ 0.05	0.25 $\pm$ 0.03	0.25 $\pm$ 0.03	0.21 $\pm$ 0.04	0.18 $\pm$ 0.05^*	0.22 $\pm$ 0.02^*
	CBM-joint	0.33 $\pm$ 0.05	0.39 $\pm$ 0.05	0.17 $\pm$ 0.03^*	0.26 $\pm$ 0.06	0.34 $\pm$ 0.04	0.28 $\pm$ 0.05	0.18 $\pm$ 0.03^*	0.14 $\pm$ 0.03^*	0.20 $\pm$ 0.05
	MVCBM-seq-avg	0.26 $\pm$ 0.04	0.26 $\pm$ 0.01	0.29 $\pm$ 0.06	0.26 $\pm$ 0.03	0.27 $\pm$ 0.01	0.22 $\pm$ 0.03	0.28 $\pm$ 0.04	0.27 $\pm$ 0.05	0.25 $\pm$ 0.03
	MVCBM-seq-LSTM	0.16 $\pm$ 0.02^*	0.25 $\pm$ 0.01	0.26 $\pm$ 0.03	0.24 $\pm$ 0.01	0.23 $\pm$ 0.01^*	0.23 $\pm$ 0.01^*	0.28 $\pm$ 0.02	0.25 $\pm$ 0.01	0.23 $\pm$ 0.02
	MVCBM-joint-avg	0.33 $\pm$ 0.13	0.28 $\pm$ 0.03	0.21 $\pm$ 0.05	0.24 $\pm$ 0.04	0.31 $\pm$ 0.05	0.24 $\pm$ 0.06	0.22 $\pm$ 0.06	0.19 $\pm$ 0.09	0.26 $\pm$ 0.14
	MVCBM-joint-LSTM	0.19 $\pm$ 0.03^*	0.32 $\pm$ 0.04	0.22 $\pm$ 0.07	0.22 $\pm$ 0.03^*	0.28 $\pm$ 0.03	0.23 $\pm$ 0.03	0.21 $\pm$ 0.04	0.16 $\pm$ 0.04^*	0.23 $\pm$ 0.06
	SSMVCBM-avg	0.21 $\pm$ 0.01^*	0.27 $\pm$ 0.02	0.34 $\pm$ 0.10	0.31 $\pm$ 0.05	0.28 $\pm$ 0.02	0.21 $\pm$ 0.02^*	0.26 $\pm$ 0.05	0.29 $\pm$ 0.07	0.19 $\pm$ 0.03^*
	SSMVCBM-LSTM	0.17 $\pm$ 0.02^*	0.26 $\pm$ 0.01	0.26 $\pm$ 0.05	0.24 $\pm$ 0.03	0.23 $\pm$ 0.02	0.22 $\pm$ 0.01^*	0.29 $\pm$ 0.04	0.24 $\pm$ 0.02	0.22 $\pm$ 0.03
Severity	Random	0.25	0.25	0.25	0.25	0.25	0.25	0.25	0.25	0.25
	CBM-seq	0.30 $\pm$ 0.04	0.26 $\pm$ 0.02	0.23 $\pm$ 0.02	0.26 $\pm$ 0.05	0.27 $\pm$ 0.03	0.23 $\pm$ 0.03	0.23 $\pm$ 0.03	0.21 $\pm$ 0.02^*	0.23 $\pm$ 0.04
	CBM-joint	0.29 $\pm$ 0.05	0.40 $\pm$ 0.03	0.18 $\pm$ 0.04^*	0.27 $\pm$ 0.08	0.33 $\pm$ 0.05	0.30 $\pm$ 0.05	0.20 $\pm$ 0.04^*	0.15 $\pm$ 0.04^*	0.18 $\pm$ 0.04^*
	MVCBM-seq-avg	0.29 $\pm$ 0.04	0.26 $\pm$ 0.01	0.27 $\pm$ 0.07	0.27 $\pm$ 0.04	0.28 $\pm$ 0.02	0.22 $\pm$ 0.03	0.30 $\pm$ 0.08	0.28 $\pm$ 0.03	0.25 $\pm$ 0.03
	MVCBM-seq-LSTM	0.19 $\pm$ 0.02^*	0.26 $\pm$ 0.01	0.24 $\pm$ 0.04	0.24 $\pm$ 0.02	0.25 $\pm$ 0.01	0.23 $\pm$ 0.02	0.27 $\pm$ 0.03	0.26 $\pm$ 0.02	0.25 $\pm$ 0.02
	MVCBM-joint-avg	0.29 $\pm$ 0.03	0.28 $\pm$ 0.02	0.19 $\pm$ 0.05^*	0.23 $\pm$ 0.04	0.29 $\pm$ 0.03	0.24 $\pm$ 0.03	0.19 $\pm$ 0.05^*	0.17 $\pm$ 0.04^*	0.26 $\pm$ 0.05
	MVCBM-joint-LSTM	0.18 $\pm$ 0.02^*	0.28 $\pm$ 0.02	0.18 $\pm$ 0.03^*	0.21 $\pm$ 0.03	0.26 $\pm$ 0.03	0.25 $\pm$ 0.04	0.22 $\pm$ 0.04	0.20 $\pm$ 0.04	0.22 $\pm$ 0.03
	SSMVCBM-avg	0.29 $\pm$ 0.06	0.28 $\pm$ 0.03	0.34 $\pm$ 0.15	0.32 $\pm$ 0.10	0.31 $\pm$ 0.06	0.22 $\pm$ 0.02^*	0.33 $\pm$ 0.16	0.30 $\pm$ 0.12	0.22 $\pm$ 0.05
	SSMVCBM-LSTM	0.23 $\pm$ 0.09	0.26 $\pm$ 0.01	0.25 $\pm$ 0.06	0.26 $\pm$ 0.04	0.24 $\pm$ 0.02	0.22 $\pm$ 0.03	0.33 $\pm$ 0.12	0.26 $\pm$ 0.03	0.26 $\pm$ 0.05

Equations

$\boldsymbol{h}_{i}^{v}$

$\boldsymbol{\bar{h}}_{i}$

$\boldsymbol{\hat{c}}_{i}$

$\hat{y}_{i}$

$\boldsymbol{\hat{\phi}}=\arg\min_{\boldsymbol{\phi}}\sum_{i=1}^{N}\sum_{k=1}^{K}w_{i}^{t}w_{i}^{c_{k}}\mathcal{L}^{c_{k}}\left(\hat{c}_{i,k},c_{i,k}\right),$

$\boldsymbol{\hat{\theta}}=\arg\min_{\boldsymbol{\theta}}\sum_{i=1}^{N}w_{i}^{t}\mathcal{L}^{t}\left(f_{\boldsymbol{\theta}}\left(\boldsymbol{\hat{c}}_{i}\right),y_{i}\right),$

$\boldsymbol{\hat{\phi}},\boldsymbol{\hat{\theta}}=\;\ldots\;\alpha\sum_{i=1}^{N}\sum_{k=1}^{K}w^{t}_{i}w_{i}^{c_{k}}\mathcal{L}^{c_{k}}\left(\hat{c}_{i,k},c_{i,k}\right)\bigg\},$

$\hat{y}_{i}^{\mathcal{S}}=f_{\boldsymbol{\hat{\theta}}}\left(\boldsymbol{\hat{c}}_{\left\{1,...,K\right\}\setminus\mathcal{S}},\boldsymbol{c}_{\mathcal{S}}\right),$

$\boldsymbol{h}_{i}^{\boldsymbol{c},v}$

$\boldsymbol{h}_{i}^{\boldsymbol{z},v}$

$\boldsymbol{\bar{h}}_{i}^{\boldsymbol{c}}$

$\boldsymbol{\bar{h}}_{i}^{\boldsymbol{z}}$

$\boldsymbol{\hat{c}}_{i}$

$\boldsymbol{\hat{z}}_{i}$

$\hat{y}_{i}$

$\boldsymbol{\hat{\phi}}^{\boldsymbol{z}},\boldsymbol{\tilde{\theta}}=\;\ldots\;\lambda\sum_{i=1}^{N}\sum_{k=1}^{K}w_{i}^{c_{k}}\mathcal{L}^{c_{k}}\left(\left[\boldsymbol{a}_{\boldsymbol{\tau}}\left(\boldsymbol{\hat{z}}_{i}\right)\right]_{k},\hat{c}_{i,k}\right),$

Code & Models

Repositories

i6092467/semi-supervised-multiview-cbm
PyTorch · Official

Full text

✉ Co-corresponding authors: [email protected], [email protected]

Author note 1: These authors have contributed equally to this work and share first authorship.

Author note 2: These authors have contributed equally to this work and share last authorship.

Data availability: Data are available on Zenodo; the code is available on GitHub.

Published version: Marcinkevičs, R., Reis Wolfertstetter, P., Klimiene, U., Chin-Cheong, K., Paschke, A., Zerres, J., … Vogt, J. E. (2024). Interpretable and intervenable ultrasonography-based machine learning models for pediatric appendicitis. Medical Image Analysis, 91, 103042. doi:10.1016/j.media.2023.103042

Lead authors: Marcinkevičs, Reis Wolfertstetter, & Klimiene

Interpretable and intervenable ultrasonography-based machine learning models for pediatric appendicitis

Ričards Marcinkevičs

Patricia Reis Wolfertstetter

Ugne Klimiene

Kieran Chin-Cheong

Alyssia Paschke

Julia Zerres

Markus Denzinger

David Niederberger

Sven Wellmann

Division of Neonatology, Hospital St. Hedwig of the Order of St. John of God, University Children’s Hospital Regensburg (KUNO)

Ece Ozkan

Christian Knorr

Julia E. Vogt

0000-0002-9998-0058

Abstract

Appendicitis is among the most frequent reasons for pediatric abdominal surgeries. Previous decision support systems for appendicitis have focused on clinical, laboratory, scoring, and computed tomography data and have ignored abdominal ultrasound, despite its noninvasive nature and widespread availability. In this work, we present interpretable machine learning models for predicting the diagnosis, management and severity of suspected appendicitis using ultrasound images. Our approach utilizes concept bottleneck models (CBM) that facilitate interpretation and interaction with high-level concepts understandable to clinicians. Furthermore, we extend CBMs to prediction problems with multiple views and incomplete concept sets. Our models were trained on a dataset comprising 579 pediatric patients with 1709 ultrasound images accompanied by clinical and laboratory data. Results show that our proposed method enables clinicians to utilize a human-understandable and intervenable predictive model without compromising performance or requiring time-consuming image annotation when deployed. For predicting the diagnosis, the extended multiview CBM attained an AUROC of 0.80 and an AUPR of 0.92, performing comparably to similar black-box neural networks trained and tested on the same dataset.

1 Introduction

Appendicitis is one of the most frequent causes of abdominal pain resulting in hospital admissions of patients under 18 [130]. The diagnosis can be challenging and relies on a combination of clinical, laboratory and imaging parameters [118]. Despite extensive research, no specific and practically useful biomarkers for the early detection of appendicitis have been identified [69, 96]. Epidemiologically and clinically, there are two forms of appendicitis: uncomplicated (subacute/exudative, phlegmonous) and complicated (gangrenous, perforated) [73, 76, 96]. Management forms include surgery as the standard method [118, 89] or conservative therapy [73, 123, 124, 89, 79].

Typical imaging modalities for suspected pediatric appendicitis include ultrasonography (US), magnetic resonance imaging (MRI), and computed tomography (CT). US has become the primary choice due to widespread availability, lack of radiation, and improvements in resolution over the past years [109]. Repeated US examinations, including B(rightness)-mode and Doppler, during the observation phase can improve diagnostic accuracy and help identify disease progression [83, 108, 89].

Extensive research has been conducted on utilizing machine learning (ML) models to diagnose and manage patients with suspected appendicitis [93, 82, 114, 74, 71, 121, 112, 103, 115, 131]. In brief, most models either utilize simple clinical and laboratory data [93, 74, 71, 131], rely on hand-crafted US annotations [114, 121, 103, 115], or require more expensive and invasive imaging modalities, such as CT [112]. Despite having lower sensitivity and specificity than CT, US has been advocated as the preferred imaging modality for diagnosing acute appendicitis due to the absence of ionizing radiation and cost-effectiveness [107]. Although promising and practical, fully automated analysis of abdominal US images in this context remains an under-explored approach.

US imaging gives natural rise to multiview and multimodal data [129, 111]. For instance, the risk of breast cancer may be assessed based on multiview and multimodal US images of lesions. More generally, multiview learning [133] concerns itself with the data comprising multiple views, essentially feature subsets, of the same source object. Additionally, multimodal learning [75] studies models combining, or fusing, multiple heterogeneous modalities, e.g. images and text. Both research directions have experienced renewed interest in the light of contrastive and self-supervised learning [126, 128] and generative modeling [122].

Interpretable machine learning has emerged as an active research direction [84, 116], with interpretability argued to be an essential model design principle for high-stakes application domains, such as healthcare. One recently re-explored approach is prediction based on high-level and human-understandable concepts [99, 100, 97] or attributes. Most frameworks for concept-based prediction require auxiliary supervision in the form of high-level semantic features during training. Typically, two models are trained, as, for instance, in concept bottleneck models (CBM) [97]: (i) one mapping from the explanatory variables to the given concepts and (ii) another predicting the target variable based on the previously predicted concept values. Such concept-based models are deemed interpretable since concepts can be inspected alongside the final model outputs and perceived as “explanations”. Additionally, as opposed to classical multitask learning, a human user can intervene and interact with the model at test time by editing concept predictions and affecting downstream output. Beyond the restricted supervised setting mentioned earlier, there have been several efforts to learn semantically meaningful and identifiable representations when the concepts are not given explicitly [94, 125].

This work presents the first effort at leveraging ML to predict diagnosis, management, and severity in pediatric patients with suspected appendicitis directly from abdominal US images, an imaging modality frequently used in daily clinical practice. To this end, our models utilize an interpretable concept-based classification approach due to its potential acceptance among clinicians, and we investigate the trade-off between interpretability and predictive performance. Furthermore, we propose extensions of the concept bottleneck models [97] to improve their scalability to real-world medical imaging data, contributing to the recent works identifying and addressing the limitations of concept-based models [102, 105, 119, 104]. Specifically, we (i) extend conventional CBMs to the multiview classification setting and (ii) propose a semi-supervised representation learning approach to overcome the limitations of incomplete concept sets, i.e. when the given set of concepts does not capture the entire predictive relationship between the images and labels, making it challenging to achieve high predictive performance. The presented generalization of the CBMs to multiple views and incomplete concept sets is summarized in Figure 1. It is not restricted to the considered use case of pediatric appendicitis and ultrasound and can be applied to other multiview and multimodal medical imaging datasets.

2 Materials and Methods

2.1 Dataset

In our retrospective analysis, we examined data from a cohort of 579 children and adolescents (aged 0–18 years) admitted as inpatients to the Department of Pediatric Surgery and Pediatric Orthopedics at the tertiary Children’s Hospital St. Hedwig in Regensburg, Germany between January 1, 2016, and December 31, 2021, with suspected appendicitis. Our study builds and expands upon the previous analysis of a smaller cohort of patients, published by [103].

We utilized the hospital’s database to collect retrospective data, including (potentially) multiple abdominal B-mode ultrasound images for each patient (totaling 1709 images). The number of views per subject ranges from 1 to 15; the images depict various regions of interest, such as the abdomen’s right lower quadrant (RLQ), appendix, intestines, lymph nodes, and reproductive organs (Figure 2). Ultrasound images from admission and, if available, initial clinical course were retrieved using the software Clinic WinData/E&L. For surgical patients, US images from the preoperative clinical course were also included. The images were acquired on Toshiba Xario and Aplio XG machines using Toshiba 6 MHz Convex and 12 MHz Linear transducers. For each subject, all images relevant to the findings from Table 2 were included. Images of the organs unrelated to appendicitis, such as the liver or spleen, were excluded from the dataset. We also retrieved information encompassing laboratory tests, physical examination results, and clinical scores, such as the Alvarado (AS) and pediatric appendicitis (PAS) scores [72, 117, 81]. AS and PAS were utilized due to their widespread use by pediatricians and pediatric surgeons for the risk stratification of children and adolescents with abdominal pain [83]. Finally, we collected expert-produced ultrasonographic findings represented by categorically valued features. A subset of the latter was identified as high-level concepts relevant to decision support (Table 2). For patients treated operatively, surgical and histological parameters were recorded.

The subjects were labeled w.r.t. three target variables: (i) diagnosis (appendicitis vs. no appendicitis), (ii) management (surgical vs. conservative), and (iii) severity (complicated vs. uncomplicated or no appendicitis). The diagnosis was confirmed histologically in the patients who underwent appendectomy. Subjects treated conservatively were labeled as having appendicitis if their appendix diameter was at least 6 mm and either AS or PAS were at least 4. Note that the labeling criterion above is only a proxy for the ground-truth disease status. AS and PAS help exclude children with no appendicitis [81], whereas the addition of the US information on the enlarged appendix has been shown to increase the positive predictive value [87, 83]. This labeling criterion has already been utilized in the previous analyses of the data from an overlapping patient cohort [103, 115]. [103] present a more detailed exploration to justify it. The management label reflects the decision made by a senior pediatric surgeon based on clinical, laboratory and US data. For the severity, complicated appendicitis includes cases with abscess formation, gangrene, or perforation.
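The proxy labeling rule for conservatively treated subjects can be written as a small predicate. The sketch below only restates the criterion from the text (appendix diameter of at least 6 mm and an AS or PAS of at least 4); the function and argument names are hypothetical, and handling of missing measurements is not covered.

```python
def label_conservative_case(appendix_diameter_mm, alvarado_score, pas_score):
    """Proxy diagnosis label for a subject who was treated conservatively.

    Returns True (appendicitis) if the appendix is enlarged (>= 6 mm)
    and at least one of the two clinical scores is >= 4.
    """
    enlarged = appendix_diameter_mm >= 6.0
    elevated_score = alvarado_score >= 4 or pas_score >= 4
    return enlarged and elevated_score


print(label_conservative_case(7.5, 5, 3))  # True: enlarged appendix, AS >= 4
print(label_conservative_case(5.0, 8, 8))  # False: diameter below 6 mm
```

As the text notes, this rule is only a proxy for the ground-truth disease status; histologically confirmed cases do not go through it.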

Note that the analysis below utilizes only ultrasound images and findings extracted from them. Our goal was to explore US image analysis and its benefits for predictive models for pediatric appendicitis. Nevertheless, we publicize the entire dataset, including modalities other than imaging. Tables 1 and A.1 provide an overview of the dataset used in the final analysis. Appendix A contains a more comprehensive description of the dataset and its acquisition.

2.1.1 Data Preprocessing

Prior to model development and evaluation, pre-processing was performed on B-mode ultrasound images to eliminate undesired variability. The study being retrospective, ultrasonograms were collected as per clinical routine, and therefore, original images contained graphical user interface elements, markers, distance measurements, and other annotations. We employed a generative inpainting model, DeepFill [135], to mask and fill such objects. Subsequently, images were resized to 400 $\times$ 400 px using zero padding when needed. Finally, contrast-limited adaptive histogram equalization (CLAHE) was applied, and pixel intensities were normalized to the range between $0$ and $1$ . Figure 2 shows an example of the multiple US views acquired from a single subject from our cohort before and after preprocessing.
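Two of the simpler preprocessing steps, zero padding to a square shape and intensity normalization, can be sketched in plain Python. This is an illustration only: inpainting, interpolation to the final resolution, and CLAHE are omitted, centering the image inside the padding is an assumption, and min-max scaling is one plausible reading of the normalization step. The function names are hypothetical.

```python
def pad_to_square(image, fill=0):
    """Zero-pad a 2D image (list of rows) to a square; centering is an assumption."""
    height, width = len(image), len(image[0])
    side = max(height, width)
    top = (side - height) // 2
    left = (side - width) // 2
    padded = [[fill] * side for _ in range(side)]
    for r in range(height):
        for c in range(width):
            padded[top + r][left + c] = image[r][c]
    return padded


def normalize_intensities(image):
    """Min-max scale pixel intensities so that they end at 1 at the top of the range."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    scale = (hi - lo) or 1  # avoid division by zero for constant images
    return [[(p - lo) / scale for p in row] for row in image]


raw = [[10, 20, 30], [40, 50, 60]]        # toy 2 x 3 "image"
padded = pad_to_square(raw)               # becomes 3 x 3 with a row of zeros
normalized = normalize_intensities(padded)
```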

2.2 Problem Setting and Notation

Throughout the remaining sections, we will assume the following setting and notation. Consider a dataset comprising $N$ triples $\left(\left\{\boldsymbol{x}_{i}^{v}\right\}_{v=1}^{V_{i}},\,\boldsymbol{c}_{i},\,y_{i}\right)$ , for $1\leq i\leq N$ , with view sequences $\left\{\boldsymbol{x}_{i}^{v}\right\}_{v=1}^{V_{i}}$ , concept vectors $\boldsymbol{c}_{i}\in\mathbb{R}^{K}$ provided at training time, and labels $y_{i}$ . Note that the number of views $V_{i}\geq 1$ may vary across data points $1\leq i\leq N$ . We will concentrate on the scenario where all views can be preprocessed and rescaled into the same dimensionality. Nevertheless, our approach can be extended to heterogeneous data types.

Motivated by medical imaging applications, we focus on the data exhibiting characteristics described informally below. (i) Partial observability: not all concepts are identifiable from all views. (ii) View homogeneity: most views contain a considerable amount of shared information and are visually similar. (iii) View ordering: views belonging to the same data point may be loosely ordered, e.g. spatially, temporally, or based on their importance for predicting the label. These properties are inspired by the multiview ultrasound dataset explored in our experiments and support some design choices described below.

2.3 Multiview Concept Bottleneck Models

Below, we present a novel approach that extends the concept bottleneck models [97] to the multiview classification scenario. We refer to this extension as the multiview concept bottleneck model (MVCBM) hereon. A schematic overview of the MVCBM architecture is shown in Figure 1, while the model’s forward pass is specified by Eqs. (1a)–(1d). In brief, MVCBM comprises four modules: (i) per-view feature extraction; (ii) feature fusion; (iii) concept prediction, and (iv) label prediction.

To address scenarios where the given set of concepts is incomplete, i.e. insufficient, either due to the lack of domain knowledge or the cost of acquiring additional annotation, we have also developed a semi-supervised variant of the MVCBM, referred to as semi-supervised MVCBM (SSMVCBM). This approach not only utilizes the given concepts but also learns an independent representation predictive of the label. This extension is described in Section 2.4.

For data point $1\leq i\leq N$ , a forward pass of the multiview concept bottleneck is given by the following equations:

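$$\begin{aligned}\boldsymbol{h}_{i}^{v}&=\boldsymbol{h}_{\boldsymbol{\psi}}\left(\boldsymbol{x}_{i}^{v}\right),\quad 1\leq v\leq V_{i},&\text{(1a)}\\ \boldsymbol{\bar{h}}_{i}&=\boldsymbol{r}_{\boldsymbol{\xi}}\left(\left\{\boldsymbol{h}_{i}^{v}\right\}_{v=1}^{V_{i}}\right),&\text{(1b)}\\ \boldsymbol{\hat{c}}_{i}&=\boldsymbol{s}_{\boldsymbol{\zeta}}\left(\boldsymbol{\bar{h}}_{i}\right),&\text{(1c)}\\ \hat{y}_{i}&=f_{\boldsymbol{\theta}}\left(\boldsymbol{\hat{c}}_{i}\right),&\text{(1d)}\end{aligned}$$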

where Latin letters correspond to functions and variables and Greek letters denote learnable parameters. Observe that parameters $\boldsymbol{\phi}=\left\{\boldsymbol{\psi},\boldsymbol{\xi},\boldsymbol{\zeta}\right\}$ define the concept model $\boldsymbol{g}_{\boldsymbol{\phi}}(\cdot)$ mapping a multiview feature sequence to the predicted concept values; whereas $f_{\boldsymbol{\theta}}(\cdot)$ is the target model, linking concepts and labels. Thus, similar to the vanilla concept bottleneck, MVCBM’s forward pass can be rewritten as $\hat{y}_{i}=f_{\boldsymbol{\theta}}\left(\boldsymbol{g}_{\boldsymbol{\phi}}\left(\left\{\boldsymbol{h}^{v}_{i}\right\}_{v=1}^{V_{i}}\right)\right)$ . In the following paragraphs, we detail each of the steps in Eq. (1).

Feature extraction

Given an ordered view sequence $\left\{\boldsymbol{x}_{i}^{v}\right\}_{v=1}^{V_{i}}$ , we first encode each view into a lower-dimensional representation, as in Eq. (1a). We employ a shared encoder neural network, denoted by $\boldsymbol{h}_{\boldsymbol{\psi}}(\cdot)$ . Weight sharing is justified by the view homogeneity and could be helpful in smaller datasets with high missingness of views. On the other hand, in multimodal datasets, the dissimilarities between images acquired from the same subject are significant and consistent. In this scenario, it may be preferable to train a dedicated encoder for each modality to learn modality-specific features. In practice, it may be prudent to use a pretrained model to initialize $\boldsymbol{h}_{\boldsymbol{\psi}}(\cdot)$ , e.g. the use of ResNet and VGG architectures pretrained on natural images is standard for medical imaging applications [78]. As a result, we obtain a sequence of view-specific features.

Feature fusion

To accommodate multiple views, we need to fuse, i.e. aggregate, the view-specific features within the model, as in Eq. (1b). MVCBM follows a hybrid fusion approach [75]: rather than concatenating views at the input level (early fusion) or training an ensemble of view-specific models (late fusion); we aggregate intermediate view-specific features $\boldsymbol{h}^{v}_{i}$ from the previous step within a single neural network. Although there are many viable fusion functions, in our context, the fusion must handle varying numbers of views per data point. As a naive approach, we consider arithmetic mean across the views $\boldsymbol{\bar{h}}_{i}=\frac{1}{V_{i}}\sum_{v=1}^{V_{i}}\boldsymbol{h}^{v}_{i}$ [90].

More generally, in Eq. (1b) $\boldsymbol{\bar{h}}_{i}$ denotes the fused feature vector and $\boldsymbol{r}_{\boldsymbol{\xi}}(\cdot)$ is the fusion function with parameters $\boldsymbol{\xi}$ . Considering partial observability of the concepts and ordering of the views, we, in addition, investigate aggregation via a learnable function. Similar to [101], who utilize this trick in multiview 3D shape recognition, we combine view-specific representations via a long short-term memory (LSTM) network. In particular, we set the aggregated representation $\boldsymbol{\bar{h}}_{i}$ to the last hidden state of the view sequence, i.e. at step $V_{i}$ . Note that both averaging and LSTM can handle varying numbers of views. Nevertheless, there are other options for $\boldsymbol{r}_{\boldsymbol{\xi}}(\cdot)$ , e.g. Hadamard product or weighted average, the investigation of which we leave for future work.
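The practical difference between the two fusion options can be seen on a toy example. The sketch below implements mean fusion as described and, as a stand-in for the LSTM, a much simpler elementwise recurrence that returns its state after the last view; this recurrence is not the authors' LSTM and its weights are hypothetical. It only illustrates that averaging ignores view order, whereas a last-hidden-state aggregator does not.

```python
import math


def mean_fusion(view_features):
    """Arithmetic mean of the view-specific feature vectors h_i^v."""
    num_views = len(view_features)
    return [sum(column) / num_views for column in zip(*view_features)]


def recurrent_fusion(view_features, w_state=0.5, w_input=1.0):
    """Toy elementwise recurrence (NOT an LSTM); returns the state at step V_i."""
    state = [0.0] * len(view_features[0])
    for h in view_features:
        state = [math.tanh(w_state * s + w_input * x) for s, x in zip(state, h)]
    return state


views = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]   # three views, two features each
reversed_views = views[::-1]

print(mean_fusion(views))            # [3.0, 2.0]
print(mean_fusion(reversed_views))   # [3.0, 2.0], order does not matter
print(recurrent_fusion(views) == recurrent_fusion(reversed_views))  # False
```

Both functions accept any number of views, which is the requirement stated above for the fusion function.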

Concept and label prediction

The last two steps in Eqs. (1c)–(1d) are similar to the vanilla concept bottleneck. First, we predict concepts $\boldsymbol{\hat{c}}_{i}$ based on the fused representation $\boldsymbol{\bar{h}}_{i}$ , using a concept encoder network $\boldsymbol{s}_{\boldsymbol{\zeta}}(\cdot)$ parameterized by $\boldsymbol{\zeta}$ . Note that the choice of activation functions at the output of $\boldsymbol{s}_{\boldsymbol{\zeta}}(\cdot)$ depends on the type of concepts and should be adapted to whether an individual concept is categorically or continuously valued. The vector $\hat{\boldsymbol{c}}_{i}$ is then used as an input to the target model $f_{\boldsymbol{\theta}}(\cdot)$ , predicting the label $\hat{y}_{i}$ . The output activation should be chosen based on the downstream task, which can be, for example, classification or regression.
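The four steps of the forward pass can be traced end to end with a toy model. The following is a minimal sketch, not the authors' implementation: every module is a single linear layer with hand-picked hypothetical weights, the encoder uses a ReLU, fusion is the arithmetic mean, and both the concepts and the label are assumed to be binary (hence the sigmoids).

```python
import math


def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def mvcbm_forward(views, params):
    # (1a) the same encoder (shared weights) is applied to every view
    features = [
        [max(0.0, a) for a in linear(params["W_enc"], params["b_enc"], x)]
        for x in views
    ]
    # (1b) fusion: arithmetic mean over however many views there are
    fused = [sum(column) / len(features) for column in zip(*features)]
    # (1c) concept prediction; sigmoid because the toy concepts are binary
    concepts = [sigmoid(a) for a in linear(params["W_con"], params["b_con"], fused)]
    # (1d) the label is predicted from the concepts alone (the bottleneck)
    y_hat = sigmoid(linear(params["W_tgt"], params["b_tgt"], concepts)[0])
    return concepts, y_hat


# Hypothetical weights: 3 input features, 2 hidden features, K = 2 concepts.
params = {
    "W_enc": [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], "b_enc": [0.0, 0.1],
    "W_con": [[1.0, -1.0], [0.5, 0.5]], "b_con": [0.0, -0.2],
    "W_tgt": [[2.0, -1.0]], "b_tgt": [0.0],
}
one_view = [[1.0, 0.0, 2.0]]
three_views = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
concepts_1, y_1 = mvcbm_forward(one_view, params)
concepts_3, y_3 = mvcbm_forward(three_views, params)
```

The same function handles one view or three, and the label never sees the image features directly, only the predicted concepts.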

Loss function and optimization

The parameters of vanilla CBMs can be optimized using independent, sequential and joint procedures [97]. In this work, we focus on the sequential and joint approaches since they offer a more balanced trade-off between predictive performance and intervenability, as shown experimentally by [97].

In the sequential training, we first optimize the concept model parameters:

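$$\boldsymbol{\hat{\phi}}=\arg\min_{\boldsymbol{\phi}}\sum_{i=1}^{N}\sum_{k=1}^{K}w_{i}^{t}w_{i}^{c_{k}}\mathcal{L}^{c_{k}}\left(\hat{c}_{i,k},c_{i,k}\right),\qquad\text{(2)}$$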

where $\mathcal{L}^{c_{k}}(\cdot,\,\cdot)$ is the loss function for the $k$ -th concept, e.g. one could use the cross-entropy for categorically valued and squared error for a continuously valued concept, and $c_{i,k}$ refers to the value of the $k$ -th concept for the $i$ -th data point.

Additionally, to address potential imbalances in the concept distributions and sparsity of specific concept-target combinations, we have introduced weights $w_{i}^{c_{k}}$ for the $k$ -th concept and $w_{i}^{t}$ for the target variable of the $i$ -th point, s.t. $\sum_{i=1}^{N}\sum_{k=1}^{K}w_{i}^{c_{k}}=1$ and $\sum_{i=1}^{N}w_{i}^{t}=1$ . In practice, these weights can be set to the normalized inverse counts of samples in the corresponding variable classes, i.e. $w^{t}_{i}\propto 1/\sum_{j=1}^{N}\boldsymbol{1}_{\left\{y_{j}=y_{i}\right\}}$ and $w^{c_{k}}_{i}\propto 1/\sum_{j=1}^{N}\boldsymbol{1}_{\left\{c_{j,k}=c_{i,k}\right\}}$ , where $\boldsymbol{1}_{\{\cdot\}}$ is the indicator function. However, other sample weighting schemes are viable.
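The inverse-count weighting can be sketched as follows. The target weights follow the formula in the text directly. For the concept weights, the text only requires that they sum to one over all data points and concepts; dividing each per-concept weight vector by $K$ is one way to satisfy this and is an assumption here. Function names are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(values):
    """w_i proportional to 1 / (number of samples sharing the value of sample i)."""
    counts = Counter(values)
    raw = [1.0 / counts[v] for v in values]
    total = sum(raw)
    return [r / total for r in raw]


def concept_weights(concept_matrix):
    """Weights w_i^{c_k} normalized so that they sum to 1 over all i and k."""
    num_points = len(concept_matrix)
    num_concepts = len(concept_matrix[0])
    per_concept = [
        inverse_frequency_weights([row[k] for row in concept_matrix])
        for k in range(num_concepts)
    ]
    # Dividing by K is an assumed way of meeting the joint normalization.
    return [
        [per_concept[k][i] / num_concepts for k in range(num_concepts)]
        for i in range(num_points)
    ]


labels = [1, 1, 1, 0]
w_t = inverse_frequency_weights(labels)   # the single negative gets weight 0.5

concepts = [[1, 0], [1, 0], [0, 0], [1, 1]]
w_c = concept_weights(concepts)
```

With these weights each class of a variable contributes equally to the loss in total, regardless of how many samples it contains.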

Next, parameters $\boldsymbol{\hat{\phi}}$ are frozen, and the parameters of the target model $f_{\boldsymbol{\theta}}$ are optimized:

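$$\boldsymbol{\hat{\theta}}=\arg\min_{\boldsymbol{\theta}}\sum_{i=1}^{N}w_{i}^{t}\mathcal{L}^{t}\left(f_{\boldsymbol{\theta}}\left(\boldsymbol{\hat{c}}_{i}\right),y_{i}\right),\qquad\text{(3)}$$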

where $\mathcal{L}^{t}(\cdot,\,\cdot)$ is the loss function for the target task, and $\boldsymbol{\hat{c}}_{i}$ are predictions made by the frozen concept model $\boldsymbol{g}_{\boldsymbol{\hat{\phi}}}(\cdot)$ .

For the joint training, we combine the loss functions from Eqs. (2) and (3) into a single objective:

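$$\boldsymbol{\hat{\phi}},\boldsymbol{\hat{\theta}}=\arg\min_{\boldsymbol{\phi},\boldsymbol{\theta}}\bigg\{\sum_{i=1}^{N}w_{i}^{t}\mathcal{L}^{t}\left(f_{\boldsymbol{\theta}}\left(\boldsymbol{\hat{c}}_{i}\right),y_{i}\right)+\alpha\sum_{i=1}^{N}\sum_{k=1}^{K}w^{t}_{i}w_{i}^{c_{k}}\mathcal{L}^{c_{k}}\left(\hat{c}_{i,k},c_{i,k}\right)\bigg\},\qquad\text{(4)}$$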

where $\alpha>0$ controls the trade-off between target and concept predictive performance. Observe that parameters $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$ are optimized simultaneously.
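To make the role of $\alpha$ concrete, the value of the joint objective can be computed for a handful of predictions. This is a sketch under assumptions: binary cross-entropy is used for both the target and the concepts, the concept term carries the product of target and concept weights as in the sequential concept loss, and all numbers are made up for illustration.

```python
import math


def bce(p, target, eps=1e-12):
    """Binary cross-entropy for one predicted probability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def joint_loss(y_hat, y, c_hat, c, w_t, w_c, alpha):
    """Weighted target loss plus alpha times the weighted concept loss."""
    n = len(y)
    target_term = sum(w_t[i] * bce(y_hat[i], y[i]) for i in range(n))
    concept_term = sum(
        w_t[i] * w_c[i][k] * bce(c_hat[i][k], c[i][k])
        for i in range(n)
        for k in range(len(c[i]))
    )
    return target_term + alpha * concept_term


y_hat, y = [0.9, 0.2], [1, 0]
c_hat = [[0.8, 0.3], [0.4, 0.6]]
c = [[1, 0], [0, 1]]
w_t = [0.5, 0.5]
w_c = [[0.25, 0.25], [0.25, 0.25]]

loss_target_only = joint_loss(y_hat, y, c_hat, c, w_t, w_c, alpha=0.0)
loss_balanced = joint_loss(y_hat, y, c_hat, c, w_t, w_c, alpha=1.0)
```

Larger values of `alpha` put more emphasis on matching the annotated concepts, smaller values on the target prediction.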

Intervenability

A salient difference between CBMs and multitask models is that a practitioner utilizing a CBM can interact with it by intervening on concept predictions, e.g. “correcting” the model by setting the predicted values to the ground truth $\hat{c}_{i,k}:=c_{i,k}$ . In particular, for a data point $1\leq i\leq N$ , the updated prediction after the intervention on the concepts from a subset $\mathcal{S}\subseteq\left\{1,...,K\right\}$ is given by

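$$\hat{y}_{i}^{\mathcal{S}}=f_{\boldsymbol{\hat{\theta}}}\left(\boldsymbol{\hat{c}}_{\left\{1,...,K\right\}\setminus\mathcal{S}},\boldsymbol{c}_{\mathcal{S}}\right),\qquad\text{(5)}$$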

where $\boldsymbol{\hat{c}}$ and $\boldsymbol{c}$ refer to the predicted and ground truth concept vectors, respectively. Note the notation abuse in the order of the arguments in $f_{\boldsymbol{\hat{\theta}}}(\cdot)$ .
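An intervention amounts to overwriting some entries of the predicted concept vector and re-running the target model. In the sketch below, the target model is a hypothetical logistic regression with made-up weights; only the `intervene` step mirrors the definition above.

```python
import math


def intervene(c_hat, c_true, subset):
    """Replace predicted concepts whose index is in `subset` by the ground truth."""
    return [c_true[k] if k in subset else c_hat[k] for k in range(len(c_hat))]


def toy_target_model(concepts, weights=(2.0, -1.5, 1.0), bias=-0.5):
    """Hypothetical stand-in for the target model: logistic regression on concepts."""
    logit = bias + sum(w * c for w, c in zip(weights, concepts))
    return 1.0 / (1.0 + math.exp(-logit))


c_hat = [0.2, 0.9, 0.4]    # predicted concept probabilities
c_true = [1, 0, 1]         # ground-truth concept values

before = toy_target_model(c_hat)
corrected = intervene(c_hat, c_true, subset={0, 1})   # [1, 0, 0.4]
after = toy_target_model(corrected)
```

Here correcting the first two concepts raises the predicted probability, because both corrections push the logit of this toy target model upward.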

2.4 Semi-supervised Multiview Concept Bottleneck Models

As previously stated, the set of $K$ concepts given at training time may prove incomplete, owing to factors such as the high cost of annotation, the lack of knowledge, or ethical concerns regarding the measurement of certain variables. More formally, concept bottlenecks implicitly assume that concepts are a sufficient statistic for the target variable [134]; in other words, $\boldsymbol{x}\perp\!\!\!\perp y\,|\,\boldsymbol{c}$ . A situation where $\boldsymbol{x}\not{\perp\!\!\!\perp}y\,|\,\boldsymbol{c}$ may occur when some ground-truth concept variables are systematically missing in the acquired dataset, i.e. unobserved for all data points. Figure 3 depicts two data-generating mechanisms that may lead to the scenario described above. When this is the case, the predictive performance of the CBM is limited since the model relies solely on the predefined set of concepts, which is insufficient. To address this limitation, we propose a semi-supervised variant of the MVCBM (Figure 1) that additionally learns representations complementary to the concepts and relevant to the downstream prediction task.

Next to the feature extraction and concept prediction, SSMVCBM includes an unsupervised module mapping views $\left\{\boldsymbol{x}_{i}^{v}\right\}^{V_{i}}_{v=1}$ to the representation $\boldsymbol{\hat{z}}_{i}\in\mathbb{R}^{J}$ . To predict the label, $\boldsymbol{\hat{c}}_{i}$ and $\boldsymbol{\hat{z}}_{i}$ are concatenated and fed into the target model. This variant of the model is semi-supervised in that the label is predicted based on both $\boldsymbol{\hat{c}}_{i}$ and $\boldsymbol{\hat{z}}_{i}$ , where $\boldsymbol{\hat{c}}_{i}$ are supervised by the concept prediction loss (Equation 7), while $\boldsymbol{\hat{z}}_{i}$ are complementary representations learnt without concept labels. Representations $\boldsymbol{\hat{z}}_{i}$ are meant to capture the residual relationship between $\boldsymbol{x}$ and $\boldsymbol{y}$ not represented among the observed concepts $\boldsymbol{c}$ . A forward pass of the SSMVCBM is given by

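$$\begin{aligned}\boldsymbol{h}_{i}^{\boldsymbol{c},v}&=\boldsymbol{h}_{\boldsymbol{\psi}^{\boldsymbol{c}}}\left(\boldsymbol{x}_{i}^{v}\right),\quad 1\leq v\leq V_{i},\\ \boldsymbol{h}_{i}^{\boldsymbol{z},v}&=\boldsymbol{h}_{\boldsymbol{\psi}^{\boldsymbol{z}}}\left(\boldsymbol{x}_{i}^{v}\right),\quad 1\leq v\leq V_{i},\\ \boldsymbol{\bar{h}}_{i}^{\boldsymbol{c}}&=\boldsymbol{r}_{\boldsymbol{\xi}^{\boldsymbol{c}}}\left(\left\{\boldsymbol{h}_{i}^{\boldsymbol{c},v}\right\}_{v=1}^{V_{i}}\right),\\ \boldsymbol{\bar{h}}_{i}^{\boldsymbol{z}}&=\boldsymbol{r}_{\boldsymbol{\xi}^{\boldsymbol{z}}}\left(\left\{\boldsymbol{h}_{i}^{\boldsymbol{z},v}\right\}_{v=1}^{V_{i}}\right),\\ \boldsymbol{\hat{c}}_{i}&=\boldsymbol{s}_{\boldsymbol{\zeta}^{\boldsymbol{c}}}\left(\boldsymbol{\bar{h}}_{i}^{\boldsymbol{c}}\right),\\ \boldsymbol{\hat{z}}_{i}&=\boldsymbol{s}_{\boldsymbol{\zeta}^{\boldsymbol{z}}}\left(\boldsymbol{\bar{h}}_{i}^{\boldsymbol{z}}\right),\\ \hat{y}_{i}&=f_{\boldsymbol{\theta}}\left(\left[\boldsymbol{\hat{c}}_{i},\boldsymbol{\hat{z}}_{i}\right]\right),\end{aligned}\qquad\text{(6)}$$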

where variables and parameters superscripted by $\boldsymbol{c}$ and $\boldsymbol{z}$ correspond to the concept and representation learning modules, respectively.

To avoid learning a representation redundant to the concepts, it is desirable that $\boldsymbol{\hat{c}}\perp\!\!\!\perp\boldsymbol{\hat{z}}\,\rvert\,y$ , i.e. the predicted concepts and unsupervised representations should be statistically independent conditional on the label. Concretely, we use another neural network $\boldsymbol{a}_{\boldsymbol{\tau}}:\>\mathbb{R}^{J}\rightarrow\mathbb{R}^{K}$ , parameterized by weights $\boldsymbol{\tau}$ , to quantify the degree of statistical dependence as $\max_{\boldsymbol{\tau}}\mathrm{corr}\left(\boldsymbol{a}_{\boldsymbol{\tau}}\left(\boldsymbol{\hat{z}}\right),\,\boldsymbol{\hat{c}}\right)$ [70]. Thus, network $\boldsymbol{a}_{\tau}$ is used to adversarially regularize representation $\boldsymbol{\hat{z}}$ . Empirically, we observed that this regularization scheme helps de-correlate $\boldsymbol{\hat{z}}$ from concept predictions and improves the model’s intervenability (Appendix F.2). Additionally, note that, for the data-generating mechanisms shown in Figure 3, $\boldsymbol{\hat{z}}$ does not need to identify unobserved concepts $\boldsymbol{c}^{\prime}$ but rather represents the residual relationship between $\boldsymbol{x}$ and $y$ .

The procedure to train SSMVCBMs is outlined in Algorithm D.1. Similar to the sequential optimization for (MV)CBMs as in Eqs. (2) and (3), it consists of multiple steps. First, parameters $\boldsymbol{\phi}^{\boldsymbol{c}}=\left\{\boldsymbol{\psi}^{\boldsymbol{c}},\,\boldsymbol{\xi}^{\boldsymbol{c}},\,\boldsymbol{\zeta}^{\boldsymbol{c}}\right\}$ involved in concept prediction are optimized using the loss function analogous to Eq. (2). Then, we fix $\boldsymbol{\hat{\phi}}^{\boldsymbol{c}}$ and optimize parameters $\boldsymbol{\phi}^{\boldsymbol{z}}=\left\{\boldsymbol{\psi}^{\boldsymbol{z}},\,\boldsymbol{\xi}^{\boldsymbol{z}},\,\boldsymbol{\zeta}^{\boldsymbol{z}}\right\}$ by solving the following problem:

[TABLE]

where $\lambda>0$ is a tuning parameter corresponding to the weight of the adversarial regularizer. The loss function above can be extended with further regularization terms, e.g. to de-correlate individual dimensions of $\boldsymbol{\hat{z}}$ [80], facilitating a more straightforward interpretation. In practice, the minimax objective is optimized using adversarial training similarly to the generative adversarial networks [88]. Last but not least, parameters of the target model are re-optimized, cf. Eq. (3), treating $\boldsymbol{\hat{\phi}}^{\boldsymbol{c}}$ and $\boldsymbol{\hat{\phi}}^{\boldsymbol{z}}$ as fixed: $\boldsymbol{\hat{\theta}}=\arg\min_{\boldsymbol{\theta}}\sum_{i=1}^{N}w_{i}^{t}\mathcal{L}^{t}\left(f_{\boldsymbol{\theta}}\left(\left[\boldsymbol{\hat{c}_{i}},\boldsymbol{\hat{z}_{i}}\right]\right),y_{i}\right)$ .

3 Experiments and Results

The purpose of our experiments was twofold: (i) to present a proof of concept for the introduced extensions of the CBMs on simple benchmarks and (ii) to apply our techniques to a real-world medical imaging dataset. In the subsequent sections, we provide a more detailed overview of the experimental setup.

3.1 Experimental Setup

Datasets and validation scheme

To test the feasibility of the proposed concept-based multiview classification approaches, we conducted an initial experiment using a synthetic tabular nonlinear classification problem. The generative process of this dataset was defined directly based on the classical concept bottleneck model, involving (i) the sampling of a design matrix, (ii) the mapping of features to concepts, and (iii) the use of these concepts to construct labels. In addition, we constructed multiple “views”, each comprising a subset of the original feature set. This dataset is particularly suited to multiview approaches due to its inherent structure. Its essential advantage over the conventional benchmarks from the literature, such as the UCSD Birds, is the presence of reliable per-data-point concept labels. Additional details can be found in Appendix B. This problem features binary concepts that are identifiable from the given multiview observations. Although, herein, concept and target prediction are classification problems, all methods presented are easily extendable to regression. In our experiments, we assessed (i) target prediction performance, (ii) concept prediction performance, and (iii) the effectiveness of interventions on the predicted concepts. Additionally, to explore the scenario where the set of concepts is incomplete, we purposefully trained the models on concept subsets of varying sizes. We compared the performance of our approach with that of single- and multiview black-box classifiers and the vanilla concept bottlenecks [97]. In addition to the tabular data, we constructed a semi-synthetic attribute-based natural image dataset based on the Animals with Attributes 2 [100, 132] (Appendix C). The experimental results for this benchmark are reported in Appendix F.1.
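The three-step generative process and the construction of views can be illustrated with a toy generator. The exact procedure is given in the paper's Appendix B and is not reproduced here; everything specific below (Gaussian features, a ReLU followed by a threshold for the concepts, a linear threshold for the label, disjoint contiguous views, all sizes) is an assumption made only for illustration.

```python
import random


def make_synthetic_multiview(n=200, n_features=12, n_concepts=3, n_views=3, seed=0):
    """Toy generator: features -> binary concepts -> binary label, plus views."""
    rng = random.Random(seed)
    concept_w = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_concepts)]
    label_w = [rng.gauss(0, 1) for _ in range(n_concepts)]
    view_size = n_features // n_views
    dataset = []
    for _ in range(n):
        # (i) one row of the design matrix
        x = [rng.gauss(0, 1) for _ in range(n_features)]
        # (ii) features -> binary concepts through a nonlinearity (ReLU, then threshold)
        c = [int(sum(w * max(v, 0.0) for w, v in zip(ws, x)) > 0) for ws in concept_w]
        # (iii) concepts -> label; the label depends on x only through c
        y = int(sum(w * ck for w, ck in zip(label_w, c)) > 0)
        # views: disjoint, contiguous subsets of the feature vector (an assumption)
        views = [x[v * view_size:(v + 1) * view_size] for v in range(n_views)]
        dataset.append((views, c, y))
    return dataset


data = make_synthetic_multiview()
views_0, concepts_0, label_0 = data[0]
```

Because the label is a function of the concepts only, a complete concept set is sufficient by construction; dropping some concepts at training time emulates the incomplete-concept scenario.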

Last but not least, to demonstrate the effectiveness of our proposed methods on real-world data, we employed ultrasound imaging and tabular clinical, laboratory, and scoring data from pediatric patients with suspected appendicitis. We explored three different target variables encompassing the diagnosis, treatment assignment, and complications. A comprehensive overview of this dataset is available in the previous sections and in Appendix A. For model validation and comparison, we divided the data using a 90%-10% train-test split. Hyperparameter tuning was performed only on the training set using five-fold cross-validation. The final hyperparameter values are reported in Tables E.2–E.6. The list of high-level concepts relevant to decision support for pediatric appendicitis can be found in Table 2. The selection criteria for these variables were the following: (i) the concept had to be detectable from ultrasound images, as confirmed by a qualified physician, and (ii) the variable had to have been collected preoperatively.
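The validation scheme can be sketched as follows. This is a minimal illustration, assuming that the indices refer to patients (so that all images of one patient fall on the same side of the split) and that the split is a simple random shuffle; the helper names, the seed, and the round-robin fold assignment are hypothetical.

```python
import random

def train_test_split_indices(n, test_fraction=0.1, seed=0):
    """Shuffle patient indices and hold out a fraction as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def k_fold(train_idx, k=5):
    """Yield (fit, validation) index lists; the test set is never touched."""
    folds = [train_idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, val

# 579 patients, as in the pediatric appendicitis dataset.
train, test = train_test_split_indices(579)
for fit, val in k_fold(train, k=5):
    pass  # tune hyperparameters on (fit, val) pairs only
```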

Ablations

We compared several variations of the proposed multiview concept bottlenecks to better understand the role of the design choices made. Specifically, we trained models using sequential (MVCBM-seq) and joint (MVCBM-joint) optimization procedures given by Eqs. (2)–(4). We also compared the semi-supervised extension (SSMVCBM) defined in Eq. (6) to the basic MVCBM. To facilitate meaningful comparison, we purposefully trained models under insufficient concept sets to observe if the SSMVCBM could achieve any performance improvement over the MVCBM. Furthermore, we investigated the impact of two fusion functions, namely, the arithmetic mean ((SS)MVCBM-avg) and LSTM ((SS)MVCBM-LSTM). Lastly, similar to [97], we explored interventions on the concept bottlenecks by replacing the predicted concept values with the ground truth at test time. The goal was to investigate whether a practitioner utilizing a concept-based model could improve its predictions interactively.
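For intuition, the averaging fusion can be sketched in a few lines. The function name `fuse_mean` and the use of plain Python lists in place of encoder outputs are assumptions for illustration, not the implementation used in the experiments.

```python
def fuse_mean(view_features):
    """Average per-view feature vectors into a single representation.

    Works for any number of views per subject and ignores their order.
    """
    if not view_features:
        raise ValueError("at least one view is required")
    n_views = len(view_features)
    dim = len(view_features[0])
    return [sum(v[d] for v in view_features) / n_views for d in range(dim)]

# Two subjects with different numbers of views map to vectors of the same size.
subject_a = fuse_mean([[1.0, 2.0], [3.0, 4.0]])
subject_b = fuse_mean([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
```

Unlike a recurrent fusion such as the LSTM, the arithmetic mean is invariant to the order in which the views are supplied.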

Baselines

We benchmarked the performance of the (SS)MVCBMs against several baselines. Across all datasets, we applied single-view neural-network-based classifiers. Specifically, we trained MLPs for tabular data and fine-tuned ResNet-18 [92] on images. As an interpretable single-view baseline, we employed vanilla CBMs. To ensure a fair comparison between CBMs and (SS)MVCBMs, we utilized identical architectures for individual modules. As a black-box multiview baseline, we employed a neural network with the same architecture as for the MVCBM but trained without concept supervision in the bottleneck layer, which we refer to as multiview bottleneck (MVBM). Similarly, as for its interpretable counterpart, we compared two ways of aggregating per-view representations: averaging and LSTM. Lastly, specific to the pediatric appendicitis dataset, in addition to deep-learning- and concept-based approaches, we also investigated an alternative baseline predictive model: a random forest (RF) [77] fitted on radiomic features [127]. The features were extracted from every image and averaged across the views for each subject.

Evaluation

Since the intended use case of our models in healthcare applications is decision support rather than decision-making, we mainly focused on evaluating the performance of concept and label predictions using areas under receiver operating characteristic (AUROC) and precision-recall (AUPR) curves. Notably, for pediatric appendicitis, different metrics may be relevant depending on the target variable, e.g. a low false negative rate may be critical for diagnosis and severity, while a low false positive rate may be desirable for management to avert negative appendectomies [98]. Furthermore, for appendicitis, we also assessed the predictions’ calibration using the Brier score.
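As a reference for two of these metrics, the sketch below computes the AUROC through its rank interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half) and the Brier score as the mean squared error of the predicted probabilities. It is a plain-Python illustration with hypothetical function names, not the evaluation code used in the study; AUPR is omitted for brevity.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)

y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auroc(y, p), brier_score(y, p))
```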

Implementation details

We implemented MVCBM and SSMVCBM in PyTorch (v 1.11.0) [110]. Across all experiments and models, when applicable, we fine-tuned pretrained ResNet-18 [92] as the shared view encoder. For the concept encoder and target model, we utilized MLPs with ReLU hidden activations. Detailed architecture specifications are provided in Appendix E.

We used the PyRadiomics package [127] for radiomic feature extraction. Features were extracted from the whole images without prior segmentation of the region of interest since segmentation is beyond the scope of the current work. We computed first-order statistics, gray level size zone and gray level run length matrix features from the original and square-filtered images. Random forests were trained with a cost-sensitive loss function to account for class imbalance. ANOVA $F$-value-based feature selection was performed using nested cross-validation to improve the performance of this baseline further. The remainder of the implementation details can be found in Appendix E and within the publicly available code and documentation.

3.2 Proof of Concept on Synthetic Data

The first benchmark we considered was tabular synthetic nonlinear data. Figure 4 contains the summary of the results. As expected, black-box and concept-based multiview approaches are consistently more accurate than their single-view counterparts at target (Figure 4(a)) and concept prediction (Figure 4(b)). Namely, a multiview bottleneck model without concept supervision (MVBM) performs considerably better than a multilayer perceptron trained on a single view (MLP) (paired $t$-test $p$-value $<0.0001$ for target AUROC); similarly, a multiview concept bottleneck (MVCBM) outperforms a simple CBM (for all numbers of concepts given, $p$-value $<0.05$ for target and concept AUROC). Notably, the target prediction accuracy for CBM and MVCBM increases with the number of concepts given, as shown in Figure 4(a). When almost a complete concept set is provided, the performance of the multiview CBM becomes closer to that of the multiview black-box classifier. The semi-supervised MVCBM (SSMVCBM) performs well even when very few concepts are known and is close to the black-box baseline in most settings (for at least 5/30 concepts given, $p$-value $>0.05$ for target AUROC).

For the concept prediction, MVCBM and SSMVCBM attain comparable performance with higher AUROCs than the single-view model (Figure 4(b)). As expected, the semi-supervised model predicts the concepts as well as the MVCBM (for all numbers of concepts given, $p$-value $>0.05$ for concept AUROC); thus, representation learning has no effect on the concept prediction. Lastly, we observe from Figure 4(c) that, similarly to the classical CBM, both multiview variants are intervenable, i.e. their predictive performance improves when replacing predicted concepts with the ground truth at test time.

In addition to the results above, Appendix F.1 describes experiments on a semi-synthetic attribute-based natural image dataset. In brief, we observed similar results to the ones reported in Figure 4. In Appendix F.2, we explore the SSMVCBM in more detail, performing an ablation study on the effect of adversarial regularization.

3.3 Application to Pediatric Appendicitis

Our multiview concept bottleneck models are readily applicable to medical imaging datasets, which, in practice, often include multiple views and heterogeneous data types. In the following, we explore the application of the multiview CBMs to the pediatric appendicitis dataset.

Predicting high-level ultrasound features

We first evaluated the ability of all concept-based models to predict high-level appendix ultrasound features (Table 2) from (multiple) abdominal US images. Table 3 contains test-set AUROCs and AUPRs achieved by the different variants of the concept bottleneck. In addition to comparing vanilla CBMs to their multiview and semi-supervised extensions, we investigated the effect of the optimization procedure, sequential vs. joint, and view-specific feature fusion, averaging vs. long short-term memory (LSTM). The models included in Table 3 were trained to predict the diagnosis (appendicitis vs. no appendicitis); however, we observed similar results for the management and severity, as shown in Tables 4–5. Minor discrepancies across the three classification problems are attributable to the differences in the weights assigned to data points in the cost-sensitive loss function (Eqs. (2)–(4) and (7)) and the choice of hyperparameter values (Tables E.2–E.6).

Across all target variables, most concepts could be predicted by at least one of the models significantly better than by a fair coin flip (one-sample two-sided $t$-test $p$-value $<0.05$, adjusted using the Benjamini–Yekutieli procedure with the FDR of $q=0.05$). Surprisingly, some of the variables with relatively few cases present in the dataset could be captured by some models, e.g. coprostasis ($c_{8}$) and meteorism ($c_{9}$) by the LSTM-based variants of MVCBM and SSMVCBM. On the other hand, the thickening of the bowel wall ($c_{7}$) was particularly challenging to model, likely due to its low prevalence and the lack of predictive power in the downstream classification task: some models trained with the severity as the target were able to perform significantly better than random, as shown in Table 5.
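The Benjamini–Yekutieli adjustment mentioned above can be sketched as follows. This is a generic textbook implementation with a hypothetical function name and made-up example $p$-values, not the code or the results of the study: each sorted $p$-value $p_{(i)}$ is scaled by $m \cdot c(m) / i$ with $c(m)=\sum_{k=1}^{m} 1/k$, and monotonicity is enforced from the largest $p$-value downwards.

```python
def benjamini_yekutieli(p_values):
    """Return BY-adjusted p-values in the original order."""
    m = len(p_values)
    c_m = sum(1.0 / k for k in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted

# Illustrative p-values only; a concept counts as significant if adjusted < q.
adj = benjamini_yekutieli([0.01, 0.04, 0.03])
significant = [a < 0.05 for a in adj]
```

The harmonic factor makes the procedure valid under arbitrary dependence between the tests, at the cost of being more conservative than the Benjamini–Hochberg procedure.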

Note that, in a few cases, some models achieved average AUROCs below the expected performance of a fair coin flip (Table 3), e.g. both sequentially and jointly optimized CBMs attained an AUROC close to 0.40 for predicting meteorism. Such performance is attributable to the sparsity of some concept variables; for instance, only 15% of subjects had a positive label for meteorism (Table 2). Another factor is the use of weighted loss functions for the concept and target prediction (Eqs. (2)–(4) and (7)). Consequently, the models may over-predict the minority class and perform worse than a fair coin flip.

Predictably, sequentially optimized models (seq) performed better at concept prediction than the ones optimized jointly (joint), in agreement with the findings reported in the literature [97]. Similar to the experiments on the synthetic data shown in Figure 4(b), the models aggregating multiple views tended to have higher AUROCs and AUPRs. By contrast, LSTM-based aggregation consistently and noticeably outperformed simple averaging (avg), especially for predicting the visibility of the appendix, one of the most important diagnostic concepts [103]. This could be associated with the loose spatiotemporal ordering among the US images acquired for each subject. Last but not least, semi-supervised bottlenecks were comparable to the sequentially optimized MVCBMs. Thus, learning complementary representations disentangled from the concepts did not hurt the model’s performance at concept prediction.

In addition to the discriminative power, we assessed the calibration of the concept predictions. The test-set Brier scores across the three targets are reported in Appendix F.5, Table F.4. Overall, similar to the findings above, multiview models attained lower Brier scores for most concept variables than the single-view CBMs. The cases wherein single-view CBMs performed better than their multiview counterparts may be attributed to the imbalances in concept distributions and the fact that the Brier score does not adjust for such situations. For instance, for very sparse response variables, a classifier trivially predicting the most frequent category would achieve a relatively low Brier score. Although many models predicted several concepts significantly better than the constant prediction of 0.5, their Brier scores were mainly in the range of 0.18-0.23, which is not considerably below the baseline of 0.25.
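The Brier score baselines discussed above follow from a short calculation, sketched below with a hypothetical helper. For a binary variable with positive rate $\pi$, always predicting a constant $a$ has an expected Brier score of $\pi(1-a)^2 + (1-\pi)a^2$. The prevalence of 15% is the one quoted for meteorism; the remaining values are simple arithmetic.

```python
def constant_brier(prevalence, constant):
    """Expected Brier score of always predicting `constant` for a binary
    variable whose positive rate is `prevalence`."""
    return prevalence * (1 - constant) ** 2 + (1 - prevalence) * constant ** 2

coin_flip = constant_brier(0.15, 0.5)       # 0.25 for any prevalence
always_negative = constant_brier(0.15, 0.0)  # equals the prevalence, 0.15
base_rate = constant_brier(0.15, 0.15)       # pi * (1 - pi) = 0.1275
```

For a sparse concept, a trivial predictor can therefore already score well below 0.25, which is why Brier scores in the range of 0.18 to 0.23 should be read with the concept prevalence in mind.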

Predicting diagnosis, management, and severity

As mentioned, the end goal of the developed models was the prediction of the (i) diagnosis, (ii) management, and (iii) severity among suspected appendicitis patients based on the multiview US images. Test-set performance for these three target variables is reported in Table 6.

With respect to AUROC and AUPR, all models were able to predict all target variables better than the naive baseline. Among the concept-based approaches, multiview models offered a consistent improvement over the vanilla CBM for diagnosis and severity. Moreover, the best-performing concept-based classifiers often achieved AUROCs and AUPRs comparable to those of the black-box MVBM. For the diagnosis, on average, multiview concept bottlenecks with the LSTM-based fusion outperformed averaging-based approaches. However, for management, the opposite was true. Expectedly, while the LSTM-based fusion was helpful in the pediatric appendicitis dataset where US images are chronologically ordered, at test time, the target prediction performance of the LSTM-based CBMs was sensitive to the order of input images, as observed in the supplementary experimental results in Appendix F.3. For the diagnosis and management prediction, we also observed that neural-network-based methods, overall, outperformed RFs fitted on radiomics features. The latter result is not surprising, given that we did not utilize manually segmented regions of interest for radiomics feature extraction. Lastly, across all targets, the semi-supervised extension of the MVCBM achieved higher AUROCs and AUPRs or was comparable to the approaches that purely relied on the concepts.

Brier score results partially agree with AUROCs and AUPRs; however, they feature less variability across model classes. For all target variables, most scores are $\geq 0.20$. Combined with the reported AUROCs and AUPRs, the latter finding indicates that the probabilistic predictions of the models considered could benefit from calibration, which could help produce more interpretable probabilistic outputs.

Along with the model comparison w.r.t. AUROCs, AUPRs, and Brier scores, we investigated the tradeoff between true positive (TPR) and false positive (FPR) rates in more detail for predicting the diagnosis. Full results are reported in Table F.3 (Appendix F.4). In particular, we assessed the models’ FPRs for a few fixed satisfactory levels of the TPR. As expected, we observed that, for all approaches, attaining high TPRs led to relatively high FPRs of $>30\%$.
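One way to read off the FPR at a fixed TPR level is sketched below. The function name and the choice of scanning every observed score as a candidate threshold are assumptions for illustration; the exact procedure behind Table F.3 may differ.

```python
def fpr_at_tpr(y_true, scores, target_tpr):
    """Smallest FPR among thresholds whose TPR is at least `target_tpr`.

    A case is called positive when its score is >= the threshold.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best = 1.0  # the lowest threshold always reaches TPR = 1 with FPR = 1
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        if tp / n_pos >= target_tpr:
            best = min(best, fp / n_neg)
    return best

# Toy scores: detecting every positive costs half of the negatives.
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(fpr_at_tpr(y, s, 1.0), fpr_at_tpr(y, s, 0.5))
```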

In summary, concept-based classification on multiview US data is encouragingly effective at predicting the diagnosis. For management, aggregating multiple US images offers no improvement over simple single-view classification. We attribute this to the diagnostic nature of the chosen concepts and their limited predictive power for the treatment assignment. Likewise, accurately predicting appendicitis severity is challenging, likely due to the low prevalence of complicated appendicitis cases in the current dataset. Last but not least, in all tasks, the proposed SSMVCBM mitigated the poorer discriminative performance of concept-based approaches by learning representations complementary to the probably incomplete concept set.

Interacting with the model

The practical utility of CBMs lies in the ability of the human user, in the current use case, the physician, to intervene on the concepts predicted by the model, thus affecting the model’s behavior at test time. Similarly to the proof-of-concept experiments, we intervened on the bottleneck layers of the CBM, MVCBM, and SSMVCBM trained on the pediatric appendicitis data. Figure 5 summarizes these results. Since LSTM-based and sequentially trained classifiers generally captured the concepts better (Table 3), we only considered this specific configuration. Figure 5 shows the effect of interventions on the three models for the diagnosis (Figures 5(a) and 5(d)), management (Figures 5(b) and 5(e)), and severity (Figures 5(c) and 5(f)). The lines show changes in AUROCs and AUPRs when intervening on randomly chosen concept subsets of varying sizes.
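An intervention on a randomly chosen concept subset can be sketched as below. The function name, the list-based representation of the bottleneck, and the example values are hypothetical; in the actual models, the modified concept vector would be passed on to the target model to obtain the updated prediction.

```python
import random

def intervene(predicted, ground_truth, n_intervened, rng):
    """Replace `n_intervened` randomly chosen predicted concepts with the
    ground truth, leaving the remaining predictions unchanged."""
    chosen = rng.sample(range(len(predicted)), n_intervened)
    out = list(predicted)  # copy, so the original predictions are preserved
    for k in chosen:
        out[k] = float(ground_truth[k])
    return out, sorted(chosen)

rng = random.Random(0)
predicted = [0.2, 0.7, 0.4, 0.9]  # sigmoid activations of the bottleneck
truth = [0, 1, 1, 0]              # concept values provided by the user
corrected, chosen = intervene(predicted, truth, 2, rng)
```

Repeating this for subsets of increasing size and re-evaluating the target prediction yields intervention curves of the kind summarized in Figure 5.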

For the diagnosis, the intervention effect is similar to the behavior of the models on the synthetic data shown in Figure 4(c). Namely, AUROC and AUPR increase steadily with the number of concepts intervened on: for all models, the maximum median AUROC and AUPR achieved are approx. 0.85 and 0.94, respectively. Being the best-performing model (Table 6), SSMVCBM demonstrates only a slight increase in median predictive performance after intervening on the full concept set.

Similarly, for management, we observed an increase in AUROC and AUPR. However, for predicting this target, a single-view CBM performed surprisingly well and overtook multiview models after interventions. Last but not least, interventions yielded no visible performance improvement for severity, possibly due to considerable variance across initializations and randomly sampled concept subsets.

3.4 Online Prediction Tool

As a first step towards enabling clinicians and other interested parties to benefit from ML-based decision support, we developed and published an online decision support tool based on the abovementioned methods, available at https://papt.inf.ethz.ch/mvcbm. The use case is illustrated in Figure 6. The tool utilizes the multiview CBM model (Figure 1) for predicting the diagnosis in suspected appendicitis patients. The user may upload several ultrasonography images, each representing a different view of the same patient. Image preprocessing, described in the Methods section and demonstrated in Figure 2, may be optionally executed. In addition to predicting the diagnosis, the tool allows the user to intervene on the concept predictions (Table 2) by setting corresponding sigmoid activations to 0 (negative) or 1 (positive). Uploaded images are protected using server-side sessions, which are only temporarily stored on the server and are purged after 30 minutes. See Appendix G for more information.

4 Discussion

Most of the prior work on using ML for appendicitis has focused on tabular datasets with handcrafted features [93, 82, 114, 74, 71, 121, 103, 115, 131] or more invasive imaging modalities, such as computed tomography [112]. This work takes the first step towards the computer-aided diagnosis of appendicitis based on abdominal ultrasound, a noninvasive, accessible, and cheap technique. Moreover, to facilitate the replication of our results and allow for comparison with new methods, we made our anonymized dataset publicly accessible. It includes laboratory, physical exam, clinical, and US data from 579 patients. In addition, for demonstration purposes, we deployed the MVCBM model for the diagnosis as an easy- and free-to-use web tool.

Although appendicitis is a common condition in the pediatric population, diagnosing it and choosing the best therapeutic option is challenging. Early differentiation between simple and complicated, necrotizing appendicitis is crucial for effective management and prognosis [113, 114, 96]. The advances in US resolution, especially with high-frequency sonography, support the detection of a normal appendix and the identification of indirect appendicitis signs, such as surrounding tissue inflammation and the reaction of the intestinal bowel wall [109]. ML-based decision support tools may further increase diagnostic accuracy and prove pivotal in improving treatment outcomes. The results of the current study are promising, as they suggest that direct interpretation of US images by ML models is a feasible goal. Predictive models, such as the ones developed in this study, may assist physicians in interpreting acquired US images and may even enable comparison of the results with the newly conducted US exams to characterize the progress or resolution of the inflammation.

Moreover, this work presents an improvement upon traditional concept bottleneck models [97], making them more readily applicable to medical imaging datasets where multiple images or modalities may be observable for each subject. In order to accomplish this, we proposed a practical architecture based on the hybrid fusion approach [75], which can effectively handle varying numbers of views per data point, partial observability of the concepts from individual images, and the incorporation of spatial or temporal ordering. While prior research has explored the use of averaging and LSTM techniques for aggregating representations [90, 101], our focus is specifically on interpretable models, particularly those involving concept-based classification. To the best of our knowledge, this problem setting has not been previously discussed in the literature despite its relevance to biomedical applications [129, 111].

Another scenario that we studied, similarly pertinent to applications, is when the concept set given to a CBM is insufficient [134], i.e. does not entirely capture the predictive relationship between the covariates and the target. To address this issue and improve the CBM’s predictive performance, our model learns additional representations complementary to the concepts, i.e. de-correlated from the concepts yet helpful in the downstream prediction problem. To achieve this objective, we modified the model’s architecture, incorporated an adversarial regularization term into the loss function, and adapted the training procedure accordingly.

A few previous works have investigated related limitations of the CBMs when the concept set provided to the CBM proves insufficient, and have explored alternative model designs. For instance, [119] combined CBMs with self-explaining neural networks to learn additional unsupervised concepts; however, they did not investigate the disentanglement of the given and learned concepts or the intervenability of their extended bottleneck layer. [136] proposed fitting a concept bottleneck post hoc for a pretrained backbone and utilized residual fitting to compensate for an incomplete concept set. Moreover, they investigated global model editing, e.g. to mitigate the classifier’s reliance on spurious correlations. In contrast, our work assumes an ante hoc modeling scenario and focuses on the local, i.e. single-data-point, interventions. Another related line of research also studied the problem of unobserved concepts and concept leakage [104], employing generative representation learning, which may be challenging to apply to smaller datasets in practice. The most closely related is the concurrent work by [91], who extended the standard CBM architecture with a side channel to learn latent concepts and compensate for insufficiency. While their method is similar to ours, it does not address multiview learning or consider medical imaging data.

In our experiments, we have demonstrated the feasibility of the proposed models and the benefits of the multiview and semi-supervised concept-based approach on synthetic and medical image data. Our findings have shown that the MVCBM and SSMVCBM models have generally outperformed vanilla CBM in terms of both concept and target prediction. Moreover, based on the US data, we have developed predictive models for appendicitis, its severity and the management of pediatric patients with abdominal pain (Tables 3–6). Our results suggest that, for the diagnosis, multiview concept bottlenecks can achieve comparable performance to black-box models while allowing medical practitioners to interpret and intervene on the predictions. For management and severity, we observed somewhat inconclusive results with little difference across the single- and multiview classifiers. We attribute the latter to the limited predictive power of the ultrasonographic findings for these targets [103], the diagnostic nature of the chosen concepts and the overall moderate size of the training set. For instance, it has been previously shown that the most important predictor of the treatment assignment is peritonitis/abdominal guarding [103] assessed during a clinical examination. Among the US findings, most other predictively useful attributes can be identified based on the RLQ image alone. Therefore, we hypothesize that the additional views, e.g. depicting pathological lymph nodes or meteorism, are not as helpful for the management classification. This observation might explain the relatively worse performance of the multiview approaches for this target variable.

Nevertheless, the current study exhibits certain limitations with regard to its design, experimental setup, and proposed methods. The appendicitis dataset represents a moderately sized and relatively homogeneous patient cohort recruited from a single clinical center over a short time (between 2016 and 2021). Hence, in order to further validate predictive models, an external validation is necessary using data from diverse US devices, clinical centers, and countries. Another limitation is the lack of histologically confirmed diagnoses among the conservatively treated patients. This implies that the model validation and comparison results presented above must be interpreted cautiously since we do not have access to the true disease status for all subjects. The image preprocessing pipeline could be improved further: currently, we discard scale information in the US images, making it impossible to detect the appendix diameter, a relevant sonographic sign of appendicitis [113]. Lastly, concepts could be modeled in a more fine-grained manner to incorporate physicians’ uncertainty. Instead of just differentiating between the absence or presence of a finding, intermediate concept categories could be included by, for example, collecting data from multiple raters and considering discrepancies among them.

From the methodological perspective, we currently have a limited theoretical understanding of the (SS)MVCBMs. In particular, it would be desirable to explore the representations learned by SSMVCBMs and the identifiability of the ground-truth generative factors. Moreover, in the current implementation, it is not trivial to interpret the representations; thus, additional regularization may be necessary, such as rendering these representations disentangled.

Another potential improvement would be adopting a probabilistic approach to the concept and target variable prediction, facilitating more principled uncertainty estimation. As evidenced by the experiments, our predictive models could benefit from calibration. Explicit uncertainty modeling would allow for better-calibrated and more interpretable probabilistic predictions that could be utilized downstream to perform selective classification [85] and uncertainty-based concept interventions [120]. In practice, uncertainty in concept predictions could be modeled by adapting the proposed architecture with the modules from the stochastic segmentation networks [106] or probabilistic concept bottlenecks [95].

5 Conclusion and Outlook

Motivated by the demand for model interpretability in biomedical applications, we investigated the use of concept bottleneck models for predicting the diagnosis, management and severity among pediatric patients with suspected appendicitis, leveraging abdominal ultrasound images. The densely annotated dataset used to develop the predictive models was made publicly available, and one of the models was deployed as a freely available demo web tool (https://papt.inf.ethz.ch/mvcbm). Methodologically, we introduced several enhancements to the conventional concept-based classification approach. Our proposed models can handle multiple views of the object of interest and insufficient concept sets. Overall, our experimental results suggest that the proposed methods can deliver competitive performance, while offering an alternative to black-box deep learning models and allowing for real-time interaction with the end user.

In future work, we aim to address several limitations outlined above. We plan to validate the predictive models externally on the data from a hospital located in another country. Various model design alterations, such as other choices of learnable fusion, further regularization of the learned representations, and uncertainty quantification, are also to be considered. Moreover, we recognize the significance of extending our investigation beyond the retrospective study. For instance, it would be interesting to explore the use of active learning to decide on the acquisition of US images and concept labels for each subject. From the clinical perspective, developed models should be extended to incorporate clinical and laboratory parameters and consider other conditions, such as COVID-19, during appendicitis. Additionally, we anticipate that using more refined definitions of the target variables could provide more insightful results, e.g. differentiating between subacute and acute appendicitis for the diagnosis and predicting the risk of secondary appendectomy for the management. Adjustments in the model architecture and the acquisition of a larger training dataset will facilitate the incorporation of the color Doppler images in the analysis, potentially making the prediction of the disease severity progression more accurate.

Data availability

The anonymized data are available on Zenodo at https://doi.org/10.5281/zenodo.7711412, and the code can be found in a GitHub repository at https://github.com/i6092467/semi-supervised-multiview-cbm.

References

[69] Amish Acharya, Sheraz R. Markar, Melody Ni and George B. Hanna

“Biomarkers of acute appendicitis: systematic review and cost–benefit trade-off analysis”

In Surgical Endoscopy 31.3

Springer Science+Business Media LLC, 2016, pp. 1022–1031

DOI: 10.1007/s00464-016-5109-1

[70] Ehsan Adeli, Qingyu Zhao, Adolf Pfefferbaum, Edith V. Sullivan, Li Fei-Fei, Juan Carlos Niebles and Kilian M. Pohl

“Representation Learning with Statistical Independence to Mitigate Bias”

In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV)

Waikoloa, HI, USA: IEEE, 2021

DOI: 10.1109/wacv48630.2021.00256

[71] Omer F. Akmese, Gul Dogan, Hakan Kor, Hasan Erbay and Emre Demir

“The Use of Machine Learning Approaches for the Diagnosis of Acute Appendicitis”

In Emergency Medicine International 2020

Hindawi Limited, 2020, pp. 1–8

DOI: 10.1155/2020/7306435

[72] Alfredo Alvarado

“A practical score for the early diagnosis of acute appendicitis”

In Annals of Emergency Medicine 15.5, 1986, pp. 557–564

DOI: 10.1016/S0196-0644(86)80993-3

[73] Roland E. Andersson

“The Natural History and Traditional Management of Appendicitis Revisited: Spontaneous Resolution and Predominance of Prehospital Perforations Imply That a Correct Diagnosis is More Important Than an Early Diagnosis”

In World Journal of Surgery 31.1

Springer Science+Business Media LLC, 2006, pp. 86–92

DOI: 10.1007/s00268-006-0056-y

[74] Emrah Aydin, İnan Utku Türkmen, Gözde Namli, Çiğdem Öztürk, Ayşe B. Esen, Y. Eray, Egemen Eroğlu and Fatih Akova

“A novel and simple machine learning algorithm for preoperative diagnosis of acute appendicitis in children”

In Pediatric Surgery International 36.6

Springer Science+Business Media LLC, 2020, pp. 735–742

DOI: 10.1007/s00383-020-04655-7

[75] Tadas Baltrušaitis, Chaitanya Ahuja and Louis-Philippe Morency

“Multimodal Machine Learning: A Survey and Taxonomy”

In IEEE Transactions on Pattern Analysis and Machine Intelligence 41.2

Institute of Electrical and Electronics Engineers (IEEE), 2019, pp. 423–443

DOI: 10.1109/TPAMI.2018.2798607

[76] Aneel Bhangu, Kjetil Søreide, Salomone Di Saverio, Jeanette Hansson Assarsson and Frederick Thurston Drake

“Acute appendicitis: modern understanding of pathogenesis, diagnosis, and management”

In The Lancet 386.10000

Elsevier BV, 2015, pp. 1278–1287

DOI: 10.1016/s0140-6736(15)00275-5

[77] Leo Breiman

“Random Forests”

In Machine Learning 45.1

Springer Science+Business Media LLC, 2001, pp. 5–32

DOI: 10.1023/a:1010933404324

[78] Veronika Cheplygina

“Cats or CAT scans: Transfer learning from natural or medical image source data sets?”

In Current Opinion in Biomedical Engineering 9

Elsevier BV, 2019, pp. 21–27

DOI: 10.1016/j.cobme.2018.12.005

[79] CODA Collaborative

“A Randomized Trial Comparing Antibiotics with Appendectomy for Appendicitis”

In New England Journal of Medicine 383.20

Massachusetts Medical Society, 2020, pp. 1907–1919

DOI: 10.1056/nejmoa2014320

[80] Michael Cogswell, Faruk Ahmed, Ross B. Girshick, Larry Zitnick and Dhruv Batra

“Reducing Overfitting in Deep Networks by Decorrelating Representations”

In 4th International Conference on Learning Representations, ICLR 2016, 2016

DOI: 10.48550/arXiv.1511.06068

[81] RIFT Study Group on behalf of the West Midlands Research Collaborative

“Appendicitis risk prediction models in children presenting with right iliac fossa pain (RIFT study): a prospective, multicentre validation study”

In The Lancet Child & Adolescent Health 4.4, 2020, pp. 271–280

DOI: 10.1016/S2352-4642(20)30006-7

[82] Louise Deleger, Holly Brodzinski, Haijun Zhai, Qi Li, Todd Lingren, Eric S Kirkendall, Evaline Alessandrini and Imre Solti

“Developing and evaluating an automated appendicitis risk stratification algorithm for pediatric patients in the emergency department”

In Journal of the American Medical Informatics Association 20.e2

Oxford University Press (OUP), 2013, pp. e212–e220

DOI: 10.1136/amiajnl-2013-001962

[83] Jens Dingemann and Benno Ure

“Imaging and the Use of Scores for the Diagnosis of Appendicitis in Children”

In European Journal of Pediatric Surgery 22.03

Georg Thieme Verlag KG, 2012, pp. 195–200

DOI: 10.1055/s-0032-1320017

[84] Finale Doshi-Velez and Been Kim

“Towards A Rigorous Science of Interpretable Machine Learning” arXiv:1702.08608, 2017

DOI: 10.48550/arXiv.1702.08608

[85] Yonatan Geifman and Ran El-Yaniv

“Selective Classification for Deep Neural Networks”

In Advances in Neural Information Processing Systems 30

Curran Associates, Inc., 2017, pp. 4885–4894

URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/4a8423d5e91fda00bb7e46540e2b0cf1-Paper.pdf

[86] Dan Geiger, Thomas Verma and Judea Pearl

“d-Separation: From Theorems to Algorithms”

In Uncertainty in Artificial Intelligence 10, Machine Intelligence and Pattern Recognition

North-Holland, 1990, pp. 139–148

DOI: 10.1016/B978-0-444-88738-2.50018-X

[87] I. Gendel, M. Gutermacher, G. Buklan, L. Lazar, D. Kidron, H. Paran and I. Erez

“Relative Value of Clinical, Laboratory and Imaging Tools in Diagnosing Pediatric Acute Appendicitis”

In European Journal of Pediatric Surgery 21.4

Georg Thieme Verlag KG, 2011, pp. 229–233

DOI: 10.1055/s-0031-1273702

[88] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville and Yoshua Bengio

“Generative Adversarial Networks”

In Communications of the ACM 63.11

New York, NY, USA: Association for Computing Machinery, 2020, pp. 139–144

DOI: 10.1145/3422622

[89] Ramon R. Gorter et al.

“Diagnosis and management of acute appendicitis. EAES consensus development conference 2015”

In Surgical Endoscopy 30.11

Springer Science and Business Media LLC, 2016, pp. 4668–4690

DOI: 10.1007/s00464-016-5245-7

[90] Mohammad Havaei, Nicolas Guizard, Nicolas Chapados and Yoshua Bengio

“HeMIS: Hetero-Modal Image Segmentation”

In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016

Athens, Greece: Springer International Publishing, 2016, pp. 469–477

DOI: 10.1007/978-3-319-46723-8_54

[91] Marton Havasi, Sonali Parbhoo and Finale Doshi-Velez

“Addressing Leakage in Concept Bottleneck Models”

In Advances in Neural Information Processing Systems, 2022

URL: https://openreview.net/forum?id=tglniD_fn9

[92] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun

“Deep Residual Learning for Image Recognition”

In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Las Vegas, NV, USA: IEEE, 2016

DOI: 10.1109/cvpr.2016.90

[93] Chung-Ho Hsieh, Ruey-Hwa Lu, Nai-Hsin Lee, Wen-Ta Chiu, Min-Huei Hsu and Yu-Chuan (Jack) Li

“Novel solutions for an old disease: Diagnosis of acute appendicitis with random forest, support vector machines, and artificial neural networks”

In Surgery 149.1

Elsevier BV, 2011, pp. 87–93

DOI: 10.1016/j.surg.2010.03.023

[94] Ilyes Khemakhem, Diederik Kingma, Ricardo Monti and Aapo Hyvarinen

“Variational Autoencoders and Nonlinear ICA: A Unifying Framework”

In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics 108, Proceedings of Machine Learning Research

Virtual: PMLR, 2020, pp. 2207–2217

[95] Eunji Kim, Dahuin Jung, Sangha Park, Siwon Kim and Sungroh Yoon

“Probabilistic Concept Bottleneck Models”

In Proceedings of the 40th International Conference on Machine Learning 202, Proceedings of Machine Learning Research

PMLR, 2023, pp. 16521–16540

URL: https://proceedings.mlr.press/v202/kim23g.html

[96] N Kiss, M Minderjahn, J Reismann, J Svensson, T Wester, K Hauptmann, M Schad, J Kallarackal, H Bernuth and M Reismann

“Use of gene expression profiling to identify candidate genes for pretherapeutic patient classification in acute appendicitis”

In BJS Open 5.1

Oxford University Press (OUP), 2021

DOI: 10.1093/bjsopen/zraa045

[97] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim and Percy Liang

“Concept Bottleneck Models”

In Proceedings of the 37th International Conference on Machine Learning 119, Proceedings of Machine Learning Research

Virtual: PMLR, 2020, pp. 5338–5348

[98] Marius Kryzauskas, Donatas Danys, Tomas Poskus, Saulius Mikalauskas, Eligijus Poskus, Valdemaras Jotautas, Virgilijus Beisa and Kestutis Strupas

“Is acute appendicitis still misdiagnosed?”

In Open Medicine 11.1

Walter de Gruyter GmbH, 2016, pp. 231–236

DOI: 10.1515/med-2016-0045

[99] Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur and Shree K. Nayar

“Attribute and simile classifiers for face verification”

In 2009 IEEE 12th International Conference on Computer Vision

Kyoto, Japan: IEEE, 2009, pp. 365–372

DOI: 10.1109/ICCV.2009.5459250

[100] Christoph H. Lampert, Hannes Nickisch and Stefan Harmeling

“Learning to detect unseen object classes by between-class attribute transfer”

In 2009 IEEE Conference on Computer Vision and Pattern Recognition

Miami, FL, USA: IEEE, 2009

DOI: 10.1109/CVPR.2009.5206594

[101] Chao Ma, Yulan Guo, Jungang Yang and Wei An

“Learning Multi-View Representation With LSTM for 3-D Shape Recognition and Retrieval”

In IEEE Transactions on Multimedia 21.5

IEEE, 2019, pp. 1169–1182

DOI: 10.1109/TMM.2018.2875512

[102] Anita Mahinpei, Justin Clark, Isaac Lage, Finale Doshi-Velez and Weiwei Pan

“Promises and Pitfalls of Black-Box Concept Learning Models” arXiv:2106.13314, 2021

DOI: 10.48550/arXiv.2106.13314

[103] Ricards Marcinkevics, Patricia Reis Wolfertstetter, Sven Wellmann, Christian Knorr and Julia E. Vogt

“Using Machine Learning to Predict the Diagnosis, Management and Severity of Pediatric Appendicitis”

In Frontiers in Pediatrics 9

Frontiers Media SA, 2021

DOI: 10.3389/fped.2021.662183

[104] Emanuele Marconato, Andrea Passerini and Stefano Teso

“GlanceNets: Interpretable, Leak-proof Concept-based Models” arXiv:2205.15612, 2022

DOI: 10.48550/arXiv.2205.15612

[105] Andrei Margeloiu, Matthew Ashman, Umang Bhatt, Yanzhi Chen, Mateja Jamnik and Adrian Weller

“Do Concept Bottleneck Models Learn as Intended?” arXiv:2105.04289, 2021

DOI: 10.48550/arXiv.2105.04289

[106] Miguel Monteiro, Loic Le Folgoc, Daniel Coelho de Castro, Nick Pawlowski, Bernardo Marques, Konstantinos Kamnitsas, Mark Wilk and Ben Glocker

“Stochastic Segmentation Networks: Modelling Spatially Correlated Aleatoric Uncertainty”

In Advances in Neural Information Processing Systems 33

Curran Associates, Inc., 2020, pp. 12756–12767

URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/95f8d9901ca8878e291552f001f67692-Paper.pdf

[107] Gerhard Mostbeck, E. Adam, Michael Bachmann Nielsen, Michel Claudon, Dirk Clevert, Carlos Nicolau, Christiane Nyhsen and Catherine M. Owens

“How to diagnose acute appendicitis: ultrasound first”

In Insights into Imaging 7.2

Springer Science and Business Media LLC, 2016, pp. 255–263

DOI: 10.1007/s13244-016-0469-6

[108] Go Ohba, Seiichi Hirobe and Koji Komori

“The Usefulness of Combined B Mode and Doppler Ultrasonography to Guide Treatment of Appendicitis”

In European Journal of Pediatric Surgery 26.06

Georg Thieme Verlag KG, 2016, pp. 533–536

DOI: 10.1055/s-0035-1570756

[109] Noh Hyuck Park, Hwa Eun Oh, Hee Jin Park and Ji Yeon Park

“Ultrasonography of normal and abnormal appendix in children”

In World Journal of Radiology 3.4

Baishideng Publishing Group Inc., 2011, pp. 85–91

DOI: 10.4329/wjr.v3.i4.85

[110] Adam Paszke et al.

“PyTorch: An Imperative Style, High-Performance Deep Learning Library”

In Advances in Neural Information Processing Systems 32

Red Hook, NY, United States: Curran Associates, Inc., 2019

[111] Xuejun Qian, Jing Pei, Hui Zheng, Xinxin Xie, Lin Yan, Hao Zhang, Chunguang Han, Xiang Gao, Hanqi Zhang, Weiwei Zheng, Qiang Sun, Lu Lu and K. Shung

“Prospective assessment of breast cancer risk from multimodal multiview ultrasound images via clinically applicable deep learning”

In Nature Biomedical Engineering 5.6

Springer Science and Business Media LLC, 2021, pp. 522–532

DOI: 10.1038/s41551-021-00711-2

[112] Pranav Rajpurkar, Allison Park, Jeremy Irvin, Chris Chute, Michael Bereket, Domenico Mastrodicasa, Curtis P. Langlotz, Matthew P. Lungren, Andrew Y. Ng and Bhavik N. Patel

“AppendiXNet: Deep Learning for Diagnosis of Appendicitis from A Small Dataset of CT Exams Using Video Pretraining”

In Scientific Reports 10.1

Springer Science and Business Media LLC, 2020

DOI: 10.1038/s41598-020-61055-6

[113] Tristan Reddan, Jonathan Corness, Kerrie Mengersen and Fiona Harden

“Ultrasound of paediatric appendicitis and its secondary sonographic signs: providing a more meaningful finding”

In Journal of Medical Radiation Sciences 63.1

Wiley, 2016, pp. 59–66

DOI: 10.1002/jmrs.154

[114] Josephine Reismann, Alessandro Romualdi, Natalie Kiss, Maximiliane I. Minderjahn, Jim Kallarackal, Martina Schad and Marc Reismann

“Diagnosis and classification of pediatric acute appendicitis by artificial intelligence methods: An investigator-independent approach”

In PLoS ONE 14.9

Public Library of Science (PLoS), 2019, pp. e0222030

DOI: 10.1371/journal.pone.0222030

[115] Pedro Roig Aparicio, Ricards Marcinkevics, Patricia Reis Wolfertstetter, Sven Wellmann, Christian Knorr and Julia E. Vogt

“Learning Medical Risk Scores for Pediatric Appendicitis”

In 20th IEEE International Conference on Machine Learning and Applications (ICMLA)

Pasadena, CA, USA: IEEE, 2021

DOI: 10.1109/ICMLA52953.2021.00243

[116] Cynthia Rudin

“Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead”

In Nature Machine Intelligence 1.5

Springer Science and Business Media LLC, 2019, pp. 206–215

DOI: 10.1038/s42256-019-0048-x

[117] Madan Samuel

“Pediatric appendicitis score”

In Journal of Pediatric Surgery 37.6, 2002, pp. 877–881

DOI: 10.1053/jpsu.2002.32893

[118] Salomone Di Saverio et al.

“WSES Jerusalem guidelines for diagnosis and treatment of acute appendicitis”

In World Journal of Emergency Surgery 11.1

Springer Science and Business Media LLC, 2016

DOI: 10.1186/s13017-016-0090-5

[119] Yoshihide Sawada and Keigo Nakamura

“Concept Bottleneck Model With Additional Unsupervised Concepts”

In IEEE Access 10

IEEE, 2022, pp. 41758–41765

DOI: 10.1109/ACCESS.2022.3167702

[120] Sungbin Shin, Yohan Jo, Sungsoo Ahn and Namhoon Lee

“A Closer Look at the Intervention Procedure of Concept Bottleneck Models”

In Proceedings of the 40th International Conference on Machine Learning 202, Proceedings of Machine Learning Research

PMLR, 2023, pp. 31504–31520

URL: https://proceedings.mlr.press/v202/shin23a.html

[121] Carolin Stiel, Julia Elrod, Michaela Klinke, Jochen Herrmann, Carl-Martin Junge, Tarik Ghadban, Konrad Reinshagen and Michael Boettcher

“The Modified Heidelberg and the AI Appendicitis Score Are Superior to Current Scores in Predicting Appendicitis in Children: A Two-Center Cohort Study”

In Frontiers in Pediatrics 8

Frontiers Media SA, 2020

DOI: 10.3389/fped.2020.592892

[122] Masahiro Suzuki and Yutaka Matsuo

“A survey of multimodal deep generative models”

In Advanced Robotics 36.5-6

Informa UK Limited, 2022, pp. 261–278

DOI: 10.1080/01691864.2022.2035253

[123] J. Svensson, N. Hall, S. Eaton, A. Pierro and T. Wester

“A Review of Conservative Treatment of Acute Appendicitis”

In European Journal of Pediatric Surgery 22.03

Georg Thieme Verlag KG, 2012, pp. 185–194

DOI: 10.1055/s-0032-1320014

[124] Jan F. Svensson, Barbora Patkova, Markus Almström, Hussein Naji, Nigel J. Hall, Simon Eaton, Agostino Pierro and Tomas Wester

“Nonoperative Treatment With Antibiotics Versus Surgery for Acute Nonperforated Appendicitis in Children”

In Annals of Surgery 261.1

Ovid Technologies (Wolters Kluwer Health), 2015, pp. 67–71

DOI: 10.1097/sla.0000000000000835

[125] Armeen Taeb, Nicolo Ruggeri, Carina Schnuck and Fanny Yang

“Provable concept learning for interpretable predictions using variational autoencoders” arXiv:2204.00492, 2022

DOI: 10.48550/arXiv.2204.00492

[126] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid and Phillip Isola

“What Makes for Good Views for Contrastive Learning?”

In Advances in Neural Information Processing Systems 33

Red Hook, NY, United States: Curran Associates, Inc., 2020, pp. 6827–6839

[127] Joost J.M. van Griethuysen, Andriy Fedorov, Chintan Parmar, Ahmed Hosny, Nicole Aucoin, Vivek Narayan, Regina G.H. Beets-Tan, Jean-Christophe Fillion-Robin, Steve Pieper and Hugo J.W.L. Aerts

“Computational Radiomics System to Decode the Radiographic Phenotype”

In Cancer Research 77.21

American Association for Cancer Research (AACR), 2017, pp. e104–e107

DOI: 10.1158/0008-5472.can-17-0339

[128] Julius von Kügelgen, Yash Sharma, Luigi Gresele, Wieland Brendel, Bernhard Schölkopf, Michel Besserve and Francesco Locatello

“Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style”

In Advances in Neural Information Processing Systems 34

Red Hook, NY, United States: Curran Associates, Inc., 2021, pp. 16451–16467

[129] Yi Wang, Eun Jung Choi, Younhee Choi, Hao Zhang, Gong Yong Jin and Seok-Bum Ko

“Breast Cancer Classification in Automated Breast Ultrasound Using Multiview Convolutional Neural Network with Transfer Learning”

In Ultrasound in Medicine & Biology 46.5

Elsevier BV, 2020, pp. 1119–1132

DOI: 10.1016/j.ultrasmedbio.2020.01.001

[130] Lauren M Wier, Hao Yu, Pamela L Owens and Raynard Washington

“Overview of Children in the Emergency Department, 2010: Statistical Brief #157”

Rockville, MD, USA: Agency for Healthcare Research and Quality, 2013

[131] Jianfu Xia, Zhifei Wang, Daqing Yang, Rizeng Li, Guoxi Liang, Huiling Chen, Ali Asghar Heidari, Hamza Turabieh, Majdi Mafarja and Zhifang Pan

“Performance optimization of support vector machine with oppositional grasshopper optimization for acute appendicitis diagnosis”

In Computers in Biology and Medicine 143

Elsevier BV, 2022, pp. 105206

DOI: 10.1016/j.compbiomed.2021.105206

[132] Yongqin Xian, Christoph H. Lampert, Bernt Schiele and Zeynep Akata

“Zero-Shot Learning—A Comprehensive Evaluation of the Good, the Bad and the Ugly”

In IEEE Transactions on Pattern Analysis and Machine Intelligence 41.9

IEEE, 2019, pp. 2251–2265

DOI: 10.1109/tpami.2018.2857768

[133] Chang Xu, Dacheng Tao and Chao Xu

“A Survey on Multi-view Learning” arXiv:1304.5634, 2013

DOI: 10.48550/arXiv.1304.5634

[134] Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister and Pradeep Ravikumar

“On Completeness-aware Concept-Based Explanations in Deep Neural Networks”

In Advances in Neural Information Processing Systems 33

Vancouver, Canada: Curran Associates, Inc., 2020, pp. 20554–20565

[135] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu and Thomas S. Huang

“Generative Image Inpainting with Contextual Attention”

In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

Salt Lake City, UT, USA: IEEE, 2018

DOI: 10.1109/cvpr.2018.00577

[136] Mert Yuksekgonul, Maggie Wang and James Zou

“Post-hoc Concept Bottleneck Models” arXiv:2205.15480, 2022

DOI: 10.48550/arXiv.2205.15480

Acknowledgments

We thank the members of the Medical Data Science group at ETH Zurich for stimulating discussions and feedback. Our gratitude goes to Dr. Johanna Joe and Dr. Markus Ebert from the Ultrasonography Center, KUNO, St. Hedwig Clinic Regensburg, for their help with the data acquisition. We are also sincerely thankful to Marcel Buehring and the IT Service Group of the Department of Computer Science, ETH Zurich, for their assistance with the deployment of the online prediction tool. RM was supported by the SNSF grant #320038189096, EO was supported by the SNSF grant P500PT-206746. This preprint was created using the LaPreprint template (https://github.com/roaldarbol/lapreprint) by Mikkel Roald-Arbøl.

Competing interests

The authors declare no competing interests.

Supplementary Material

Appendix A Pediatric Appendicitis Dataset

The study was approved by the Ethics Committee of the University of Regensburg (no. 18-1063-101, 18-1063_1-101, and 18-1063_2-101) and was performed following applicable guidelines and regulations. The ethics committee confirmed that there was no need for written informed consent for the retrospective analysis and publication of anonymized routine data according to Art. 27 para. 4 of the Bavarian Hospital Law. For patients followed up after discharge, written informed consent was obtained from parents or legal representatives.

As mentioned, this study presents a retrospective analysis. The patients included in the cohort were managed according to the procedure summarized in Figure A.2. For the concept variables in Table 2, missing entries were imputed with the negative findings. A comprehensive dataset summary with detailed variable explanations is available at http://bit.ly/3SoA5E5.

Appendix B Synthetic Tabular Nonlinear Dataset

To compare different variants of the proposed model and baselines in a controlled manner, we designed a simple synthetic multiview dataset with nonlinear relationships between the covariates, concepts and labels. Let $N$ , $p$ , $V$ , and $K$ denote the number of data points, covariates per view, views, and concepts respectively. Below we outline the generative process behind the data:

1. Let $\boldsymbol{\mu}\in\mathbb{R}^{pV}$ be a randomly drawn vector where each component $\mu_{j}\sim\mathrm{Uniform}\left(-5,\,5\right)$ for $1\leq j\leq pV$.

2. Let $\boldsymbol{\Sigma}\in\mathbb{R}^{pV\times pV}$ be a randomly generated symmetric, positive-definite matrix.

3. Let $\boldsymbol{X}\in\mathbb{R}^{N\times pV}$ be a randomly generated feature matrix where $\boldsymbol{X}_{i,:}\sim\mathcal{N}_{pV}\left(\boldsymbol{\mu},\,\boldsymbol{\Sigma}\right)$.

4. Let $\boldsymbol{x}_{i}^{v}=\boldsymbol{X}_{i,\left(1+p(v-1)\right):pv}$ for $1\leq i\leq N$ and $1\leq v\leq V$.

5. Let $\boldsymbol{g}:\>\mathbb{R}^{pV}\rightarrow\mathbb{R}^{K}$ and $f:\>\mathbb{R}^{K}\rightarrow\mathbb{R}$ be randomly initialized MLPs with ReLU nonlinearities.

6. Let $c_{i,k}=\boldsymbol{1}_{\left\{\left[\boldsymbol{g}\left(\boldsymbol{X}_{i,:}\right)\right]_{k}\geq m_{k}\right\}}$, where $m_{k}=\mathrm{median}\left(\left\{\left[\boldsymbol{g}\left(\boldsymbol{X}_{l,:}\right)\right]_{k}\right\}_{l=1}^{N}\right)$, for $1\leq i\leq N$ and $1\leq k\leq K$.

7. Let $y_{i}=\boldsymbol{1}_{\left\{f\left(\boldsymbol{c}_{i}\right)\geq m_{y}\right\}}$, where $m_{y}=\mathrm{median}\left(\left\{f\left(\boldsymbol{c}_{l}\right)\right\}_{l=1}^{N}\right)$, for $1\leq i\leq N$.

Observe that the procedure above results in $N$ triples $\left(\left\{\boldsymbol{x}_{i}^{v}\right\}_{v=1}^{V},\,\boldsymbol{c}_{i},\,y_{i}\right)$, for $1\leq i\leq N$. In contrast to the appendicitis ultrasonography dataset (Appendix A), all data points here have the same number of views. In our experiments, we set $N=8000$, $p=500$, $V=3$, and $K=30$. Of these, 2000 data points were held out as a test set. The simulation was repeated across ten independent replications.
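The generative process above can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation: the helper `random_mlp`, the hidden width, the weight scaling, and the particular construction of the positive-definite matrix $\boldsymbol{\Sigma}$ are assumptions made for the example, since the text only states that these objects are randomly generated.

```python
import numpy as np

def random_mlp(rng, sizes):
    """Hypothetical helper: a randomly initialized ReLU MLP without biases."""
    weights = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
               for m, n in zip(sizes[:-1], sizes[1:])]

    def forward(a):
        for W in weights[:-1]:
            a = np.maximum(a @ W, 0.0)   # hidden layers with ReLU
        return a @ weights[-1]           # linear output layer
    return forward

def generate_synthetic(N=8000, p=500, V=3, K=30, hidden=100, seed=0):
    rng = np.random.default_rng(seed)
    d = p * V
    mu = rng.uniform(-5.0, 5.0, size=d)                    # step 1
    A = rng.normal(size=(d, d))
    Sigma = A @ A.T / d + 1e-3 * np.eye(d)                 # step 2 (one possible SPD construction)
    chol = np.linalg.cholesky(Sigma)
    X = mu + rng.normal(size=(N, d)) @ chol.T              # step 3: rows ~ N(mu, Sigma)
    views = [X[:, v * p:(v + 1) * p] for v in range(V)]    # step 4: split columns into views
    g = random_mlp(rng, [d, hidden, K])                    # step 5
    f = random_mlp(rng, [K, hidden, 1])
    G = g(X)
    C = (G >= np.median(G, axis=0)).astype(int)            # step 6: threshold at per-concept median
    s = f(C.astype(float))[:, 0]
    y = (s >= np.median(s)).astype(int)                    # step 7: threshold at the median score
    return views, C, y
```

Thresholding at the median means that at least half of the entries in each concept column, and at least half of the labels, equal one. The split is exactly balanced only when no values tie at the median.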

Appendix C Multiview Animals with Attributes

In addition to the purely synthetic classification task described above, we adapted a popular attribute-based classification dataset, Animals with Attributes 2 (AwA) [100, 132], to the multiview scenario. The original AwA consists of 37322 images of 50 animal classes with $K=85$ binary-valued concepts, i.e. attributes. Similar to the UCSD Birds experiment for vanilla CBMs [97], the concepts are labeled per class and not per data point, e.g. all polar bears are assumed to have white fur. We extended AwA by randomly cropping $V=4$ patches of 60 $\times$ 60 px from each original image $i$ to produce multiple “views”, as shown in Figure C.1. Note that, while the concepts are only partially observable from individual images, there is no ordering among the patches, and, for simplicity, we generate the same number of views for each data point. Nevertheless, compared with the original AwA, classification based on a single view becomes markedly more challenging. During the experiments, we divided the dataset according to a 60%-20%-20% train-validation-test split. Simulations were repeated ten times independently.
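A possible way to produce such views is sketched below. The function name `random_views` and the uniform sampling of patch locations are assumptions for illustration. The text specifies only the number of patches and their size.

```python
import numpy as np

def random_views(image, num_views=4, size=60, rng=None):
    """Crop `num_views` random square patches from an H x W x C image array."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    if height < size or width < size:
        raise ValueError("image is smaller than the patch size")
    patches = []
    for _ in range(num_views):
        top = rng.integers(0, height - size + 1)    # upper bound is exclusive
        left = rng.integers(0, width - size + 1)
        patches.append(image[top:top + size, left:left + size])
    return np.stack(patches)                        # shape: (num_views, size, size, C)
```

Because patch locations are drawn independently, the resulting views carry no meaningful order, which matches the unordered setting described above.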

Appendix D Optimization Procedure for the SSMVCBM

The training procedure is summarized in Algorithm D.1.

Appendix E Further Implementation Details

E.1 Architectures

Table E.1 provides a detailed description of the MVCBM architectures implemented in our experiments. Herein, $B$ denotes the batch size, $V$ is the maximum number of views, $K$ is the number of concepts, $H$ is the number of units in the hidden layer of $f_{\boldsymbol{\theta}}(\cdot)$, and $N_{o}$ is the number of output units. Table E.1(a) shows the architectures for the tabular synthetic data, and Table E.1(b) shows the architectures utilized for the image datasets. As can be seen, the encoder network $\boldsymbol{h}_{\boldsymbol{\psi}}(\cdot)$ is different: in the first case, it is fully connected, whereas, in the latter, it consists of the ResNet-18 without the penultimate fully connected layer. Notably, the number of output units $N_{o}$ and the activation function depend on the number of classes of the target variable.

Note that, in the appendicitis dataset, all US image sequences were padded to the length of $V=20$. However, as intended, fusion layers discard the padding and can be applied to variable-length sequences. As mentioned in Appendices B and C, we considered $V=3$ and $4$ views for the synthetic and MVAwA datasets, respectively. The number of concepts was $K=30$, $85$, and $9$ for the synthetic, MVAwA, and appendicitis datasets, respectively. Notably, in MVAwA, the input images were 224 $\times$ 224 px, while US images were 400 $\times$ 400 px. For the synthetic dataset and MVAwA, we fixed $H=100$, and for the appendicitis data, it was set to $5$. For MVAwA, the output layer was $N_{o}=50$ units wide and had Softmax activation. Since all labels in the pediatric appendicitis dataset were binary, we set $N_{o}=1$ and used Sigmoid activation.
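To illustrate how a fusion layer can ignore padding, the sketch below implements an averaging-based fusion over padded sequences. It assumes that the number of real views per data point is available as a `lengths` array. This bookkeeping and the function name are hypothetical and may differ from the released code.

```python
import numpy as np

def masked_mean_fusion(features, lengths):
    """Average per-view features over real views only.

    features: array of shape (B, V, D), padded along the view axis.
    lengths:  array of shape (B,), number of real views per data point (>= 1).
    """
    features = np.asarray(features, dtype=float)
    lengths = np.asarray(lengths)
    num_views = features.shape[1]
    mask = np.arange(num_views)[None, :] < lengths[:, None]   # (B, V), True for real views
    summed = (features * mask[:, :, None]).sum(axis=1)        # padded views contribute nothing
    return summed / lengths[:, None]
```

With this masking, the fused representation depends only on the real views, so the same layer can be applied to sequences of any length up to $V$.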

For the SSMVCBM, we had to choose architectures for the concept prediction and representation learning “branches” of the model, given by Eq. (6). For both, we utilized architectures similar to those from Table E.1. For representation learning, instead of the sigmoid, we applied the hyperbolic tangent activation function at the output of $\boldsymbol{s}_{\boldsymbol{\zeta}}^{\boldsymbol{z}}(\cdot)$. Another architectural hyperparameter of the SSMVCBM is the number of dimensions of the vector $\boldsymbol{\hat{z}}_{i}$, denoted by $J$. In the experiments reported in Figure 4, $J$ was set to the difference between the number of the ground-truth concepts and the number of the concepts given to the model during training. For the experiments from Table F.1, we set $J=24$. For pediatric appendicitis, we fixed $J=5$ across all target variables. A further architectural difference from the MVCBM is that the target model $f_{\boldsymbol{\theta}}(\cdot)$ takes $K+J$ inputs. For more detailed architecture specifications not covered above, see our code at https://github.com/i6092467/semi-supervised-multiview-cbm.
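The following sketch shows one way the $K+J$-dimensional input of the target model could be assembled from the two branches. It assumes that the concept predictions and the learned representation are simply concatenated. The function and argument names are hypothetical.

```python
import numpy as np

def target_model_input(concept_logits, repr_preactivations):
    """Build the (B, K + J) input of the target model from both branches."""
    c_hat = 1.0 / (1.0 + np.exp(-np.asarray(concept_logits, dtype=float)))  # sigmoid, values in (0, 1)
    z_hat = np.tanh(np.asarray(repr_preactivations, dtype=float))           # tanh, values in (-1, 1)
    return np.concatenate([c_hat, z_hat], axis=-1)
```

For the appendicitis data, with $K=9$ and $J=5$, the target model would accordingly receive 14 inputs.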

E.2 Hyperparameters

In all experiments, deep learning models were trained using the Adam optimizer. To avoid potential overfitting on the moderately-sized appendicitis dataset, throughout training, we applied on-the-fly data augmentation with Gaussian noise addition, random black rectangle insertion, and one additional randomly chosen transformation: brightness adjustment, rotation, shearing, resizing, change of image sharpness, or gamma correction.
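The augmentation scheme can be sketched as below. The parameter ranges, the function name `augment`, and the restriction of the extra transformation to brightness adjustment and gamma correction are assumptions made to keep the example short. The remaining transformations listed above (rotation, shearing, resizing, sharpness) are omitted.

```python
import numpy as np

def augment(image, rng, noise_std=0.05, max_rect=0.25):
    """Noise, a random black rectangle, and one extra transform for an image in [0, 1]."""
    out = image + rng.normal(0.0, noise_std, size=image.shape)   # Gaussian noise addition
    h, w = out.shape[:2]
    rect_h = rng.integers(1, int(max_rect * h) + 1)              # rectangle height in pixels
    rect_w = rng.integers(1, int(max_rect * w) + 1)              # rectangle width in pixels
    top = rng.integers(0, h - rect_h + 1)
    left = rng.integers(0, w - rect_w + 1)
    out[top:top + rect_h, left:left + rect_w] = 0.0              # random black rectangle
    out = np.clip(out, 0.0, 1.0)
    extra = rng.choice(["brightness", "gamma"])                  # illustrative subset only
    if extra == "brightness":
        out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 1.0)
    else:
        out = out ** rng.uniform(0.7, 1.5)
    return out
```

Calling `augment` anew for every image in every epoch yields the on-the-fly behavior, so the network rarely sees exactly the same input twice.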

Applicable model hyperparameter values used for the synthetic, MVAwA, and appendicitis datasets are provided in Tables E.2–E.6. The numbers of training epochs and learning rates were selectively tuned on the training set using five-fold cross-validation. In the tables below, by $E_{y}$ and $\eta_{y}$ , we denote the number of epochs used to train a model and the initial learning rate, respectively. Note that sequentially trained MVCBMs allow for a separate hyperparameter configuration for the concept model $\boldsymbol{g}_{\boldsymbol{\phi}}(\cdot)$ . We exploit this possibility for the number of epochs ( $E_{\boldsymbol{c}}$ ) and the initial learning rate ( $\eta_{\boldsymbol{c}}$ ). Due to the lack of this freedom, we have found that jointly trained MVCBMs sometimes require tuning $E_{y}$ and $\eta_{y}$ for the model weights to converge. Recall that parameter $\alpha$ controls the trade-off between the target and concept loss terms in the jointly trained concept bottleneck models. We did not explore the influence of this hyperparameter, fixing it to $\alpha=1.0$ . The remaining parameters belong to the semi-supervised variant of the MVCBM (see the procedure in Algorithm D.1): $C$ denotes the number of iterations for the adversarial training; $E_{\boldsymbol{z}}$ and $\eta_{\boldsymbol{z}}$ are the number of training epochs and learning rate, respectively, for the representation learning module; $E_{a}$ and $\eta_{a}$ are the number of epochs and learning rate for training the adversary network; and, finally, $\lambda$ is the parameter controlling the weight of the adversarial penalty in the loss function for optimizing the representation learning module and target model parameters.

Appendix F Further Results

F.1 Multiview Animals with Attributes

As mentioned, we also adapted the popular natural-image, attribute-based Animals with Attributes 2 dataset [100, 132] to multiview classification (Appendix C). The main challenge of this dataset is that only some concepts may be identifiable from every view because cropping may remove an image region with the input relevant to a specific concept. During model comparison, we trained and evaluated classifiers by performing a train-test split on several independent simulations, i.e. replicates.

For this dataset, the experimental results are very similar to those on the synthetic data: (i) multiview techniques outperform single-view techniques, as shown in Figures F.1(a)–(b); (ii) when given the complete concept set, MVCBM is comparable to an end-to-end black-box, as shown in Figure F.1(a); and (iii) the proposed multiview and semi-supervised extensions of the CBM are intervenable, as shown in Figure F.1(c). Herein, for MVCBMs, we focused on a simple approach to aggregating multiple views and a single optimization procedure; however, other design choices are plausible. Table F.1 reports additional results with alternative fusion functions and optimization schemes for the MVAwA experiment under the complete concept set.

F.2 SSMVCBM Ablation

As mentioned before, the semi-supervised variant of the proposed multiview concept bottleneck model includes an adversarial regularizer to de-correlate learned representations $\boldsymbol{\hat{z}}\in\mathbb{R}^{J}$ and concept predictions $\boldsymbol{\hat{c}}\in\mathbb{R}^{K}$ (Eq. (7)). To better understand the impact of this regularization on the model’s predictive performance and intervenability, we performed an ablation study on the synthetic tabular nonlinear and MVAwA datasets by training semi-supervised concept bottlenecks under varying values of the regularization parameter $\lambda\in\{0,\,0.01,\,0.1\}$ .

In this experiment, we assessed the predictive performance and intervenability of the resulting models and the correlation among the individual dimensions of the concept predictions and representations. For the latter, we have utilized Pearson’s correlation coefficient conditional on the target variable; in particular, we have looked at the median absolute value of the pairwise correlation coefficient given by $\mathrm{median}_{i,j,k}\>\left|\widehat{\mathrm{corr}}\left(\hat{c}_{i},\hat{z}_{j}\,|\,y=k\right)\right|$ , where $\hat{c}_{i}$ and $\hat{z}_{j}$ denote the $i$ -th and $j$ -th components of the concept and representation vectors, respectively. For both datasets, the experiment was run under the incomplete concept set: $K=5$ (out of 30) observed concepts and $J=25$ for synthetic data and $K=10$ (out of 85) and $J=75$ for MVAwA. All results reported below correspond to the multiview CBMs with the averaging-based fusion. Figure F.2 summarizes the results of the ablation study.
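The correlation summary defined above can be computed as in the following sketch. The function name and the use of `np.nanmedian` to skip undefined correlations (e.g. from constant columns) are assumptions.

```python
import numpy as np

def median_abs_conditional_corr(c_hat, z_hat, y):
    """Median over (i, j, k) of |corr(c_hat[:, i], z_hat[:, j] | y == k)|."""
    c_hat = np.asarray(c_hat, dtype=float)
    z_hat = np.asarray(z_hat, dtype=float)
    y = np.asarray(y)
    K = c_hat.shape[1]
    values = []
    for label in np.unique(y):
        rows = y == label                                    # condition on the target value
        stacked = np.hstack([c_hat[rows], z_hat[rows]])      # columns: K concepts, then J representations
        corr = np.corrcoef(stacked, rowvar=False)            # (K + J) x (K + J) Pearson matrix
        values.append(np.abs(corr[:K, K:]).ravel())          # cross block: concepts vs. representations
    return float(np.nanmedian(np.concatenate(values)))
```

Values near zero indicate that, within each class, the learned representations carry little linear information about the predicted concepts.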

As expected, stronger regularization hurts performance at predicting the target variable but allows learning representations de-correlated from the given concepts, as shown in Figures F.2(a) and F.2(b). However, even in the absence of the adversarial regularization ( $\lambda=0$ ), $\boldsymbol{\hat{c}}$ and $\boldsymbol{\hat{z}}$ are already relatively weakly correlated. Importantly, regularized models demonstrate a steeper increase in predictive performance during interventions on predicted concepts (Figure F.2(c)). Moreover, when most of the concepts have been intervened on, the unregularized model predicts the target variable more poorly than the regularized ones.

In summary, we observed that the adversarial regularizer in the SSMVCBM’s loss function helps learn de-correlated representations and improves the model’s intervenability, although it may reduce the predictive performance of the non-intervened model. In future work, it would be interesting to seek alternative regularization techniques for disentangling concepts and representations, possibly focusing explicitly on intervenability.

F.3 LSTM-based Fusion and View Ordering

To explore the sensitivity of the LSTM-based multiview concept bottlenecks to the view order, we performed an additional experiment on the pediatric appendicitis dataset, where the views are ordered chronologically. After randomly shuffling the views within every subject, we applied all LSTM-based multiview CBMs to the test set. Thus, the models trained on ordered data were assessed on the set with the perturbed order.
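The shuffling step can be sketched as follows, assuming padded sequences with a known number of real views per subject. The function name and the `lengths` argument are hypothetical.

```python
import numpy as np

def shuffle_views(sequences, lengths, rng):
    """Randomly permute the real views of each subject, leaving padding in place.

    sequences: array of shape (N, V, ...), padded along the view axis.
    lengths:   number of real views for each of the N subjects.
    """
    sequences = np.asarray(sequences)
    shuffled = sequences.copy()
    for i, n in enumerate(lengths):
        perm = rng.permutation(n)                 # new order of the n real views
        shuffled[i, :n] = sequences[i, perm]
    return shuffled
```

Evaluating a trained model on both `sequences` and `shuffle_views(sequences, lengths, rng)` then isolates the effect of view order, since the content of each subject's views is unchanged.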

Table F.2 reports the predictive performance obtained on the original test set (for reference) and after shuffling the views for the three target variables. As expected, all LSTM-based multiview CBMs are sensitive to the order of inputs, especially when predicting the management and severity. The relative performance decrease after shuffling is particularly pronounced for the SSMVCBM.

Thus, the results suggest that the LSTM-based fusion is sensitive to the order of input views. When deployed, LSTM-based (SS)MVCBMs should therefore be applied with caution, preserving the ordering of images represented in the training data.

F.4 True and False Positive Rates for Predicting Appendicitis

In addition to the overall AUROCs and AUPRs for the target prediction reported in Table 6, for the diagnosis, we examined false positive rates attained by the predictive models at fixed percentages of the true positive rate. The results of this analysis are shown in Table F.3 for the TPRs of 75, 80, 90, 95, and 99%.
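A simple way to obtain the FPR at a fixed TPR is sketched below. It scans the empirical thresholds and returns the smallest FPR among those reaching the requested TPR. Whether the reported values use this rule or interpolate along the ROC curve is not stated, so this choice is an assumption.

```python
import numpy as np

def fpr_at_tpr(y_true, scores, target_tpr):
    """Smallest FPR among score thresholds whose TPR is at least `target_tpr`."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    best = 1.0                                   # predicting everything positive is always feasible
    for threshold in np.unique(scores):
        pred = scores >= threshold
        tpr = (pred & y_true).sum() / y_true.sum()
        fpr = (pred & ~y_true).sum() / (~y_true).sum()
        if tpr >= target_tpr:
            best = min(best, fpr)
    return float(best)
```

For example, with labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, reaching a TPR of 100% requires accepting one of the two negatives, giving an FPR of 50%.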

All models have relatively high FPRs, $>30\%$ , at all considered TPRs. Thus, in general, the number of false positive predictions necessary to achieve a satisfactory TPR is too high for the predictive models to be used without supervision by a human expert. For the TPRs of 95 and 99%, most models have an FPR of at least 80%.

Generally, the ordering among the models w.r.t. the predictive performance and relative performance of the different model classes are similar to those based on AUROCs and AUPRs (Table 6). In a similar vein to the results reported in Section 3.3, this further exploration suggests that the models’ performance has to be improved for them to be practical and autonomous.

F.5 Brier Scores for Concept Prediction

To supplement AUROCs and AUPRs reported in Tables 3-5, we evaluated concept predictions in terms of the Brier score, as shown in Table F.4. Note that the scores were not adjusted for class imbalance, and most concept variables had few positive observations (Table 2). The findings from this analysis are described and discussed in Sections 3 and 4 of the main text, respectively.
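As a reminder of what the unadjusted Brier score measures for a binary concept, the sketch below computes the mean squared difference between predicted probabilities and binary labels. The function `brier_score` and the toy data are illustrative assumptions and are not taken from the experimental code.

```python
def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and binary labels."""
    if len(y_true) != len(p_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


# Hypothetical imbalanced concept: one positive among ten observations.
concept_labels = [1] + [0] * 9
# A constant prediction equal to the base rate carries no per-patient information.
constant_predictions = [0.1] * 10
score = brier_score(concept_labels, constant_predictions)
```

Here the constant predictor attains a score of about 0.09 despite being uninformative, which illustrates why unadjusted Brier scores for rarely positive concepts should be read alongside the AUROCs and AUPRs.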

Appendix G Online Prediction Tool

Below, we provide details on the implementation of our online pediatric appendicitis prediction tool. We must emphasize that the current version is a research prototype and should be utilized solely for non-commercial, educational purposes, and not for clinical decision-making. The web tool deploys a multiview concept bottleneck model trained to predict the diagnosis using the sequential optimization procedure and LSTM to fuse the views (MVCBM-seq-LSTM in Table 6). We use a single set of parameters obtained after training from one of the initializations included in the experiments. Note that the model was not re-trained on the complete dataset.

Workflow

Figure G.1 contains a workflow diagram for the website. Specifically, the worker thread handles incoming requests and creates a new server-side session if it does not exist for the current user. The images uploaded by the user are saved in the session. If requested, UI element regions are masked and filled, and CLAHE is applied to the input images. Note that due to the characteristics of the images, the effectiveness of preprocessing may be limited. In particular, UI artifacts, such as text, logos and diagrams, which differ considerably from those in our collected dataset, may not be completely masked and filled. The processed images are then forwarded to the trained MVCBM network, which predicts the concept values and the diagnosis label and displays them. The user may intervene if they choose and re-calculate the final prediction using adjusted concept values. In the background, the session cleanup thread is started along with the web application. It iterates every 60 seconds over all stored session objects. Sessions that have been inactive for over 30 minutes are eliminated, along with all related data. After this, no data provided by the user or data resulting from processing the user’s uploaded data are retained.
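The session cleanup logic described above could look roughly like the following sketch. It is not the web tool's actual implementation: the class `SessionStore`, its methods, and `start_cleanup_thread` are hypothetical names, and only the 60-second interval and the 30-minute inactivity limit are taken from the text.

```python
import threading
import time

CLEANUP_INTERVAL_S = 60   # one cleanup pass every 60 seconds
MAX_IDLE_S = 30 * 60      # sessions inactive for over 30 minutes are removed


class SessionStore:
    """Hypothetical in-memory session store, for illustration only."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}  # session id -> {"last_active": float, "data": dict}

    def touch(self, session_id, now=None):
        """Create the session if it does not exist and refresh its timestamp."""
        now = time.time() if now is None else now
        with self._lock:
            session = self._sessions.setdefault(session_id, {"data": {}})
            session["last_active"] = now
            return session["data"]

    def purge_inactive(self, now=None):
        """Delete sessions idle for longer than MAX_IDLE_S; return their ids."""
        now = time.time() if now is None else now
        with self._lock:
            stale = [sid for sid, s in self._sessions.items()
                     if now - s["last_active"] > MAX_IDLE_S]
            for sid in stale:
                del self._sessions[sid]  # also drops images and results stored in it
        return stale

    def __contains__(self, session_id):
        with self._lock:
            return session_id in self._sessions


def start_cleanup_thread(store, stop_event):
    """Call purge_inactive every CLEANUP_INTERVAL_S seconds until stop_event is set."""
    def loop():
        while not stop_event.wait(CLEANUP_INTERVAL_S):
            store.purge_inactive()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

In such a design, the application would call `start_cleanup_thread(store, threading.Event())` once at startup and `store.touch(session_id)` on every request, so that uploaded data live only inside the session entry and disappear together with it.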

Bibliography


[1] Amish Acharya, Sheraz R. Markar, Melody Ni and George B. Hanna “Biomarkers of acute appendicitis: systematic review and cost–benefit trade-off analysis” In Surgical Endoscopy 31.3 Springer Science and Business Media LLC, 2016, pp. 1022–1031 DOI: 10.1007/s00464-016-5109-1
[2] Ehsan Adeli, Qingyu Zhao, Adolf Pfefferbaum, Edith V. Sullivan, Li Fei-Fei, Juan Carlos Niebles and Kilian M. Pohl “Representation Learning with Statistical Independence to Mitigate Bias” In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV) Waikoloa, HI, USA: IEEE, 2021 DOI: 10.1109/wacv48630.2021.00256
[3] Omer F. Akmese, Gul Dogan, Hakan Kor, Hasan Erbay and Emre Demir “The Use of Machine Learning Approaches for the Diagnosis of Acute Appendicitis” In Emergency Medicine International 2020 Hindawi Limited, 2020, pp. 1–8 DOI: 10.1155/2020/7306435
[4] Alfredo Alvarado “A practical score for the early diagnosis of acute appendicitis” In Annals of Emergency Medicine 15.5, 1986, pp. 557–564 DOI: 10.1016/S0196-0644(86)80993-3
[5] Roland E. Andersson “The Natural History and Traditional Management of Appendicitis Revisited: Spontaneous Resolution and Predominance of Prehospital Perforations Imply That a Correct Diagnosis is More Important Than an Early Diagnosis” In World Journal of Surgery 31.1 Springer Science and Business Media LLC, 2006, pp. 86–92 DOI: 10.1007/s00268-006-0056-y
[6] Emrah Aydin, İnan Utku Türkmen, Gözde Namli, Çiğdem Öztürk, Ayşe B. Esen, Y. Eray, Egemen Eroğlu and Fatih Akova “A novel and simple machine learning algorithm for preoperative diagnosis of acute appendicitis in children” In Pediatric Surgery International 36.6 Springer Science and Business Media LLC, 2020, pp. 735–742 DOI: 10.1007/s00383-020-04655-7
[7] Tadas Baltrušaitis, Chaitanya Ahuja and Louis-Philippe Morency “Multimodal Machine Learning: A Survey and Taxonomy” In IEEE Transactions on Pattern Analysis and Machine Intelligence 41.2 Institute of Electrical and Electronics Engineers (IEEE), 2019, pp. 423–443 DOI: 10.1109/TPAMI.2018.2798607
[8] Aneel Bhangu, Kjetil Søreide, Salomone Di Saverio, Jeanette Hansson Assarsson and Frederick Thurston Drake “Acute appendicitis: modern understanding of pathogenesis, diagnosis, and management” In The Lancet 386.10000 Elsevier BV, 2015, pp. 1278–1287 DOI: 10.1016/s0140-6736(15)00275-5