Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts

Zixuan Hu; Dongxiao Li; Xinzhu Ma; Shixiang Tang; Xiaotong Li; Wenhan Yang; Ling-Yu Duan

arXiv:2508.20488·cs.CV·August 29, 2025

TL;DR

This paper introduces DUO, a novel test-time adaptation framework that jointly minimizes semantic and geometric uncertainties to improve monocular 3D object detection robustness under real-world domain shifts.

Contribution

We propose the first TTA method for M3OD that addresses dual uncertainties with a convex focal loss structure and semantic-aware geometric constraints.

Findings

1. DUO outperforms existing methods across multiple datasets.

2. The convex focal loss enables label-agnostic uncertainty weighting.

3. Semantic-aware constraints improve geometric coherence.

Abstract

Accurate monocular 3D object detection (M3OD) is pivotal for safety-critical applications like autonomous driving, yet its reliability deteriorates significantly under real-world domain shifts caused by environmental or sensor variations. To address these shifts, Test-Time Adaptation (TTA) methods have emerged, enabling models to adapt to target distributions during inference. While prior TTA approaches recognize the positive correlation between low uncertainty and high generalization ability, they fail to address the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). To bridge this gap, we propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. Through a convex optimization lens, we introduce an innovative convex…

Tables

Table 1: Comparisons with state-of-the-art methods on the KITTI-C validation set (severity level 5) with MonoFlex as the base model. We highlight the best and second-best results with bold and underline, respectively.

Car Category
Method	Reference	Noise			Blur			Weather				Digital			Avg
Method	Reference	Gauss.	Shot	Impul.	Defoc.	Glass	Motion	Snow	Frost	Fog	Brit.	Contr.	Pixel	Sat.	Avg
MonoFlex	CVPR’21	0.00	0.00	0.00	0.00	11.02	0.00	5.07	12.94	7.70	14.46	0.00	2.13	5.65	4.54
$∙$ TENT	ICLR’21	5.31	11.09	6.40	2.22	25.49	2.14	23.88	28.96	35.40	37.07	24.67	22.59	30.63	19.68
$∙$ EATA	ICML’22	5.44	12.12	4.67	2.75	25.66	2.45	24.77	28.99	35.67	36.95	24.33	22.47	34.10	20.03
$∙$ DeYO	ICLR’24	5.78	12.52	4.52	3.01	26.05	2.98	24.91	29.40	35.19	37.59	23.75	23.82	34.33	20.30
$∙$ MonoTTA	ECCV’24	5.93	13.34	4.05	3.35	28.10	3.21	25.86	29.10	36.43	37.18	25.90	25.01	33.89	20.87
$∙$ Ours	This paper	7.30	15.40	9.36	4.34	30.23	6.89	29.09	29.76	38.38	37.72	29.35	25.88	34.97	22.97
Pedestrian Category
Method	Reference	Noise			Blur			Weather				Digital			Avg
Method	Reference	Gauss.	Shot	Impul.	Defoc.	Glass	Motion	Snow	Frost	Fog	Brit.	Contr.	Pixel	Sat.	Avg
MonoFlex	CVPR’21	0.00	0.00	0.00	0.00	4.09	0.00	0.51	1.67	1.77	2.09	0.00	0.32	0.95	0.88
$∙$ TENT	ICLR’21	0.71	2.07	0.86	1.36	12.70	0.81	6.72	8.33	11.89	12.79	7.42	6.45	9.78	6.30
$∙$ EATA	ICML’22	0.98	2.15	0.76	1.44	12.23	0.80	6.77	8.67	11.60	13.62	7.53	6.47	10.33	6.41
$∙$ DeYO	ICLR’24	1.03	2.25	0.91	1.64	11.85	0.80	6.63	9.09	11.77	13.99	7.59	6.39	10.57	6.50
$∙$ MonoTTA	ECCV’24	1.77	2.88	0.34	1.78	12.38	0.82	7.03	9.02	12.31	13.11	7.75	7.10	11.08	6.72
$∙$ Ours	This paper	1.89	3.08	1.54	1.86	13.53	1.75	7.68	9.09	12.66	13.99	7.81	7.27	11.27	7.19
Cyclist Category
Method	Reference	Noise			Blur			Weather				Digital			Avg
Method	Reference	Gauss.	Shot	Impul.	Defoc.	Glass	Motion	Snow	Frost	Fog	Brit.	Contr.	Pixel	Sat.	Avg
MonoFlex	CVPR’21	0.00	0.00	0.00	0.00	0.24	0.00	2.14	2.33	1.72	4.41	0.00	0.00	0.00	0.83
$∙$ TENT	ICLR’21	0.06	0.14	0.04	0.04	4.55	0.93	6.63	8.23	11.94	15.16	7.72	1.85	2.81	4.62
$∙$ EATA	ICML’22	0.05	0.15	0.03	0.02	4.66	1.10	6.73	7.58	13.77	14.99	7.32	2.03	2.82	4.71
$∙$ DeYO	ICLR’24	0.06	0.19	0.03	0.03	4.91	1.08	6.48	6.91	13.94	14.67	7.57	1.79	2.82	4.65
$∙$ MonoTTA	ECCV’24	0.05	0.12	0.01	0.02	4.80	1.25	6.75	8.24	13.31	14.95	7.55	2.11	2.88	4.77
$∙$ Ours	This paper	0.11	0.22	0.07	0.10	6.00	2.00	6.89	8.41	13.58	15.94	7.94	2.13	2.92	5.10

Table 2: Comparisons with state-of-the-art methods on the KITTI-C validation set (severity level 5) with MonoGround as the base model. Due to the space limit, the complete results for the three categories are provided in Tab. 7 of Appendix B.

Car Category
Method	Reference	Noise			Blur			Weather				Digital			Avg
Method	Reference	Gauss.	Shot	Impul.	Defoc.	Glass	Motion	Snow	Frost	Fog	Brit.	Contr.	Pixel	Sat.	Avg
MonoGround	CVPR’22	0.00	0.00	0.00	0.00	11.63	0.29	1.95	6.59	3.14	19.25	0.00	4.66	3.74	3.94
$∙$ TENT	ICLR’21	6.82	14.81	8.21	4.88	28.38	2.65	23.92	28.08	33.06	36.70	20.22	30.63	33.27	20.90
$∙$ EATA	ICML’22	7.12	15.26	8.81	5.09	29.08	2.52	24.18	28.03	33.43	36.78	21.61	30.50	33.42	21.22
$∙$ DeYO	ICLR’24	7.35	15.72	9.38	5.74	30.01	2.99	25.03	28.55	34.32	37.31	23.41	30.99	34.16	21.92
$∙$ MonoTTA	ECCV’24	7.88	16.73	10.35	5.97	31.19	3.06	25.24	28.99	34.85	37.82	25.00	31.61	34.79	22.57
$∙$ Ours	This paper	9.72	18.88	12.74	7.24	33.02	5.24	28.50	30.73	37.27	39.40	28.34	33.22	37.24	24.73

Table 3: Comparison with baselines on Daytime $\leftrightarrow$ Night and Sunny $\leftrightarrow$ Rainy shifts of the nuScenes dataset, in terms of ${\rm AP}_{{\rm 3D}|{\rm R}_{40}}$.

Method	MonoFlex			MonoGround
Method	D $\to$ N	N $\to$ D	Avg.	D $\to$ N	N $\to$ D	Avg.
Source model	1.53	2.75	2.14	6.97	1.09	4.03
$∙$ TENT	3.33	3.45	3.39	8.36	1.66	5.01
$∙$ DeYO	4.72	4.87	4.79	11.01	1.40	6.21
$∙$ MonoTTA	6.92	3.68	5.30	13.61	1.29	7.45
$∙$ Ours	9.05	5.41	7.23	15.70	1.91	8.81
Method	S $\to$ R	R $\to$ S	Avg.	S $\to$ R	R $\to$ S	Avg.
Source model	6.86	10.91	8.89	8.03	7.44	7.73
$∙$ TENT	8.53	11.61	10.07	9.94	8.71	9.33
$∙$ DeYO	9.33	12.04	10.68	10.54	9.10	9.82
$∙$ MonoTTA	9.47	12.55	11.01	10.89	8.90	9.90
$∙$ Ours	11.54	13.21	12.38	12.88	9.54	11.21

Table 4: Effects of components in our method. We conduct ablation studies of the conjugate focal loss $\mathcal{L}_{\text{CFL}}$, normal consistency loss $\mathcal{L}_{\text{NCL}}$, and semantic guidance $\mathcal{M}$ on the KITTI-C validation set. "Src." denotes the source model without adaptation.

Src.	$ℒ_{CFL}$	$ℒ_{NCL}$	$ℳ$	MonoFlex				MonoGround
Src.	$ℒ_{CFL}$	$ℒ_{NCL}$	$ℳ$	Car	Pedes.	Cyc.	Avg.	Car	Pedes.	Cyc.	Avg.
✔				4.54	0.88	0.83	2.08	3.94	1.79	0.52	2.08
	✔			20.98	6.60	4.32	10.63	22.89	8.86	2.51	11.42
		✔		12.38	4.63	2.82	6.61	15.69	6.40	1.91	8.00
	✔	✔		22.17	6.99	4.72	11.29	24.29	9.27	2.78	12.11
		✔	✔	16.49	6.23	4.87	9.20	19.68	7.98	2.94	10.20
	✔	✔	✔	22.97	7.19	5.10	11.75	24.73	9.62	3.02	12.46

Table 5: Comparisons with state-of-the-art methods on the KITTI-C validation set (severity level 3) in the Car category. We highlight the best and second-best results with bold and underline, respectively.

Method	Reference	Noise			Blur			Weather				Digital			Avg
Method	Reference	Gauss.	Shot	Impul.	Defoc.	Glass	Motion	Snow	Frost	Fog	Brit.	Contr.	Pixel	Sat.	Avg
MonoFlex	CVPR’21	0.63	0.49	0.63	1.09	26.10	0.71	14.21	15.88	10.16	27.88	4.41	11.61	39.25	11.77
$∙$ TENT	ICLR’21	17.99	26.99	22.29	13.46	35.73	9.36	32.52	30.99	38.13	40.67	39.28	34.46	43.37	29.63
$∙$ EATA	ICML’22	18.21	27.52	22.83	14.86	36.01	13.98	33.11	31.45	38.35	40.62	39.55	35.23	43.44	30.39
$∙$ DeYO	ICLR’24	18.36	28.49	23.15	15.04	36.44	16.38	33.67	31.32	38.57	40.75	39.93	35.81	43.58	30.89
$∙$ MonoTTA	ECCV’24	19.64	28.37	24.45	17.79	35.91	17.20	34.11	31.78	39.45	40.83	40.74	36.27	43.46	31.54
$∙$ Ours	This paper	21.18	29.43	25.43	19.09	36.85	18.88	35.32	31.96	39.77	41.64	41.71	36.73	43.93	32.46
MonoGround	CVPR’22	0.51	0.52	0.86	2.47	25.71	0.35	10.68	9.99	5.59	32.31	0.81	14.94	36.06	10.83
$∙$ TENT	ICLR’21	20.01	31.16	25.56	17.72	38.63	10.47	33.58	30.83	38.06	42.78	40.11	39.56	45.49	31.84
$∙$ EATA	ICML’22	20.36	31.84	26.63	18.39	38.77	14.21	34.03	31.08	37.93	42.32	40.32	39.57	45.30	32.37
$∙$ DeYO	ICLR’24	20.80	32.31	27.32	19.33	38.63	15.37	34.58	31.43	33.95	42.97	40.33	39.83	45.20	32.47
$∙$ MonoTTA	ECCV’24	22.10	33.93	28.35	22.49	39.88	16.64	32.71	32.42	39.93	42.69	40.61	39.92	44.85	33.58
$∙$ Ours	This paper	23.65	34.79	29.23	23.08	40.63	18.66	34.75	33.38	40.39	42.97	41.64	40.53	44.91	34.51

Table 6: Comparisons with state-of-the-art methods on the KITTI-C validation set (severity level 1) in the Car category. We highlight the best and second-best results with bold and underline, respectively.

Method	Reference	Noise			Blur			Weather				Digital			Avg
Method	Reference	Gauss.	Shot	Impul.	Defoc.	Glass	Motion	Snow	Frost	Fog	Brit.	Contr.	Pixel	Sat.	Avg
MonoFlex	CVPR’21	12.97	20.42	15.02	20.37	36.51	11.61	32.26	30.61	19.69	45.33	20.01	29.09	42.44	25.87
$∙$ TENT	ICLR’21	29.84	38.55	34.54	34.93	40.52	25.08	39.68	40.53	40.30	44.37	44.04	41.10	43.92	38.26
$∙$ EATA	ICML’22	30.13	38.69	34.77	35.43	40.16	27.93	39.85	40.38	40.60	44.81	44.43	41.39	44.30	38.68
$∙$ DeYO	ICLR’24	30.58	38.82	34.93	36.04	41.00	28.64	39.96	40.51	40.62	44.79	44.46	41.48	44.90	38.98
$∙$ MonoTTA	ECCV’24	32.34	39.05	35.68	36.58	40.69	30.25	39.70	40.01	41.22	44.76	44.88	41.91	44.20	39.33
$∙$ Ours	This paper	32.28	39.42	36.78	36.87	40.91	31.33	39.88	40.71	41.23	44.90	44.41	42.47	44.44	39.66
MonoGround	CVPR’22	13.05	22.05	19.41	20.75	38.72	8.40	30.65	27.66	14.56	46.22	14.95	33.40	36.29	25.08
$∙$ TENT	ICLR’21	34.94	42.76	37.93	37.79	44.95	25.15	40.67	42.77	41.26	47.05	45.12	43.73	46.56	40.82
$∙$ EATA	ICML’22	35.36	42.47	38.85	38.24	44.87	26.44	40.64	42.61	41.65	46.94	45.18	43.71	46.54	41.04
$∙$ DeYO	ICLR’24	35.88	42.07	39.86	38.51	44.81	28.01	40.60	42.49	41.95	46.83	45.25	43.67	46.54	41.27
$∙$ MonoTTA	ECCV’24	37.05	42.86	39.52	39.25	44.59	32.66	40.54	42.47	42.13	45.95	44.98	43.38	46.15	41.66
$∙$ Ours	This paper	37.25	43.31	40.21	39.80	45.30	34.16	41.39	42.80	42.84	46.61	45.93	43.80	46.66	42.31

Table 7: Comparisons with state-of-the-art methods on the KITTI-C validation set (severity level 5) with MonoGround. We highlight the best and second-best results with bold and underline, respectively.

Car Category
Method	Reference	Noise			Blur			Weather				Digital			Avg
Method	Reference	Gauss.	Shot	Impul.	Defoc.	Glass	Motion	Snow	Frost	Fog	Brit.	Contr.	Pixel	Sat.	Avg
MonoGround	CVPR’22	0.00	0.00	0.00	0.00	11.63	0.29	1.95	6.59	3.14	19.25	0.00	4.66	3.74	3.94
$∙$ TENT	ICLR’21	6.82	14.81	8.21	4.88	28.38	2.65	23.92	28.08	33.06	36.70	20.22	30.63	33.27	20.90
$∙$ EATA	ICML’22	7.12	15.26	8.81	5.09	29.08	2.52	24.18	28.03	33.43	36.78	21.61	30.50	33.42	21.22
$∙$ DeYO	ICLR’24	7.35	15.72	9.38	5.74	30.01	2.99	25.03	28.55	34.32	37.31	23.41	30.99	34.16	21.92
$∙$ MonoTTA	ECCV’24	7.88	16.73	10.35	5.97	31.19	3.06	25.24	28.99	34.85	37.82	25.00	31.61	34.79	22.57
$∙$ Ours	This paper	9.72	18.88	12.74	7.24	33.02	5.24	28.50	30.73	37.27	39.40	28.34	33.22	37.24	24.73
Pedestrian Category
Method	Reference	Noise			Blur			Weather				Digital			Avg
Method	Reference	Gauss.	Shot	Impul.	Defoc.	Glass	Motion	Snow	Frost	Fog	Brit.	Contr.	Pixel	Sat.	Avg
MonoGround	CVPR’22	0.00	0.00	0.00	0.00	14.76	0.00	0.28	0.74	0.68	4.63	0.00	0.34	1.80	1.79
$∙$ TENT	ICLR’21	1.47	2.91	1.01	1.19	15.19	0.66	6.98	10.44	14.95	17.49	11.10	10.72	8.72	7.91
$∙$ EATA	ICML’22	1.85	2.86	1.05	1.31	14.02	0.79	7.41	10.08	14.72	17.57	11.31	11.20	9.38	7.97
$∙$ DeYO	ICLR’24	2.25	2.81	1.08	1.46	13.28	0.92	7.75	9.74	14.45	17.64	11.49	11.68	9.99	8.04
$∙$ MonoTTA	ECCV’24	2.40	4.74	1.52	1.60	16.31	1.09	8.95	11.06	14.72	17.96	10.62	12.39	12.11	8.88
$∙$ Ours	This paper	2.26	5.03	1.85	2.24	16.26	2.29	10.44	12.37	15.50	18.89	12.59	12.35	12.95	9.62
Cyclist Category
Method	Reference	Noise			Blur			Weather				Digital			Avg
Method	Reference	Gauss.	Shot	Impul.	Defoc.	Glass	Motion	Snow	Frost	Fog	Brit.	Contr.	Pixel	Sat.	Avg
MonoGround	CVPR’22	0.00	0.00	0.00	0.00	0.47	0.00	0.10	1.20	0.21	3.85	0.00	0.76	0.19	0.52
$∙$ TENT	ICLR’21	1.77	0.14	0.04	0.07	2.92	0.31	1.92	2.70	6.90	8.14	1.08	1.51	2.71	2.32
$∙$ EATA	ICML’22	0.88	0.13	0.05	0.06	2.94	0.40	2.03	2.81	6.91	8.36	1.32	1.52	2.98	2.34
$∙$ DeYO	ICLR’24	0.00	0.12	0.07	0.06	2.96	0.49	2.13	2.93	6.91	8.58	1.57	1.54	3.25	2.35
$∙$ MonoTTA	ECCV’24	0.04	0.10	0.04	0.15	3.59	0.52	2.51	3.96	8.45	7.80	3.00	2.90	3.61	2.82
$∙$ Ours	This paper	0.05	0.14	0.07	0.24	4.01	0.70	2.67	4.20	8.79	8.22	3.91	2.55	3.72	3.02

Table 8: Running time comparison of various methods. We measure the time each TTA approach takes to process 1k images under the Gaussian corruption type, using a single Nvidia RTX 4090 GPU.

Metrics	Source Model	TENT	EATA	DeYO	MonoTTA	Ours
Running Time	26s	31s	29s	87s	33s	32s

Equations

\{\mathcal{B}_{t}\}, \{\mathcal{C}_{t}\} \leftarrow h_{\theta}(\text{I}_{t}), \quad \theta \leftarrow \theta - \nabla \mathcal{L}_{tta}(\text{I}_{t}),

\mathcal{L}_{\text{FL}}(x, y) = -\alpha (1-p)^{\gamma} \cdot y \log p

\mathcal{L}_{\text{FL}}(x, y) = f(h) - y^{\top} g(h), \quad f(h) = \alpha \log s, \quad g(h) = \alpha h + \alpha\left((1-p)^{\gamma} - 1\right) \log p

\min_{h} \left\{ f(h) - y^{\top} g(h) \right\} = \min_{z = g(h)} \left\{ f \circ g^{-1}(z) - y^{\top} z \right\} = f^{*}(y).

f \circ g^{-1}(z) - y^{\top} z = f^{*}(y), \quad \nabla_{z}(f \circ g^{-1}) = y.

y_{0} \triangleq \left. \frac{\nabla_{h}(f \circ g^{-1})}{\nabla_{h} z} \right|_{z = g(h)} = \nabla_{h} g(h)^{-1} \cdot \nabla_{h} f(h) \approx \left( I + \gamma (1 - \log p) \cdot p p^{\top} - \gamma \log p \cdot \text{diag}(p) \right)^{-1} p,

L_{CFL} (x) =

L_{CFL} (x) =

-

\nabla D_{x} = S_{x} * D, \quad \nabla D_{y} = S_{y} * D,

N(u, v) = \frac{1}{\nabla D_{x}^{2} + \nabla D_{y}^{2} + 1} \left[ -\nabla D_{x}, \; -\nabla D_{y}, \; 1 \right]^{\top},

\psi_{x}(u, v) = \left\| 2 N(u, v) - N(u + 1, v) - N(u - 1, v) \right\|_{2}^{2},

\psi_{y}(u, v) = \left\| 2 N(u, v) - N(u, v + 1) - N(u, v - 1) \right\|_{2}^{2},

\mathcal{L}_{\text{NCL}}(u, v) = \left( \psi_{x}(u, v) + \psi_{y}(u, v) \right) \cdot \exp\left( -\left\| \nabla \text{I}(u, v) \right\|_{2} \right),

\mathcal{R} = \{ i \mid U_{i} \leq \overline{U} \}, \quad \overline{U} \leftarrow \beta \cdot \frac{\sum_{i=1}^{n} U_{i}}{n} + (1 - \beta) \cdot \overline{U},

\mathcal{M}(u, v) = \max_{i \in \mathcal{R}} s_{i} \cdot \mathbb{I}_{\text{inside}}(u, v \mid \mathcal{B}_{i}),

\min_{\theta} \sum_{x \in \text{I}} \mathcal{L}_{\text{CFL}}(x) + \lambda \sum_{(u, v) \in \text{I}} \mathcal{M}(u, v) \cdot \mathcal{L}_{\text{NCL}}(u, v),

\mathcal{L}_{\text{FL}}(x, y) = -\alpha(1-p)^{\gamma} \cdot y \left( \log \begin{pmatrix} e^{h_{1}} \\ \vdots \\ e^{h_{c}} \end{pmatrix} - \log\left( e^{h_{1}} + \cdots + e^{h_{c}} \right) \right)

= -\alpha(1-p)^{\gamma} \cdot y \cdot \log \begin{pmatrix} e^{h_{1}} \\ \vdots \\ e^{h_{c}} \end{pmatrix} + \alpha(1-p)^{\gamma} \, y \log s

= -\alpha(1-p)^{\gamma} \cdot y \cdot h + \alpha \cdot y \log s + \alpha\left( (1-p)^{\gamma} - 1 \right) y \log s.

\mathcal{L}_{\text{FL}}(x, y) = \alpha \log s + \alpha y \left( -(1-p)^{\gamma} \cdot (h - \log s) - \log s \right)

= \alpha \log s + \alpha y \left( -\log s - (1-p)^{\gamma} \log p \right)

= \alpha \log s - y^{\top} \left( \alpha \log s + \alpha (1-p)^{\gamma} \log p \right)

= \underbrace{\alpha \log s}_{f(h)} - y^{\top} \underbrace{\left( \alpha h + \alpha\left( (1-p)^{\gamma} - 1 \right) \log p \right)}_{g(h)}.

\nabla_{h} g = I + \text{diag}\left( p - 2 - 2(1-p) \log p \right) \cdot \left( \text{diag}(p) - p p^{\top} \right) := I + D \cdot H.

v^{\top} H v = \sum_{i=1}^{C} p_{i} v_{i}^{2} - \left( \sum_{i=1}^{C} p_{i} v_{i} \right)^{2}

v^{\top} H v = \mathbb{E}[V^{2}] - (\mathbb{E}[V])^{2} = \text{Var}(V)

v^{\top} H v \geq 0 \quad \forall v \in \mathbb{R}^{C}

0 \leq \lambda(H) \leq \max_{i} \{ 2 p_{i} (1 - p_{i}) \} \leq \frac{1}{2},

D_{ii} = p_{i} - 2 - 2(1 - p_{i}) \log p_{i}.

\lambda(\nabla_{h} g) \geq 1 + \lambda_{\min}(D) \cdot \lambda_{\min}(H).

\lambda(\nabla_{h} g) \geq 1 - 2 \cdot \lambda(H) \geq 1 - 2 \cdot \frac{1}{2} = 0.

\min_{h} \left\{ f(h) - y^{\top} g(h) \right\} = \min_{z = g(h)} \left\{ f \circ g^{-1}(z) - y^{\top} z \right\} = f^{*}(y).

Code & Models

Datasets

hzcar/duo
dataset· 42 dl
Taxonomy

Topics: Industrial Vision Systems and Defect Detection

Full text

Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts

Zixuan Hu¹, Dongxiao Li¹, Xinzhu Ma³, Shixiang Tang³, Xiaotong Li¹, Wenhan Yang², Ling-Yu Duan¹,²

¹ School of Computer Science, Peking University, Beijing, China

² Peng Cheng Laboratory, Shenzhen, China

³ The Chinese University of Hong Kong, Hong Kong, China

{hzxuan, lingyu}@pku.edu.cn Corresponding Author.

Abstract

Accurate monocular 3D object detection (M3OD) is pivotal for safety-critical applications like autonomous driving, yet its reliability deteriorates significantly under real-world domain shifts caused by environmental or sensor variations. To address these shifts, Test-Time Adaptation (TTA) methods have emerged, enabling models to adapt to target distributions during inference. While prior TTA approaches recognize the positive correlation between low uncertainty and high generalization ability, they fail to address the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). To bridge this gap, we propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. Through a convex optimization lens, we introduce an innovative convex structure of the focal loss and further derive a novel unsupervised version, enabling label-agnostic uncertainty weighting and balanced learning for high-uncertainty objects. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues, reducing uncertainty from the unstable 3D representation. This dual-branch mechanism forms a complementary loop: enhanced spatial perception improves semantic classification, and robust semantic predictions further refine spatial understanding. Extensive experiments demonstrate the superiority of DUO over existing methods across various datasets and domain shift types. The source code is available at https://github.com/hzcar/DUO.

1 Introduction

Monocular 3D object detection (M3OD) [33, 1, 34] serves as a fundamental perception task, enabling agents to understand 3D scenes directly from 2D images, as shown in Fig. 1(a). Due to its low cost and simple hardware configuration, M3OD has attracted widespread attention, leading to the development of numerous detectors [7, 52, 31]. However, when facing adverse weather conditions or sensor failures in real-world deployments, well-trained models often suffer severe performance degradation under data shifts [2, 14, 49]. Therefore, it is crucial to address the out-of-distribution (OOD) generalization problem for M3OD.

To address distribution shifts with minimal overhead, Test-Time Adaptation (TTA) has emerged as a critical paradigm, enabling source models to adapt to target distributions via online updates [19, 17, 23]. The dominant strategy in this context is to minimize prediction entropy, thereby reducing the uncertainty of the model on shifted data [46, 17]. Despite the promising results, existing TTA approaches largely overlook the dual uncertainty inherent in 3D detection: semantic uncertainty (related to class predictions) [50, 42] and geometric uncertainty (related to spatial location) [30, 39]. This dual uncertainty is a significant difference from conventional 2D tasks.

To investigate these two types of uncertainty under test-time shifts, we analyze detection outcomes for objects subject to common real-world variations. As shown in Fig. 1 (b)&(c), our empirical study reveals that both uncertainties increase markedly with data shifts, creating compounded error accumulation in 3D detection. Moreover, we find that existing uncertainty optimization techniques exhibit significant limitations:

Low-score object neglect. Entropy minimization fails to provide effective supervision for challenging objects with low detection scores, resulting in inevitable omissions.
Spatial perception collapse. Direct minimization of depth uncertainty can cause model collapse, compromising the perception capacity of spatial attributes.

In this work, to overcome the above limitations, we propose Dual Uncertainty Optimization (DUO), the first TTA framework for joint semantic-geometric uncertainty minimization. Specifically, through the lens of convex optimization theory [3, 45], we present a Legendre–Fenchel structure of the focal loss [25] and recast semantic uncertainty minimization as a dual optimization problem. Building on this foundation, we apply higher-order approximation analysis to derive a novel Conjugate Focal Loss. This loss breaks the label-dependence barrier in the original objective and introduces a dynamic weighting mechanism for balanced training, effectively reducing the omission of objects with low scores. In parallel, we introduce a normal field constraint that enforces local consistency of surface normals in regions with high semantic confidence. This spatial coherence clarifies geometric cues, thereby reducing geometric uncertainty. Together, this dual-branch design creates a complementary loop where semantically confident regions bootstrap geometric feature learning, while enhanced spatial perception guides semantic refinement.

We evaluate the effectiveness and generalizability of our method through experiments on the KITTI dataset [11] with 13 corruption shift types, achieving state-of-the-art results with average improvements of +2.2 ${\rm AP}_{{\rm 3D}|{\rm R}_{40}}$ in the Car category. Furthermore, we showcase its superior performance in addressing real-world shift scenarios (daytime $\leftrightarrow$ night, sunny $\leftrightarrow$ rainy) of the nuScenes dataset [5], yielding an average gain of +18% compared to existing methods.

Contributions:

To the best of our knowledge, we pioneer dual uncertainty optimization in M3OD by establishing the first TTA framework that jointly minimizes semantic and geometric uncertainties, addressing a critical reliability gap in real-world deployments.
Through the lens of convex optimization theory, we derive a novel Conjugate Focal Loss that enables label-agnostic uncertainty weighting and balanced learning for low-score objects. This approach is inherently compatible with the source phase, requiring no additional hyperparameter tuning.
We introduce a normal field constraint that enforces the stability of geometric representation with semantic guidance, resolving ambiguous spatial predictions.
We analyze and verify that our dual-branch design creates a complementary loop of two types of uncertainty optimization, resulting in significantly improved performance over existing TTA methods.

2 Related Work

Monocular 3D Object Detection (M3OD) aims to perceive 3D objects from a single 2D image. Existing methods in M3OD can be broadly categorized based on their use of extra data sources, such as CAD models [28], dense depth maps [8, 47], or LiDAR [41, 18]. In this paper, we focus exclusively on approaches that utilize only monocular images, due to their computational efficiency and lower deployment costs. Previous studies, such as MonoDLE [32] and PGD [48], have identified depth estimation as a critical bottleneck in M3OD. To address this, many works leverage multiple geometric cues to integrate diverse depth predictions. For example, MonoFlex [55] integrates depth prediction by combining direct regression with multi-keypoint estimation; MonoGround [40] incorporates the ground plane as prior information; and MonoCD [53] exploits the complementary properties of multi-head estimation. In this paper, we investigate how to enhance the detection performance of such detectors under test-time shifts.

Test-Time Adaptation (TTA) aims to enhance model performance on out-of-distribution samples during inference. Depending on whether the source training process is modified, existing TTA methods can be mainly divided into two groups: Test-Time Training [44, 13, 27] and Fully Test-Time Adaptation [4, 56, 16]. In this paper, we focus on Fully TTA, which adapts models without source data. Prior studies have demonstrated that reducing prediction uncertainty is an effective strategy for improving model generalization in various tasks [26, 10, 21, 51]. These works have developed various strategies to model and optimize uncertainty. For instance, SAR [37] uses entropy as a measure of classification uncertainty along with sharpness-aware optimization. DeYO [22] incorporates entropy with a disentangled factor for uncertainty modeling. ReCAP [17] models regional uncertainty, enforcing implicit data scaling for uncertainty optimization. MonoTTA [24] optimizes positive and negative class uncertainties separately. However, unlike conventional 2D tasks, M3OD inherently exhibits both semantic and geometric uncertainties, which remain largely unexplored and lack an effective training paradigm. In this work, we explore the joint optimization of this dual uncertainty, enhancing the robustness of M3OD models.

3 Preliminary

Task Definition. M3OD aims to predict the 3D and semantic attributes of objects from a single RGB image. Given an input image $\text{I}\in\mathbb{R}^{H\times W\times 3}$ , the goal is to generate accurate 3D bounding boxes $\{\mathcal{B}_{i}\}_{i=1}^{N}$ and semantic labels $\{\mathcal{C}_{i}\}_{i=1}^{N}$ for objects present in the scene, where $N$ denotes the number of objects. Each bounding box is typically parameterized as $\mathcal{B}_{i}=(\mathbf{P}_{i},\,\mathbf{D}_{i},\,\mathbf{O}_{i})$ , where $\mathbf{P}_{i}\in\mathbb{R}^{3}$ denotes the 3D center position, $\mathbf{D}_{i}\in\mathbb{R}^{3}$ represents the shape dimensions, and $\mathbf{O}_{i}\in[-\pi,\pi]$ encodes the orientation. The multi-task nature of M3OD, which involves simultaneous estimation of both geometric and semantic attributes, poses significant challenges for achieving precise and coherent predictions.

Meta-framework. As shown in Fig. 3(a), M3OD models typically employ a backbone network connected to multiple branches that predict various properties, e.g., score heatmaps and depth maps. Since depth estimation is widely recognized as a key bottleneck [32, 48], many approaches employ a multi-head depth estimator to improve prediction accuracy. This estimator comprises a direct regression depth head and multiple geometric depth heads, each providing an individual depth prediction along with an associated uncertainty value. These outputs are then fused via an uncertainty-weighted average to produce the final depth prediction. We use the average uncertainty across all heads as our depth uncertainty metric. A detailed explanation is provided in Appendix D.2.
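The fusion step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes each head is weighted by the inverse of its predicted uncertainty (the text only states that the average is uncertainty-weighted), and the names `fuse_depths`, `depths`, and `sigmas` are hypothetical.

```python
def fuse_depths(depths, sigmas):
    """Fuse per-head depth predictions and report the mean head uncertainty.

    depths: per-head depth predictions for one object.
    sigmas: per-head positive uncertainty values, same length as depths.
    Assumption: each head's weight is 1 / sigma (inverse-uncertainty weighting).
    """
    if not depths or len(depths) != len(sigmas):
        raise ValueError("depths and sigmas must be non-empty and equal in length")
    if any(s <= 0 for s in sigmas):
        raise ValueError("uncertainties must be positive")

    weights = [1.0 / s for s in sigmas]
    fused = sum(w * d for w, d in zip(weights, depths)) / sum(weights)

    # Depth uncertainty metric used in the text: the average over all heads.
    mean_uncertainty = sum(sigmas) / len(sigmas)
    return fused, mean_uncertainty


fused, unc = fuse_depths([10.0, 20.0], [1.0, 3.0])
print(fused, unc)
```

With two heads predicting 10 and 20 under uncertainties 1 and 3, the fused depth is pulled toward the more certain head (12.5), and the depth uncertainty metric is the mean of the two uncertainties (2.0).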

TTA Setting. TTA addresses the challenge of distribution shifts by enabling a pre-trained model to adapt to the target distribution during inference, without the need for labeled data. Unlike traditional training on fixed datasets, TTA operates in an online manner, where the model $h_{\theta}$ with parameter set $\theta$ produces detection outputs while concurrently updating its parameters based on incoming test data:

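$$\{\mathcal{B}_{t}\}, \{\mathcal{C}_{t}\} \leftarrow h_{\theta}(\text{I}_{t}), \quad \theta \leftarrow \theta - \nabla \mathcal{L}_{tta}(\text{I}_{t}), \tag{1}$$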

where $\text{I}_{t}$ denotes the incoming test data and $\mathcal{L}_{tta}$ denotes the loss function used for self-training during adaptation.
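The online setting amounts to a predict-then-update loop over the test stream. The sketch below is a toy, framework-free illustration under stated assumptions: `predict` and `grad_loss` are hypothetical callables standing in for $h_{\theta}$ and $\nabla\mathcal{L}_{tta}$, parameters are a plain list of floats, and a step size `lr` is added for stability (`lr=1.0` corresponds to the update as written).

```python
def online_tta(predict, grad_loss, theta, stream, lr=0.1):
    """Run test-time adaptation over a stream of test inputs.

    predict(theta, x)   -> model output for x under parameters theta.
    grad_loss(theta, x) -> gradient of the unsupervised TTA loss w.r.t. theta.
    Each input is predicted with the current parameters, then used for one
    gradient step, so every test sample is seen exactly once.
    """
    outputs = []
    for x in stream:
        outputs.append(predict(theta, x))
        grad = grad_loss(theta, x)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return outputs, theta


# Toy stand-ins (hypothetical): a one-parameter model and a quadratic loss
# 0.5 * (theta - x)^2, whose gradient is theta - x.
toy_predict = lambda theta, x: theta[0] * x
toy_grad = lambda theta, x: [theta[0] - x]

outputs, theta = online_tta(toy_predict, toy_grad, [0.0], [1.0, 1.0], lr=0.5)
print(outputs, theta)
```

The key property is that the prediction for $\text{I}_{t}$ is produced before the parameters are updated on $\text{I}_{t}$, so no test sample benefits from its own gradient step.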

4 Uncovering Dual-Uncertainty under Shifts

Since TENT [46] revealed the positive correlation between prediction uncertainty and generalization error under distribution shifts, numerous TTA methods have emerged to optimize uncertainty metrics. While previous works have developed diverse measures for conventional 2D tasks [37, 22], the compound semantic and geometric uncertainties in 3D perception remain critically unexplored in the TTA field. In this work, we investigate both types of uncertainty within the context of M3OD to understand their distinct roles.

Specifically, to quantify the variation of uncertainty under distribution shifts, we track two metrics: semantic prediction entropy and geometric depth uncertainty (average uncertainty of multi-head depth estimator). As shown in Fig. 2(a), both metrics demonstrate a consistent upward trend as distribution shifts intensify, indicating that more severe shifts lead to higher model uncertainty. Furthermore, we analyze the independent optimization effects of two uncertainties and make the following key observations:

Observation 1: Conventional entropy minimization exacerbates the imbalanced distribution of detection scores. Unlike classification scenarios, object detection suffers from extreme foreground-background imbalance, which hinders the effective optimization of hard positive objects [25, 6]. As shown in Fig. 2(b)&(c), entropy minimization yields a marginal gain for low-score objects while significantly boosting high-score predictions, further exacerbating the imbalance and leading to omissions of low-score objects.

Observation 2: Directly minimizing depth uncertainty causes the multi-head depth estimator to collapse. To minimize depth uncertainty, we apply the uncertainty regression loss across multiple depth heads (detailed in Appendix D.2). As shown in Fig. 2(d), the regression head, which lacks any geometric constraints, exhibits a significantly faster decline in uncertainty than the keypoint heads. This rapid convergence reduces the multi-head system to a single deterministic head, undermining the model's ability to perform robust spatial understanding.
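As a concrete reference for the semantic metric tracked in this section, prediction entropy can be computed from class logits as in the sketch below. It assumes a softmax over classes, as in Sec. 5.1; the function names are illustrative and not taken from the released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over classes."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A flat prediction is maximally uncertain; a peaked one is not.
print(prediction_entropy([0.0, 0.0, 0.0]))
print(prediction_entropy([10.0, 0.0, 0.0]))
```

A uniform prediction over three classes attains the maximum entropy $\log 3$, while a sharply peaked prediction has entropy close to zero.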

5 Methodology

Based on the above observations, we propose Dual Uncertainty Optimization (DUO), a TTA method for M3OD built on two novel designs, a conjugate optimization framework and a normal consistency constraint, which together optimize semantic and geometric uncertainties in a compatible way.

5.1 Semantic Uncertainty Conjugate Optimization

To address the imbalance problem, it is crucial to devote more attention to challenging objects with high uncertainty. A straightforward approach is to employ the focal loss [25] which increases the weight of predictions with low probability. However, focal loss cannot assign an appropriate weight without the ground-truth label and fails to adapt effectively in unsupervised settings. To provide a more flexible weighting solution, we leverage convex optimization theory [3, 45] (a classical analytical tool) to explore the focal loss via a Legendre–Fenchel structure. Building on this foundation, we perform a higher-order approximation analysis to derive a novel loss that can dynamically adjust the weighting without relying on labels, as shown in Fig. 3(b).

Legendre–Fenchel Structure. Given the source model $h_{\theta}$ pre-trained with the focal loss, we denote $h\triangleq h_{\theta}(x)$ and its corresponding loss function can be formulated as:

$$\mathcal{L}_{\text{FL}}(x, y) = -\alpha (1-p)^{\gamma} \cdot y \log p, \tag{2}$$

where $p\triangleq\texttt{softmax}(h)$ is the normalized probability over the classes, $x$ is a detected object, and $y$ is the one-hot coding of the ground-truth label. $\alpha$ and $\gamma$ are hyperparameters controlling the weighting of uncertain predictions. Motivated by [12], we reformulate it as the following structure:

[TABLE]

where $s\triangleq e^{h_{1}}+...+e^{h_{\text{c}}}$ is the sum over the exponential outputs of the model and c is the number of classes. The proofs omitted from this subsection are given in Appendix A.
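The reformulation can be checked numerically. The sketch below is illustrative rather than the authors' implementation: it assumes the standard focal loss $-\alpha(1-p_{y})^{\gamma}\log p_{y}$ on the ground-truth class, takes $f(h)=\alpha\log s$ and $g(h)=\alpha h+\alpha((1-p)^{\gamma}-1)\log p$ from the notation of Appendix A, and assumes the structure has the form $f(h)-y^{\top}g(h)$, which is the combination of $f$ and $g$ that reproduces that loss. The logits are hypothetical.

```python
import math

def softmax(h):
    """Numerically stable softmax over a list of logits."""
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(h, label, alpha=4.0, gamma=2.0):
    """Assumed standard form: -alpha * (1 - p_y)^gamma * log(p_y)."""
    p = softmax(h)
    return -alpha * (1.0 - p[label]) ** gamma * math.log(p[label])

def f(h, alpha=4.0):
    """f(h) = alpha * log(s), s = sum_j exp(h_j), via a stable log-sum-exp."""
    m = max(h)
    return alpha * (m + math.log(sum(math.exp(v - m) for v in h)))

def g(h, alpha=4.0, gamma=2.0):
    """g(h) = alpha*h + alpha*((1 - p)^gamma - 1)*log(p), elementwise."""
    p = softmax(h)
    return [alpha * hi + alpha * ((1.0 - pi) ** gamma - 1.0) * math.log(pi)
            for hi, pi in zip(h, p)]

def legendre_fenchel_form(h, label, alpha=4.0, gamma=2.0):
    """f(h) - y^T g(h); a one-hot y picks out the label-th entry of g."""
    return f(h, alpha) - g(h, alpha, gamma)[label]

logits = [2.0, -1.0, 0.5]  # hypothetical class logits for one detected object
for k in range(len(logits)):
    gap = abs(focal_loss(logits, k) - legendre_fenchel_form(logits, k))
    assert gap < 1e-9  # both forms agree for every choice of label
```

The agreement follows from $\log p_{y}=h_{y}-\log s$: the $\alpha\log s-\alpha h_{y}$ part of $f-y^{\top}g$ equals $-\alpha\log p_{y}$, which cancels the $-1$ inside the second term of $g$.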

Problem Reconstruction. Under this structure, the optimization problem can be regarded as finding an optimal representation $h$ that minimizes the empirical loss. According to the Legendre–Fenchel condition [45], the existence of the conjugate function $f^{*}$ is equivalent to the invertibility of the function $g$ . This critical condition is formally established in our work (see Appendix A), serving as a fundamental theoretical guarantee for our method. Therefore, the minimum value of the objective can be expressed as follows:

[TABLE]

Building on the common assumption that the representation $h$ pre-trained from the large source dataset is already close to a locally optimal solution $h_{0}$ [9, 35], we can convert the problem into the following relationships:

[TABLE]

Conjugate Focal Loss. By applying the chain rule of gradient computation and higher-order approximation, we can obtain the following estimation:

[TABLE]

where $\text{diag}(\cdot)$ denotes the diagonal matrix and $I$ denotes the identity matrix. This estimation $y_{0}$ ensures that the loss aligns closely with the conjugate function. Ultimately, we substitute it into Equ. 5, yielding an unsupervised approximation of the conjugate function:

[TABLE]

In contrast to the focal loss in Equ. 2, our derived Conjugate Focal Loss (CFL) adjusts the uncertainty weighting dynamically without relying on ground-truth labels, which confers the following advantages:

Static vs. Dynamic Adjustment. The vanilla focal loss uses a fixed weighting term for the ground-truth class to focus on high-uncertainty predictions. In contrast, our CFL not only incorporates $(1-p)^{\gamma}$ to address the imbalance training issue but also dynamically adjusts the weighting across all classes based on the term $(I+\gamma(1-\log p)pp^{\top}-\gamma\log p\cdot\text{diag}(p))^{-1}$ , which encodes inter-class prediction relationships.

Ground-Truth Independence. While focal loss requires labels to compute the weight for the loss term, CFL operates solely based on the prediction probability, eliminating the need for labeled data during TTA.

Compatible Hyperparameters with Source Phase. Our theoretical analysis suggests that hyperparameters $\alpha,\gamma$ should remain consistent with their values used in focal loss during the source training. This compatibility provides a practical advantage for TTA scenarios, eliminating the need for extensive hyperparameter tuning. The validity of this consistent setting is empirically verified in Appendix C.

5.2 Semantic-Guided Normal Field Constraint

Despite progress in uncertainty modeling, current methods struggle to handle geometric uncertainty in TTA scenarios due to critical limitations:

Model Collapse: Direct optimization of model-predicted uncertainty leads to a degenerate predictor, as discussed in Sec. 4.
Substantial Overhead: Geometry-aware methods typically require additional data or offline training for uncertainty estimation, limiting their real-time applicability [39, 15].

To overcome these challenges, we propose an efficient normal field constraint derived from a single image, which enhances the geometric coherence of 3D predictions, thereby reducing the uncertainty stemming from the unstable geometric representation, as shown in Fig. 3(c).

Normal Field. Given a depth map $D$ , we restore it to the original image resolution using bilinear interpolation, allowing for the alignment with the pixel grid. Then, we compute the spatial gradients of the depth map using efficient Sobel operators [20], which approximate the rate of change in the depth values along horizontal and vertical directions:

[TABLE]

where $\mathbf{S}_{x}$ and $\mathbf{S}_{y}$ denote horizontal and vertical Sobel kernels, respectively. These gradient maps capture the variation in depth across neighboring pixels. The surface normal field, which encodes the orientation of the surface at each pixel, is then derived from the gradients as:

[TABLE]

where $\mathbf{N}(u,v)$ denotes the normal orientation at pixel $(u,v)$ .
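A minimal sketch of this computation follows. It is not the authors' code: it assumes the common convention $\mathbf{N}=(-G_{x},-G_{y},1)/\sqrt{G_{x}^{2}+G_{y}^{2}+1}$ for turning depth gradients into unit normals, uses unnormalized $3\times 3$ Sobel kernels with edge-replicated borders, and omits the bilinear upsampling step. The helper names `filter3x3` and `normal_field` are hypothetical.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # horizontal gradient kernel
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # vertical gradient kernel

def filter3x3(img, kernel):
    """Cross-correlate a 2D list with a 3x3 kernel, replicating border pixels."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for v in range(rows):
        for u in range(cols):
            acc = 0.0
            for dv in (-1, 0, 1):
                for du in (-1, 0, 1):
                    vv = min(max(v + dv, 0), rows - 1)
                    uu = min(max(u + du, 0), cols - 1)
                    acc += kernel[dv + 1][du + 1] * img[vv][uu]
            out[v][u] = acc
    return out

def normal_field(depth):
    """Return a per-pixel unit normal (nx, ny, nz) from a 2D depth map."""
    gx = filter3x3(depth, SOBEL_X)
    gy = filter3x3(depth, SOBEL_Y)
    normals = []
    for row_x, row_y in zip(gx, gy):
        row = []
        for a, b in zip(row_x, row_y):
            norm = math.sqrt(a * a + b * b + 1.0)  # assumed normalization
            row.append((-a / norm, -b / norm, 1.0 / norm))
        normals.append(row)
    return normals

# Hypothetical depth map that increases linearly from left to right.
ramp = [[0.5 * u for u in range(5)] for _ in range(5)]
normals = normal_field(ramp)
```

On a constant depth map every normal reduces to $(0,0,1)$, and on the ramp above all interior normals are identical, which is the kind of locally coherent field that a consistency constraint rewards.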

Normal Consistency Loss. To quantify geometric uncertainty, we employ an edge-aware Normal Consistency Loss (NCL), which encourages smoothness in the local surface by penalizing inconsistencies between neighboring pixels:

[TABLE]

where the smoothness terms $\psi_{x}(u,v),\psi_{y}(u,v)$ enforce horizontal and vertical consistency in the surface normal field. The total normal consistency loss is then given by:

[TABLE]

where the edge-aware weighting term $\exp(-\|\nabla\text{I}(u,v)\|_{2})$ $=\exp(-\sqrt{|\mathbf{S}_{x}\ast\text{I}(u,v)|^{2}+|\mathbf{S}_{y}\ast\text{I}(u,v)|^{2}})$ preserves discontinuities at boundaries while enforcing smoothness in homogeneous regions. NCL encourages the model to learn a spatially consistent normal field, thereby reducing uncertainty stemming from unstable 3D representations.
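The edge-aware weighting can be illustrated with a short sketch. The weight $\exp(-\|\nabla\text{I}(u,v)\|_{2})$ with Sobel image gradients is taken from the text; the remaining choices are assumptions made for illustration only: $\psi_{x}$ and $\psi_{y}$ are taken as L1 differences between a pixel's normal and its right and lower neighbors, the image is grayscale, and the loss is averaged over pixels. The function names are hypothetical.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def filter3x3(img, kernel):
    """Cross-correlate a 2D list with a 3x3 kernel, replicating border pixels."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for v in range(rows):
        for u in range(cols):
            for dv in (-1, 0, 1):
                for du in (-1, 0, 1):
                    vv = min(max(v + dv, 0), rows - 1)
                    uu = min(max(u + du, 0), cols - 1)
                    out[v][u] += kernel[dv + 1][du + 1] * img[vv][uu]
    return out

def edge_weight(image):
    """exp(-||grad I||_2) per pixel: close to 1 in flat regions, near 0 at edges."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return [[math.exp(-math.sqrt(a * a + b * b)) for a, b in zip(rx, ry)]
            for rx, ry in zip(gx, gy)]

def l1(n1, n2):
    """L1 distance between two normal vectors."""
    return sum(abs(a - b) for a, b in zip(n1, n2))

def normal_consistency_loss(normals, image):
    """Edge-weighted mean of horizontal and vertical normal differences."""
    w = edge_weight(image)
    rows, cols = len(normals), len(normals[0])
    total = 0.0
    for v in range(rows):
        for u in range(cols):
            psi_x = l1(normals[v][u + 1], normals[v][u]) if u + 1 < cols else 0.0
            psi_y = l1(normals[v + 1][u], normals[v][u]) if v + 1 < rows else 0.0
            total += w[v][u] * (psi_x + psi_y)
    return total / (rows * cols)
```

Under these assumptions, a jump in the normal field is penalized heavily when the image is locally uniform and is almost ignored when it coincides with a strong image edge, which matches the stated goal of preserving discontinuities at boundaries.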

Semantic Guidance. To enforce geometric–semantic coherence, we generate masks by integrating 2D bounding boxes with semantic predictions, ensuring synchronized focus regions for dual-branch uncertainty optimization. Let $\{\mathcal{B}_{i}\}_{i=1}^{n}$ denote detected bounding boxes with scores $\{s_{i}\}_{i=1}^{n}$ and semantic uncertainties $\{U_{i}\}_{i=1}^{n}$ derived from the Conjugate Focal Loss ( $\mathcal{L}_{\text{CFL}}$ in Equ. 7). To ensure reliable supervision, we select boxes with low semantic uncertainty using an exponential moving average threshold:

[TABLE]

where $\beta\in[0,1]$ is a moving average factor (set to 0.1 by default). We then construct the region mask as follows:

[TABLE]

where $\mathbb{I}_{\text{inside}}(u,v\mid\mathcal{B}_{i})$ is an indicator function that returns 1 if pixel $(u,v)$ is inside $\mathcal{B}_{i}$ , and 0 otherwise. This semantic-guided mask ensures that only low semantic-uncertainty regions contribute to the normal field constraint, enhancing the reliability of the geometric constraint.
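The selection and masking steps can be sketched as follows. Several details are assumptions, not taken from the paper: the threshold update is written as $\tau_{t}=(1-\beta)\tau_{t-1}+\beta\cdot\text{mean}(U)$, a box is kept when its uncertainty is strictly below the threshold, boxes are given as inclusive integer pixel corners `(u1, v1, u2, v2)`, and the detection scores $s_{i}$ are not used. The function names are hypothetical.

```python
def update_threshold(tau_prev, uncertainties, beta=0.1):
    """Assumed EMA update: tau_t = (1 - beta) * tau_prev + beta * mean(U)."""
    if not uncertainties:
        return tau_prev  # nothing detected: keep the previous threshold
    mean_u = sum(uncertainties) / len(uncertainties)
    return (1.0 - beta) * tau_prev + beta * mean_u

def region_mask(height, width, boxes, uncertainties, tau):
    """Binary mask that is 1 inside any box whose uncertainty is below tau."""
    mask = [[0] * width for _ in range(height)]
    for (u1, v1, u2, v2), unc in zip(boxes, uncertainties):
        if unc >= tau:
            continue  # skip boxes with high semantic uncertainty
        for v in range(max(v1, 0), min(v2, height - 1) + 1):
            for u in range(max(u1, 0), min(u2, width - 1) + 1):
                mask[v][u] = 1
    return mask

# Hypothetical detections on a 4x4 image: one confident box, one uncertain box.
boxes = [(0, 0, 1, 1), (2, 2, 3, 3)]
uncertainties = [0.2, 2.0]
tau = update_threshold(1.0, [0.2, 0.8])
mask = region_mask(4, 4, boxes, uncertainties, tau)
```

In this toy case only the first box passes the threshold, so the normal field constraint would be applied to its four pixels and nowhere else.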

5.3 Overall Objective

The overall objective integrates the conjugate focal loss with the normal consistency constraint into a unified framework, simultaneously addressing semantic and geometric uncertainties. This dual optimization enables a complementary feedback loop: low-uncertainty spatial location enhances the model’s ability to perform precise semantic classification, while confident semantic predictions in turn improve spatial understanding. The procedure is as follows:

[TABLE]

where $x$ iterates over all detected objects and $(u,v)$ spans all pixels in the image I. $\mathcal{L}_{\text{CFL}}$ , $\mathcal{L}_{\text{NCL}}$ , and $\mathcal{M}$ are defined in Equ. 7, Equ. 11, and Equ. 13. $\lambda=0.7$ by default.

6 Experiments

6.1 Experimental Setup

For a fair comparison, we follow the same evaluation pipeline as prior work [24], including baseline models, training recipes, and evaluation protocols.

Datasets. We conduct experiments on the KITTI [11] and nuScenes [5] datasets. For KITTI, we use the KITTI-C version, which includes 13 distinct data corruption types [14] and five severity levels per type. Results represent the average performance across three difficulty levels, i.e., Easy, Moderate, and Hard. Results for other severity levels of KITTI-C are provided in Appendix B.1. For nuScenes, we adopt the front-view images and construct the Daytime, Night, Sunny, and Rainy scenarios via their scene descriptions following [29]. Based on these scenes, we define 4 real-world adaptation tasks, i.e., Daytime $\leftrightarrow$ Night and Sunny $\leftrightarrow$ Rainy. Following MonoTTA [24], we convert the nuScenes dataset into the KITTI format. More details are provided in Appendix D.3.

Compared Methods. Based on two representative base models MonoFlex [55] and MonoGround [40], we compare our DUO with several state-of-the-art methods: EATA [36] identifies reliable samples during entropy training. DeYO [22] leverages probability variations under augmentations as an additional cue to enhance entropy optimization. MonoTTA [24] boosts probabilities for high-score classes while applying negative learning to low-score classes.

Implementation Details. We implement our method and other baselines in PyTorch [38]. We employ the Stochastic Gradient Descent (SGD) optimizer with the same learning rate as MonoTTA, a momentum of 0.9, and a batch size of 16 for KITTI, 4 for nuScenes. Parameters $\lambda$ , $\alpha$ , $\gamma$ are assigned default values of 0.7, 4, and 2, respectively.

Evaluation Protocols. We report experimental results using the Average Precision for 3D bounding boxes, denoted as $\text{AP}_{3D|R_{40}}$ [43]. Results represent the mean values across three difficulty levels, with Intersection over Union thresholds set to 0.5 for Cars and 0.25 for both Ped. and Cyc.

6.2 Main Results

Evaluation on Corruption Shifts. We first compare our DUO with previous methods on the KITTI-C dataset at severity level 5. Due to space constraints, detailed comparisons with other severity levels are provided in Appendix B. The results, reported in Tab. 1 & 2, reveal several observations: 1) Under test-time distribution shifts, pre-trained detectors experience significant performance degradation across all categories. 2) Existing TTA methods partially mitigate the adverse effects of distribution shifts in M3OD, but their performance remains suboptimal as they cannot address the dual uncertainty. 3) Our method consistently outperforms compared approaches across all categories and base models. Notably, the proposed DUO achieves the best or comparable performance under 13 types of corruptions, with performance gains of +2.1 and +2.2 ${\rm AP}_{{\rm 3D}|{\rm R}_{40}}$ in the Car category for two different models. These observations validate the crucial role of our adaptation framework in addressing dual uncertainty in M3OD models and enhancing robustness against distribution shifts.

Evaluation on Real-World Scenario. We further evaluate different methods on four real-world scenarios, as shown in Tab. 3. The experimental results yield the following observations:

Under real-world shifts, the pre-trained model still suffers severe performance degradation.
For Sunny $\leftrightarrow$ Rainy adaptation tasks on MonoFlex, existing methods achieve only marginal improvements, whereas our method boosts performance by +4.7 ${\rm AP}_{{\rm 3D}|{\rm R}_{40}}$ .
Our DUO brings significant performance improvement on two base models, maintaining the best performance across all four scenarios, further demonstrating its effectiveness and superiority.

6.3 Ablation Study

In this section, for brevity and without loss of generality, we conduct ablation studies on MonoFlex under the Gaussian noise shift in KITTI-C. Focusing on two pivotal components of DUO, Conjugate Focal Loss and Normal Field Constraint, we perform extensive experiments to analyze their independent roles and complementary effects, gaining insights into key factors contributing to their effectiveness.

Effectiveness of Components. We investigate the impact of individual components by comparing the full method with variations that omit key parts. As shown in Tab. 4, incorporating either the conjugate focal loss or the normal field constraint significantly enhances detection performance, yielding average gains of $+8.9$ and $+7.6$ ${\rm AP}_{{\rm 3D}|{\rm R}_{40}}$ , respectively. Notably, using the normal consistency loss ( $\mathcal{L}_{\text{NCL}}$ ) alone results in unstable, marginal improvements; its effectiveness depends on being combined with $\mathcal{M}$ , highlighting the necessity of semantic guidance for effective geometric uncertainty reduction. These findings underscore the effectiveness of each component.

Furthermore, a comparison of score distributions in Fig. 4(a) and Fig. 2(c) shows that the conjugate focal loss consistently boosts detection scores, ensuring effective training for challenging low-score objects. Similarly, Fig. 4(b) versus Fig. 2(d) demonstrates that our normal field constraint consistently reduces geometric uncertainty across all heads, preventing the model collapse observed in baseline methods. Therefore, these innovations effectively overcome the limitations of existing methods identified in Sec. 4, leading to more robust and reliable detection.

Complementary Effects. As shown in Tab. 4, our complete method, which integrates all components, consistently achieves the best detection performance, demonstrating the compatibility of different components. To further investigate the complementary effects of our dual-branch design, we visualize the dual-uncertainty optimization process. Specifically, comparing Fig. 5(a)&(b), we observe that applying the conjugate focal loss significantly reduces semantic uncertainty, while interestingly, geometric uncertainty also shows a modest decline. A similar phenomenon is evident when employing the normal field constraint in Fig. 5(c). This synchronous behavior suggests an intrinsic interdependence between the two types of uncertainty, validating the potential of joint optimization.

Moreover, by combining the two innovations, the full DUO framework achieves a faster and more pronounced decrease in both uncertainties than either component alone, as shown in Fig. 5(d). These observations further validate that our dual-branch architecture creates a complementary loop for dual-uncertainty optimization.

Robustness and Efficiency. To validate the robustness and efficiency of our method, we extend our analysis by examining the sensitivity of key hyper-parameters and comparing running times in Appendix C.

6.4 Qualitative Results

Based on the qualitative results shown in Fig. 6, our DUO framework produces predictions that are more precisely aligned with the ground-truth annotations. Compared with the latest SOTA method, DUO not only significantly improves the accuracy of 3D location estimation but also reduces missed detections for challenging distant or small-scale objects, which are typically prone to large geometric errors. These visualizations further confirm that our dual uncertainty optimization effectively adapts the source model, achieving precise and reliable 3D perception that closely matches the ground truth.

7 Conclusion and Future Work

In this paper, we propose a synergistic dual-branch framework designed to address semantic-geometric uncertainty inherent in 3D vision systems under test-time domain shifts. Our approach derives a novel conjugate loss to offer an adaptive, label-free weighting mechanism for balanced training on semantic uncertainty. Meanwhile, it incorporates a normal consistency constraint to reduce uncertainty from inconsistent geometric representation. Our extensive experiments demonstrate that the proposed dual-branch optimization creates a complementary loop, consistently improving performance across diverse domain shifts. In future work, we intend to extend our framework beyond M3OD to cover a broader range of 3D vision tasks. We hope that our work will deepen the understanding of uncertainty-aware model adaptation while providing transferable insights for related fields, such as unsupervised learning.

Acknowledgements

This work was supported by the Program of Beijing Municipal Science and Technology Commission Foundation (No.Z241100003524010), in part by the National Natural Science Foundation of China under Grant 62088102, in part by AI Joint Lab of Future Urban Infrastructure sponsored by Fuzhou Chengtou New Infrastructure Group and Boyun Vision Co. Ltd, and in part by the PKU-NTU Joint Research Institute (JRI) sponsored by a donation from the Ng Teng Fong Charitable Foundation.

Appendix A Theoretical Proof

Below, we provide detailed proofs of the theoretical results presented in Sec. 5.1 of the main paper.

Notation. First, we recall the notation used in the main paper and in this appendix: $x$ denotes an input test image and $y$ denotes the one-hot coding of the ground-truth label. $h_{\theta}$ denotes the model with its parameter set $\theta$ and $h\triangleq h_{\theta}(x)$ . $s\triangleq e^{h_{1}}+...+e^{h_{\text{c}}}$ is the sum over the exponential outputs of the model and c is the number of classes. $p\triangleq\texttt{softmax}(h)$ is the normalized probability over the classes. $\text{diag}(\cdot)$ denotes the diagonal matrix and $I$ denotes the identity matrix. We define the following two functions: $f(h)=\alpha\log s$ , $g(h)=\alpha h+\alpha((1-p)^{\gamma}-1)\log p$ .

A.1 Legendre-Fenchel Structure

In this subsection, we demonstrate the equivalence of the vanilla focal loss with its Legendre-Fenchel structure.

[TABLE]

Since $y$ is a one-hot vector, we have $y\log s=\log s$ and we further derive:

[TABLE]

Therefore, the focal loss can be formulated as a classical convex conjugate structure, allowing for further analysis in the conjugate optimization framework.

A.2 Problem Reconstruction

In this section, we demonstrate the invertibility of function $g$ to ensure the existence of a conjugate function and further reformulate the optimization problem into conjugate relationships. By the inverse function theorem, the local invertibility of $g$ is guaranteed if its Jacobian is non-singular. For simplicity, we demonstrate the positive definiteness of the Jacobian under the default setting $\gamma=2$ :

Step 1: Jacobian of $g$ . Taking the gradient of $g$ with respect to $h$ , we obtain:

[TABLE]

where $D=\text{diag}(p-2-2(1-p)\log p)$ and $H=\text{diag}(p)-pp^{\top}$ .

Step 2: Positive Semidefiniteness of $H$ . For any vector $\mathbf{v}\in\mathbb{R}^{\text{c}}$ , the quadratic form of $H$ is:

[TABLE]

Let $V$ be a random variable that takes the value $v_{i}$ with probability $p_{i}$ . The quadratic form simplifies to:

[TABLE]

Since variance is always non-negative, we have:

[TABLE]

Thus, $H$ is positive semi-definite. Furthermore, using Gershgorin’s circle theorem, all eigenvalues of $H$ satisfy:

[TABLE]

where $2p_{i}\left(1-p_{i}\right)$ attains its maximum $\frac{1}{2}$ when $p_{i}=\frac{1}{2}$ .
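Step 2 can be verified numerically with a short sketch. It uses $H=\text{diag}(p)-pp^{\top}$ as defined in Step 1 and checks that $\mathbf{v}^{\top}H\mathbf{v}$ equals the variance of a random variable taking value $v_{i}$ with probability $p_{i}$. The Gershgorin bound is computed from the fact that row $i$ of $H$ has diagonal entry $p_{i}(1-p_{i})$ and absolute off-diagonal sum $p_{i}(1-p_{i})$. The probability vector and test vectors are hypothetical.

```python
def quadratic_form_H(p, v):
    """v^T (diag(p) - p p^T) v, expanded without building the matrix."""
    second_moment = sum(pi * vi * vi for pi, vi in zip(p, v))
    first_moment = sum(pi * vi for pi, vi in zip(p, v))
    return second_moment - first_moment ** 2

def variance(p, v):
    """Variance of V, where V = v_i with probability p_i."""
    mean = sum(pi * vi for pi, vi in zip(p, v))
    return sum(pi * (vi - mean) ** 2 for pi, vi in zip(p, v))

def gershgorin_upper_bound(p):
    """Largest disc edge: H_ii + sum_{j != i} |H_ij| = 2 * p_i * (1 - p_i)."""
    return max(2.0 * pi * (1.0 - pi) for pi in p)

p = [0.1, 0.2, 0.3, 0.4]  # hypothetical softmax output
for v in ([1.0, -2.0, 0.5, 3.0], [0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]):
    q = quadratic_form_H(p, v)
    assert abs(q - variance(p, v)) < 1e-12  # quadratic form is a variance
    assert q >= -1e-12                      # hence non-negative

assert gershgorin_upper_bound(p) <= 0.5
```

The all-ones vector gives a quadratic form of zero, which reflects that $H$ is only semi-definite: it is singular along the direction in which all logits shift together.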

Step 3: Bounding the Eigenvalues of $\nabla_{h}g$ . Since $D$ is diagonal, the matrix $D\cdot H$ can be seen as a row-scaled version of $H$ . The entries of $H$ satisfy:

[TABLE]

Numerical analysis of the function $\phi(p_{i})=p_{i}-2-2\left(1-p_{i}\right)\log p_{i}$ shows $D_{ii}>-2$ for $p_{i}\in\left(0,1\right)$ . Therefore, the eigenvalues of $\nabla_{h}g$ satisfy:

[TABLE]

Since $\lambda_{\min}(H)\geq 0$ and $D_{ii}>-2$ , we can obtain a tighter bound:

[TABLE]
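The numerical claim about $\phi$ used in Step 3 is easy to reproduce. The sketch below evaluates $\phi(p)=p-2-2(1-p)\log p$ on a uniform grid over $(0,1)$ and records its minimum. The grid resolution is an arbitrary choice, and a grid search is a sanity check rather than a proof.

```python
import math

def phi(p):
    """Diagonal entry of D for gamma = 2: p - 2 - 2 * (1 - p) * log(p)."""
    return p - 2.0 - 2.0 * (1.0 - p) * math.log(p)

# Uniform grid over the open interval (0, 1); 9999 interior points.
grid = [i / 10000.0 for i in range(1, 10000)]
phi_min = min(phi(p) for p in grid)
p_at_min = min(grid, key=phi)

assert phi_min > -2.0  # consistent with D_ii > -2 on (0, 1)
```

On this grid the minimum is attained in the interior, near $p\approx 0.79$ with a value of roughly $-1.11$, so the bound $D_{ii}>-2$ holds with a comfortable margin. Note that $\phi$ grows without bound as $p\to 0$ and tends to $-1$ as $p\to 1$.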

Step 4: Conclusion. As all eigenvalues of $\nabla_{h}g$ are non-negative, the Jacobian is positive semidefinite. In particular, as long as $p$ is not degenerate (i.e., no prediction has probability exactly 0 or 1), $H$ is full rank on its subspace, ensuring that $\nabla_{h}g$ is non-singular in a neighborhood of $h$ . Thus, by the inverse function theorem, $g$ is locally invertible.

This invertibility guarantees the existence of a conjugate function $f^{*}$ , which equals the minimum value of the objective:

[TABLE]

Under the common assumption that the representation $h$ pre-trained from the large source dataset is already close to a locally optimal solution $h_{0}$ , we can convert the problem into the following conjugate relationships:

[TABLE]

A.3 Conjugate Focal Loss

In this subsection, we need to derive the estimation of $y$ from the conjugate conditions in Equ. 26. For the derivative of $g$ , we differentiate the two summation components separately:

[TABLE]

For the first part, we have $\nabla_{h}h=I$ . For the second part, we utilize the chain rule to derive the following format:

[TABLE]

We calculate the $\nabla_{p}\phi$ as follows:

[TABLE]

We calculate the $\nabla_{h}p$ as follows:

[TABLE]
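The softmax Jacobian can be checked against finite differences. The sketch assumes the standard result $\nabla_{h}p=\text{diag}(p)-pp^{\top}$, which matches the matrix $H$ defined in Appendix A.2, and compares it with a central-difference approximation. The logits and step size are hypothetical choices.

```python
import math

def softmax(h):
    """Numerically stable softmax over a list of logits."""
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(h):
    """Analytic Jacobian: J[i][j] = p_i * (delta_ij - p_j)."""
    p = softmax(h)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def numerical_jacobian(h, eps=1e-6):
    """Central differences: J[i][j] ~ (p_i(h + eps e_j) - p_i(h - eps e_j)) / (2 eps)."""
    n = len(h)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus = list(h)
        minus = list(h)
        plus[j] += eps
        minus[j] -= eps
        p_plus, p_minus = softmax(plus), softmax(minus)
        for i in range(n):
            jac[i][j] = (p_plus[i] - p_minus[i]) / (2.0 * eps)
    return jac

logits = [0.3, -1.2, 2.0]  # hypothetical logits
analytic = softmax_jacobian(logits)
numeric = numerical_jacobian(logits)
max_gap = max(abs(a - b) for ra, rb in zip(analytic, numeric) for a, b in zip(ra, rb))
assert max_gap < 1e-6
```

Each row and column of this Jacobian sums to zero, which is the same degeneracy noted for $H$ in Step 2 of Appendix A.2.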

Thus, the Jacobian of $g$ is given by

[TABLE]

For the composite function $f\circ g^{-1}$ , the chain rule gives

[TABLE]

Evaluating at $h_{0}$ (with $p=\operatorname{softmax}(h_{0})$ ) leads to

[TABLE]

We then use Taylor’s formula to approximate the value (neglecting higher-order terms of $p$ ):

[TABLE]

After algebraic manipulation (and neglecting higher-order terms in $p$ and $\log p$ ), we obtain

[TABLE]

Thus, by applying the chain rule and approximating the Jacobian of $g$ to higher order, we obtain

[TABLE]

Finally, we substitute this estimation into Equ. 26, yielding Conjugate Focal Loss:

[TABLE]

Appendix B Further Experiments

In this section, we broaden our investigation by evaluating our method across a variety of shift severity levels. To this end, we conduct experiments on corruption scenarios at severity levels 1, 3, and 5. These experiments allow us to thoroughly examine the robustness of our approach under different severity levels of distribution shifts.

B.1 Different Severity Level Corruption

We further discuss the lower-severity data corruptions (i.e., levels 3 and 1) based on the experimental results shown in Tab. 5&6, which yield the following additional observations:

As the severity level increases, the source models suffer a larger performance decline across the various corruptions. For instance, the pre-trained MonoFlex and MonoGround models show a clear performance drop from level 1 to level 3, which significantly heightens the challenge for test-time adaptation.
Existing TTA methods struggle to recover performance under such extreme corruptions, highlighting the limitations of conventional uncertainty optimization approaches.
Despite these challenges, our DUO framework consistently achieves the best average performance across all corruption types. This robust performance demonstrates that our dual uncertainty optimization framework effectively stabilizes both the semantic classification and spatial perception branches, providing reliable adaptation even under different shift levels.

Appendix C Additional Ablation Study

In Sec. 6.3 of the main paper, we provide a comprehensive analysis of each component’s effectiveness and their complementary interactions. In this section, we extend our analysis by examining the sensitivity of key parameters and comparing running times, offering further insights into the robustness and efficiency of our method.

C.1 Hyperparameter Robustness

Our method involves two key hyperparameters: the coefficient $\lambda$ , which determines the trade-off of the semantic and geometric uncertainty optimization, and the coefficient $\alpha$ , which controls the weighting scale in conjugate focal loss. We conduct ablation experiments on these two key coefficients independently:

As shown in Fig. 7(a), different strengths of the geometric constraint yield stable performance gains. However, when $\lambda$ exceeds the optimal range (e.g., 1.1), the model tends to over-prioritize geometric consistency over uncertainty optimization, leading to worse performance. To balance the effects of the two components in our method, we set $\lambda$ to 0.7 by default. In Fig. 7(b), we observe that our method consistently outperforms prior SOTA methods across the tested values of the weighting coefficient $\alpha$ , demonstrating the robustness of our Conjugate Focal Loss weighting scheme. Empirically, we set $\alpha$ to 4 by default.

Notably, the default choice of $\alpha$ and $\gamma$ not only yields strong empirical performance but also aligns with the standard settings used in vanilla focal loss during source training. This compatibility echoes our theoretical analysis in Sec. 5.1, suggesting that hyperparameters can remain unchanged from the source phase. Such consistency removes the need for extensive hyperparameter tuning, significantly improving the efficiency and practicality of our adaptation strategy.

C.2 Running Time Comparison

In our experiments, we have demonstrated the effectiveness of DUO in various scenarios. In this subsection, we focus on the computational efficiency of DUO. Although our method relies on dual-branch optimization, the computation of the geometric constraint with efficient operators incurs only a slight time cost. As shown in Tab. 8, DUO’s running time is less than half that of DeYO, which relies heavily on data augmentation to optimize the uncertainty. Moreover, DUO’s adaptation efficiency exceeds that of MonoTTA, which only optimizes semantic uncertainty without addressing geometric uncertainty. Notably, processing 1k images with DUO adds only an extra 6 seconds compared to inference alone, underscoring the high efficiency of our dual-branch optimization framework.

Appendix D More Implementation Details

D.1 Baseline Methods

We compare our DUO with several state-of-the-art methods. TENT [46] reduces the entropy of test samples to guide model updates, prompting the model to generate more confident predictions. Building on this, EATA [36] incorporates a sample selection mechanism based on low uncertainty to specifically minimize entropy for the most reliable samples, thereby further reducing semantic uncertainty. DeYO [22] prioritizes samples with dominant shape information and employs a dual semantic uncertainty criterion to identify reliable samples for adaptation. MonoTTA [24] introduces a negative regularization term on low-score objects, leveraging their negative class information to reduce uncertainty.

D.2 Detailed Model Architecture

Our framework is built on a widely-adopted multi-branch architecture for monocular 3D object detection, where separate branches predict various object properties to simultaneously achieve recognition and spatial localization. In 3D detection, accurate depth estimation is a critical factor that significantly influences overall performance [32]. To enhance depth prediction, many existing models adopt a multi-head strategy that integrates diverse depth estimates to reduce the individual bias. For example, MonoFlex [55] combines direct regression with multiple keypoint estimation; MonoGround leverages ground plane priors for refined depth predictions [40]; and MonoCD exploits the complementary strengths of multiple prediction heads [53].

To effectively integrate the multi-head predictions, these models include an uncertainty estimation branch that quantifies the reliability of each depth prediction. The final depth estimation is computed as an uncertainty-weighted average, as shown in the following formulation:

[TABLE]

where $\sigma_{i}$ is the uncertainty of the corresponding depth estimation and $n$ is the number of depth heads. In the main paper, we use the average of $\log\sigma_{i}$ as the depth uncertainty metric.
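A small sketch of the fusion step follows. It assumes inverse-uncertainty weights, $z_{soft}=\left(\sum_{i}z_{i}/\sigma_{i}\right)/\left(\sum_{i}1/\sigma_{i}\right)$, which is one common form of uncertainty-weighted averaging and may differ in detail from the formulation above. The depth values and uncertainties are hypothetical.

```python
import math

def fuse_depths(depths, sigmas):
    """Uncertainty-weighted average with assumed weights 1 / sigma_i."""
    weights = [1.0 / s for s in sigmas]
    return sum(w * z for w, z in zip(weights, depths)) / sum(weights)

def mean_log_sigma(sigmas):
    """Depth uncertainty metric described in the text: the average of log(sigma_i)."""
    return sum(math.log(s) for s in sigmas) / len(sigmas)

depths = [10.0, 12.0, 14.0]  # hypothetical per-head depth estimates in meters

balanced = fuse_depths(depths, [1.0, 1.0, 1.0])    # all heads contribute equally
collapsed = fuse_depths(depths, [1e-6, 1.0, 1.0])  # head 0 dominates the average
```

With equal uncertainties the fused depth is the plain mean of the heads. When one head's $\sigma$ shrinks toward zero, the fused depth is determined almost entirely by that head, which is the single-head behavior described as model collapse in Sec. 4.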

Furthermore, the uncertainty regression loss for the entire depth branch is designed as:

[TABLE]

where $z^{*}$ is the ground-truth depth.
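To make the behavior of such a loss concrete, the sketch below assumes the widely used Laplacian form $\sum_{i}\left(|z_{i}-z^{*}|/\sigma_{i}+\log\sigma_{i}\right)$. The loss used in the paper may differ, so this is an illustration of the general mechanism only, and the numbers are hypothetical.

```python
import math

def uncertainty_regression_loss(depths, sigmas, target):
    """Assumed Laplacian form: sum_i |z_i - z*| / sigma_i + log(sigma_i)."""
    return sum(abs(z - target) / s + math.log(s) for z, s in zip(depths, sigmas))

# For a single head with a fixed error e, the loss e / sigma + log(sigma)
# is minimized at sigma = e (set the derivative -e / sigma^2 + 1 / sigma to zero).
error = 2.0
at_optimum = uncertainty_regression_loss([error], [error], 0.0)
too_small = uncertainty_regression_loss([error], [0.5 * error], 0.0)
too_large = uncertainty_regression_loss([error], [2.0 * error], 0.0)
assert at_optimum < too_small and at_optimum < too_large

# If the target coincides with a head's own prediction (zero error), the loss
# reduces to log(sigma) and keeps decreasing as sigma shrinks.
losses = [uncertainty_regression_loss([5.0], [s], 5.0) for s in (1.0, 0.1, 0.01)]
assert losses[0] > losses[1] > losses[2]
```

Under this assumed form, a supervised target keeps each $\sigma_{i}$ tied to the actual error. A pseudo-label that sits close to one head's prediction removes that anchor, which offers one hedged reading of why that head's uncertainty can fall without bound.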

For our TTA setting, we attempt to utilize the weighted average $z_{soft}$ as a pseudo-label to optimize this loss, directly optimizing the depth uncertainties. However, this approach can lead to model collapse—a phenomenon we analyze in detail in Sec. 4 of the main paper.

For MonoFlex [55] and MonoGround [40], we follow their original settings by using a randomly generated seed. Both MonoFlex and MonoGround employ the same modified DLA-34 [54] as their backbone network, with input resolutions of $384\times 1280$ for KITTI-C and $928\times 1600$ for nuScenes, respectively.

D.3 More Details on Dataset

KITTI-C Dataset. We follow the protocol from [55, 24] to partition the KITTI dataset into a training set (3712 images) and a validation set (3769 images) for model training and adaptation, respectively. For evaluation, we employ the KITTI-C version, which applies 13 distinct corruptions to the validation set—namely, Gaussian noise, shot noise, impulse noise, defocus blur, glass blur, motion blur, snow, frost, fog, brightness, contrast, pixelation, and saturation [14], as shown in Fig. 8. Each corruption is further divided into five severity levels, with higher levels indicating more extreme perturbations and distribution shifts.

nuScenes Dataset. For the four real-world scenarios in the nuScenes dataset, we first extract all front-view images and convert them to KITTI format using the official devkit [5]. Following [29], we partition these images into Daytime, Night, Sunny, and Rainy scenarios based on their scene descriptions. For each scenario, we train our model on the training split and evaluate its performance on the validation split (the number of images per scenario is shown in Fig. 9). Since the Night scenario contains fewer than 4k images with fewer objects (e.g., pedestrians), we report results only for the Car category.

Bibliography

[1] Eduardo Arnold, Omar Y Al-Jarrah, Mehrdad Dianati, Saber Fallah, David Oxtoby, and Alex Mouzakitis. A survey on 3d object detection methods for autonomous driving applications. IEEE Transactions on Intelligent Transportation Systems, 2019.
[2] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79:151–175, 2010.
[3] Dimitri Bertsekas, Angelia Nedic, and Asuman Ozdaglar. Convex analysis and optimization. Athena Scientific, 2003.
[4] Malik Boudiaf, Romain Mueller, Ismail Ben Ayed, and Luca Bertinetto. Parameter-free online test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8344–8353, 2022.
[5] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
[6] Joya Chen, Qi Wu, Dong Liu, and Tong Xu. Foreground-background imbalance problem in deep object detectors: A review. In 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pages 285–290. IEEE, 2020.
[7] Yongjian Chen, Lei Tai, Kai Sun, and Mingyang Li. Monopair: Monocular 3d object detection using pairwise spatial relationships. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12093–12102, 2020.
[8] Mingyu Ding, Yuqi Huo, Hongwei Yi, Zhe Wang, Jianping Shi, Zhiwu Lu, and Ping Luo. Learning depth-guided convolutions for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition workshops, pages 1000–1001, 2020.