Perceptual deep depth super-resolution

Oleg Voynov; Alexey Artemov; Vage Egiazarian; Alexander Notchenko,; Gleb Bobrovskikh; Denis Zorin; Evgeny Burnaev

arXiv:1812.09874·cs.CV·September 10, 2019

Perceptual deep depth super-resolution

Oleg Voynov, Alexey Artemov, Vage Egiazarian, Alexander Notchenko,, Gleb Bobrovskikh, Denis Zorin, Evgeny Burnaev

PDF

1 Repo

TL;DR

This paper presents a deep learning-based method for super-resolving depth maps by optimizing visual appearance in 3D renderings, leading to improved 3D shape quality for applications like virtual reality.

Contribution

It introduces a novel perceptual loss based on 3D surface renderings for depth super-resolution, enhancing shape quality over existing methods.

Findings

01

Significant improvement in 3D shape quality using the proposed perceptual loss.

02

Effective use of deep prior and CNN-based models for depth upsampling.

03

Outperforms existing optimization and learning-based techniques in perceptual metrics.

Abstract

RGBD images, combining high-resolution color and lower-resolution depth from various types of depth sensors, are increasingly common. One can significantly improve the resolution of depth maps by taking advantage of color information; deep learning methods make combining color and depth information particularly easy. However, fusing these two sources of data may lead to a variety of artifacts. If depth maps are used to reconstruct 3D shapes, e.g., for virtual reality applications, the visual quality of upsampled images is particularly important. The main idea of our approach is to measure the quality of depth map upsampling using renderings of resulting 3D surfaces. We demonstrate that a simple visual appearance-based loss, when used with either a trained CNN or simply a deep prior, yields significantly improved 3D shapes, as measured by a number of existing perceptual metrics. We…

Figures18

Click any figure to enlarge with its caption.

Tables11

Table 1. Table 1: Quantitative evaluation on SimGeo dataset. RMSE d subscript RMSE d \mathrm{RMSE_{d}} is in millimeters, other metrics are in thousandths. Lower values correspond to better results. The best result is in bold, the second best is underlined.

	Sphere and cylinder, x4				Lucy, x4				Cube, x4				SimGeo average, x4
	${RMSE}_{d}$	${DSSIM}_{v}$	${LPIPS}_{v}$	${RMSE}_{v}$	${RMSE}_{d}$	${DSSIM}_{v}$	${LPIPS}_{v}$	${RMSE}_{v}$	${RMSE}_{d}$	${DSSIM}_{v}$	${LPIPS}_{v}$	${RMSE}_{v}$	${RMSE}_{d}$	${DSSIM}_{v}$	${LPIPS}_{v}$	${RMSE}_{v}$
SRfS [14]	70	887	1025	417	82	811	781	367	52	934	1036	361	61	711	869	311
EG [54]	55	143	326	130	69	357	426	220	43	113	214	105	53	168	306	136
PDN [39]	157	198	295	150	173	456	368	251	164	156	250	145	162	224	278	165
DG [13]	56	265	372	166	69	523	558	249	44	218	411	139	54	293	420	171
Bicubic	57	189	313	189	72	355	398	267	44	131	287	160	55	197	320	193
DIP [47]	46	965	1062	548	53	827	615	344	45	963	906	530	52	887	893	395
MSG [22]	41	626	859	229	54	444	480	259	29	445	687	176	39	374	569	194
DIP-V	28	560	766	142	44	421	446	223	26	352	613	146	33	313	524	147
MSG-V	99	94	267	96	74	205	251	156	102	70	179	77	96	95	194	99

Table 2. Table 2: Quantitative evaluation on ICL-NUIM and Middlebury datasets. All metrics are in thousandths. Lower values correspond to better results. The best result is in bold, the second best is underlined.

	Plant						Vintage						Recycle						Umbrella
	${DSSIM}_{v}$		${LPIPS}_{v}$		${RMSE}_{v}$		${DSSIM}_{v}$		${LPIPS}_{v}$		${RMSE}_{v}$		${DSSIM}_{v}$		${LPIPS}_{v}$		${RMSE}_{v}$		${DSSIM}_{v}$		${LPIPS}_{v}$		${RMSE}_{v}$
	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8
SRfS [14]	658	692	632	649	280	309	721	749	631	634	346	382	715	772	610	623	376	410	843	853	797	831	397	443
EG [54]	568		677		255
PDN [39]	574	612	659	699	269	305	663	714	706	700	319	350	635	701	523	589	364	457	799	828	847	882	367	452
DG [13]	611	622	745	785	268	291	666	669	796	840	290	300	696	719	602	617	328	383	846	878	781	856	399	457
Bicubic	562	610	688	763	249	290	558	649	602	729	258	302	575	721	474	576	329	398	749	837	747	886	323	380
DIP [47]	919	880	764	723	490	437	953	965	910	872	656	687	871	923	576	605	434	500	915	953	737	722	467	528
MSG [22]	571	645	582	495	234	285	708	785	510	610	292	364	741	869	624	661	485	550	834	896	678	787	442	496
DIP-V	694	707	463	555	262	276	804	884	579	674	343	435	575	735	388	485	273	332	796	854	604	598	318	352
MSG-V	524	575	639	720	194	236	536	643	670	702	211	268	603	737	520	564	368	473	778	842	800	890	348	427

Table 3. Table 3: Quantitative evaluation summary. The best result is in bold, the second best in underlined.

	Average performance on SimGeo dataset
	${RMSE}_{d}$		${BadPix}_{d} (5 c m)$		${BadPix}_{v} (5)$		${DSSIM}_{v}$		${LPIPS}_{v}$		${Bumpiness}_{d}$		${RMSE}_{v}$		User, 1st	User, 2nd	Top 2
	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x4	x4
Bicubic	55	79	4.1	7.9	23.2	38.3	197	301	320	427	0.70	0.98	193	234	0.5	7.8	8.3
SRfS [14]	61	88	7.5	14.3	74.2	77.1	711	729	869	865	1.48	1.69	311	328	0.0	0.0	0.0
EG [54]	53		2.2		33.1		168		306		0.54		136		0.2	3.9	4.2
PDN [39]	162	211	99.4	99.1	39.2	45.1	224	264	278	407	0.63	0.79	165	201	1.6	12.3	13.8
DG [13]	54	84	3.0	6.4	35.2	39.1	293	316	420	437	0.69	0.82	171	190	0.2	2.9	3.2
DIP [47]	52	59	8.5	12.5	90.5	92.0	887	880	893	915	2.21	2.77	395	475	0.6	0.9	1.5
MSG [22]	39	39	1.5	3.3	51.9	69.3	374	544	569	713	0.79	0.97	194	242	0.4	3.7	4.0
DIP-v	33	41	1.7	2.3	49.7	67.1	313	491	524	598	0.60	0.88	147	174	8.3	59.4	67.8
MSG-v	96	29	0.7	1.5	14.2	34.6	95	206	194	367	0.34	0.46	99	129	88.1	9.1	97.2
	Average performance on ICL-NUIM dataset
Bicubic	34	54	2.8	5.5	59.3	64.2	431	490	558	668	1.15	1.32	210	252	5.0	28.3	33.3
SRfS [14]	42	62	5.5	11.0	73.5	76.1	641	664	636	660	1.72	1.83	287	314	0.0	0.0	0.0
PDN [39]	135	165	93.8	82.9	66.2	70.2	480	509	623	650	1.14	1.24	237	264	2.6	10.5	13.1
DG [13]	36	58	4.3	6.4	64.4	65.5	497	505	663	689	1.28	1.32	234	259	0.6	5.5	6.1
DIP [47]	43	56	10.6	14.2	83.6	83.4	812	806	690	690	2.73	2.58	394	389	1.1	0.9	2.0
MSG [22]	25	36	1.6	3.5	64.1	69.0	489	557	510	534	1.27	1.46	210	255	1.1	7.2	8.3
DIP-v	28	40	2.6	3.9	67.8	69.6	516	548	407	503	1.45	1.56	209	236	9.6	31.9	41.4
MSG-v	24	41	1.3	3.1	56.3	61.1	387	437	527	602	0.94	1.06	157	192	79.9	11.8	91.7
	Average performance on Middlebury dataset
Bicubic	843	1139	10.8	13.9	71.5	76.7	648	748	575	720	0.87	0.76	344	386	4.1	25.3	29.4
SRfS [14]	100	145	21.4	33.6	86.4	89.5	780	810	669	704	1.32	1.28	428	461	0.0	0.0	0.0
PDN [39]	173	225	85.3	76.5	83.4	86.4	744	790	653	711	1.38	1.67	405	467	9.9	28.1	37.9
DG [13]	266	330	15.0	24.5	81.8	84.1	765	784	728	740	1.54	1.73	421	442	0.7	10.6	11.3
DIP [47]	72	104	19.6	24.4	92.4	93.4	927	947	737	717	2.82	2.90	565	592	1.2	5.6	6.8
MSG [22]	228	426	10.8	13.1	81.8	87.2	774	858	649	696	1.96	2.19	477	525	0.2	1.6	1.8
DIP-v	56	87	6.4	10.6	83.1	87.4	728	821	506	568	1.34	1.56	353	409	72.3	18.2	90.5
MSG-v	96	133	7.3	9.2	73.3	79.0	667	757	639	690	1.20	1.35	376	431	10.8	9.9	20.7
	Average performance on the scenes without missing measurements (SimGeo, ICL-NUIM, Vintage)
	${RMSE}_{d}$		${BadPix}_{d} (5 c m)$		${BadPix}_{v} (5)$		${DSSIM}_{v}$		${LPIPS}_{v}$		${Bumpiness}_{d}$		${RMSE}_{v}$		User, 1st	User, 2nd	Top 2
	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x4	x4
Bicubic	46	69	3.5	6.9	43.7	53.3	333	415	452	561	0.97	1.19	206	248	3.0	18.9	21.9
SRfS [14]	55	81	7.3	14.1	74.6	77.4	680	701	743	753	1.60	1.75	303	326	0.0	0.0	0.0
PDN [39]	148	187	94.4	90.1	55.0	59.8	378	412	482	542	0.93	1.06	210	241	1.9	10.5	12.4
DG [13]	47	73	3.9	6.7	52.1	54.4	416	430	561	584	1.02	1.10	209	230	0.4	4.0	4.4
DIP [47]	50	62	10.7	15.9	87.6	88.2	857	853	801	808	2.59	2.79	414	452	0.8	0.8	1.7
MSG [22]	33	39	1.7	3.6	59.8	70.3	454	569	547	622	1.08	1.26	210	258	0.7	5.8	6.4
DIP-v	31	43	2.2	3.3	60.8	69.9	444	548	474	560	1.09	1.32	191	223	10.2	45.5	55.8
MSG-v	38	38	1.1	2.6	38.0	50.1	264	346	385	501	0.69	0.81	135	169	82.7	10.9	93.6
	Average performance on the scenes with missing measurements (Middlebury excluding Vintage)
Bicubic	972	1313	11.8	14.7	71.3	76.6	663	765	570	718	0.77	0.61	358	400	3.8	24.7	28.5
SRfS [14]	100	145	22.2	33.8	86.9	89.8	790	820	676	716	1.26	1.21	441	474	0.0	0.0	0.0
PDN [39]	178	234	88.3	76.1	83.5	86.6	757	803	644	713	1.36	1.69	419	487	11.5	32.8	44.3
DG [13]	298	367	16.3	26.9	82.2	84.7	781	803	716	724	1.55	1.76	442	465	0.9	12.2	13.1
DIP [47]	72	102	18.8	20.7	92.2	93.3	923	943	708	691	2.62	2.68	549	577	1.2	6.4	7.7
MSG [22]	259	488	12.1	14.1	82.0	87.7	785	870	673	711	2.02	2.24	507	552	0.2	0.2	0.5
DIP-v	58	91	7.1	11.4	82.7	87.2	716	811	494	550	1.23	1.41	354	405	80.1	13.8	93.9
MSG-v	107	145	8.1	9.8	73.6	79.2	688	776	634	688	1.19	1.34	404	458	1.3	8.8	10.1
	Average performance on SimGeo, ICL-NUIM, Middlebury
Bicubic	339	462	6.1	9.3	52.4	60.6	437	525	489	611	0.91	1.00	254	296	3.3	20.7	24.0
SRfS [14]	69	101	12.0	20.3	78.5	81.3	715	738	722	741	1.50	1.58	347	372	0.0	0.0	0.0
PDN [39]	157	202	92.4	85.7	64.0	68.2	498	536	533	596	1.07	1.26	276	319	5.0	17.5	22.5
DG [13]	126	166	7.8	13.1	61.6	64.0	531	548	610	628	1.19	1.31	283	305	0.5	6.6	7.1
DIP [47]	57	75	13.3	17.4	89.1	89.8	878	881	771	771	2.60	2.76	457	491	1.0	2.6	3.6
MSG [22]	104	181	5.0	6.9	66.8	75.8	559	664	587	650	1.37	1.57	304	350	0.5	4.0	4.6
DIP-v	40	58	3.8	5.9	67.7	75.4	530	631	481	557	1.14	1.34	242	280	32.3	35.5	67.8
MSG-v	60	72	3.3	4.8	49.2	59.3	398	482	464	560	0.85	0.98	220	260	57.0	10.2	67.3

Table 4. Table 4: Quantitative evaluation on “Cube” with different RGBs from SimGeo dataset. The best result is in bold, the second best is underlined.

	Cube, high-frequency texture
	${RMSE}_{d}$		${BadPix}_{d} (5 c m)$		${BadPix}_{v} (5)$		${DSSIM}_{v}$		${LPIPS}_{v}$		${Bumpiness}_{d}$		${RMSE}_{v}$		User, 1st	User, 2nd	Top 2
	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x4	x4
Bicubic	44	63	2.7	5.2	15.0	27.3	131	204	287	395	0.43	0.61	160	188	0.7	13.2	14.0
SRfS [14]	52	75	6.2	12.1	89.2	80.3	934	818	1036	938	1.73	1.67	361	339	0.0	0.0	0.0
EG [54]	43		1.2		25.4		113		214		0.35		105		0.7	9.6	10.3
PDN [39]	164	219	99.6	99.4	27.5	29.5	156	186	250	368	0.39	0.49	145	171	0.7	4.4	5.1
DG [13]	44	67	1.9	4.2	26.4	30.1	218	240	411	437	0.44	0.55	139	159	0.7	7.4	8.1
DIP [47]	45	48	6.4	8.5	93.5	92.5	963	947	906	918	2.98	2.50	530	494	0.0	0.7	0.7
MSG [22]	29	38	1.0	2.6	60.1	77.9	445	653	687	877	0.79	0.98	176	233	0.0	0.0	0.0
DIP-v	26	36	0.8	1.6	56.2	60.8	352	413	613	653	0.64	0.89	146	162	5.9	58.1	64.0
MSG-v	102	20	0.3	0.7	9.3	51.0	70	316	179	676	0.20	0.39	77	125	91.2	6.6	97.8
	Cube, no texture
Bicubic	44	63	2.7	5.2	15.0	27.3	131	204	287	395	0.43	0.61	160	188	0.0	5.9	5.9
SRfS [14]	43	63	2.1	4.5	53.4	51.7	516	476	754	728	0.67	0.89	219	228	0.0	0.0	0.0
EG [54]	43		1.2		25.4		128		282		0.35		105		0.0	2.9	2.9
PDN [39]	164	219	99.6	99.4	26.3	29.3	162	185	314	353	0.38	0.49	145	171	0.7	4.4	5.1
DG [13]	44	67	1.9	4.2	26.4	30.1	218	240	411	437	0.44	0.55	139	159	0.0	2.2	2.2
DIP [47]	72	56	23.2	17.3	94.3	99.1	912	980	1026	1133	2.05	4.22	434	683	0.0	0.7	0.7
MSG [22]	29	26	1.0	1.7	30.7	49.8	199	314	509	642	0.42	0.47	157	171	0.0	6.6	6.6
DIP-v	26	35	0.8	1.4	15.1	45.4	95	237	347	478	0.28	0.35	111	107	1.5	76.5	77.9
MSG-v	9	19	0.3	0.4	6.0	13.0	50	73	141	213	0.17	0.21	77	82	97.8	0.7	98.5

Table 5. Table 5: Quantitative evaluation on “Sphere and cylinder” with different RGBs from SimGeo dataset. The best result is in bold, the second best is underlined.

	Sphere and cylinder, high-frequency texture
	${RMSE}_{d}$		${BadPix}_{d} (5 c m)$		${BadPix}_{v} (5)$		${DSSIM}_{v}$		${LPIPS}_{v}$		${Bumpiness}_{d}$		${RMSE}_{v}$		User, 1st	User, 2nd	Top 2
	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x4	x4
Bicubic	57	82	4.1	8.1	20.1	36.7	189	294	313	420	0.67	0.98	189	234	0.0	0.7	0.7
SRfS [14]	70	102	12.1	24.6	91.9	91.8	887	865	1025	1008	2.43	2.41	417	403	0.0	0.0	0.0
EG [54]	55		2.4		30.4		143		326		0.50		130		0.0	1.5	1.5
PDN [39]	157	197	99.3	98.9	40.7	54.1	198	242	295	461	0.60	0.77	150	187	1.5	9.6	11.0
DG [13]	56	87	3.2	6.3	30.9	35.2	265	285	372	386	0.66	0.77	166	180	0.0	1.5	1.5
DIP [47]	46	69	3.9	27.2	97.0	99.2	965	975	1062	1014	4.01	4.80	548	696	1.5	2.9	4.4
MSG [22]	41	41	1.4	3.6	72.6	85.6	626	820	859	960	0.98	1.43	229	314	0.0	0.0	0.0
DIP-v	28	43	1.2	2.4	69.2	86.0	560	850	766	832	0.56	1.45	142	242	32.4	52.9	85.3
MSG-v	99	37	0.6	2.0	14.3	53.0	94	334	267	583	0.29	0.55	96	164	64.7	30.9	95.6
	Sphere and cylinder, no texture
Bicubic	57	82	4.1	8.1	20.2	36.8	189	294	325	437	0.67	0.98	190	233	0.0	0.7	0.7
SRfS [14]	59	85	4.6	8.6	51.4	70.8	430	619	657	766	0.77	1.25	193	256	0.0	0.0	0.0
EG [54]	56		2.4		30.9		160		383		0.50		128		0.0	1.5	1.5
PDN [39]	157	197	99.3	98.9	38.0	44.1	202	218	294	386	0.58	0.76	150	186	5.9	17.6	23.5
DG [13]	57	87	3.2	6.4	31.0	35.3	265	284	396	409	0.66	0.78	165	180	0.7	2.2	2.9
DIP [47]	49	56	5.0	5.5	85.6	81.6	856	662	927	723	1.01	0.96	244	249	1.5	0.0	1.5
MSG [22]	40	37	1.4	3.1	45.6	64.5	288	444	509	610	0.65	0.76	183	218	0.7	0.7	1.5
DIP-v	35	39	1.4	1.8	41.0	72.6	210	523	517	643	0.47	0.70	130	141	9.6	64.7	74.3
MSG-v	14	27	0.7	1.3	8.5	18.0	77	93	174	200	0.27	0.32	96	110	81.6	12.5	94.1
	Sphere and cylinder, low-frequency texture
Bicubic	57	82	4.1	8.1	20.1	36.7	189	294	313	420	0.67	0.98	189	234	0.0	2.2	2.2
SRfS [14]	62	91	6.8	14.9	74.9	81.0	691	738	961	956	1.38	1.65	311	335	0.0	0.0	0.0
EG [54]	54		2.4		30.4		160		377		0.50		129		0.7	7.4	8.1
PDN [39]	157	197	99.3	98.9	37.9	44.5	202	219	299	397	0.58	0.76	150	186	0.7	36.0	36.8
DG [13]	56	87	3.2	6.3	30.9	35.2	265	285	372	386	0.66	0.77	166	180	0.0	3.7	3.7
DIP [47]	49	52	8.0	4.9	85.5	84.7	796	812	821	924	1.19	1.18	267	250	0.0	0.0	0.0
MSG [22]	41	41	1.3	3.0	39.6	66.2	264	458	493	612	0.64	0.74	181	213	0.0	1.5	1.5
DIP-v	38	42	1.7	2.2	48.0	60.4	238	351	456	516	0.50	0.61	128	152	0.7	47.8	48.5
MSG-v	16	26	0.7	1.2	8.5	17.5	76	92	156	181	0.27	0.31	97	100	97.8	1.5	99.3

Table 6. Table 6: Quantitative evaluation on “Lucy” from SimGeo dataset. The best result is in bold, the second best is underlined.

	Lucy
	${RMSE}_{d}$		${BadPix}_{d} (5 c m)$		${BadPix}_{v} (5)$		${DSSIM}_{v}$		${LPIPS}_{v}$		${Bumpiness}_{d}$		${RMSE}_{v}$		User, 1st	User, 2nd	Top 2
	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x4	x4
Bicubic	72	103	6.8	13.0	48.8	65.0	355	519	398	497	1.37	1.74	267	328	2.2	24.3	26.5
SRfS [14]	82	113	13.2	20.8	84.6	87.1	811	857	781	792	1.90	2.28	367	407	0.0	0.0	0.0
EG [54]	69		3.5		56.2		357		426		1.05		220		0.0	0.7	0.7
PDN [39]	173	234	99.0	98.8	64.9	68.9	456	535	368	480	1.24	1.47	251	303	0.0	1.5	1.5
DG [13]	69	108	4.9	11.0	65.5	68.6	523	562	558	565	1.28	1.50	249	281	0.0	0.7	0.7
DIP [47]	53	75	4.7	11.4	87.4	95.2	827	908	615	778	2.02	2.93	344	478	0.7	0.7	1.5
MSG [22]	54	53	2.7	5.4	62.9	71.7	444	577	480	578	1.30	1.42	259	306	1.5	13.2	14.7
DIP-v	44	55	4.6	4.4	69.0	77.5	421	574	446	468	1.15	1.27	223	239	0.0	56.6	56.6
MSG-v	74	47	1.6	3.7	38.8	55.0	205	325	251	348	0.82	0.96	156	195	95.6	2.2	97.8

Table 7. Table 7: Quantitative evaluation on RGBD frames from ICL-NUIM “Living Room” sequence. The best result is in bold, the second best is underlined.

	Painting
	${RMSE}_{d}$		${BadPix}_{d} (5 c m)$		${BadPix}_{v} (5)$		${DSSIM}_{v}$		${LPIPS}_{v}$		${Bumpiness}_{d}$		${RMSE}_{v}$		User, 1st	User, 2nd	Top 2
	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x4	x4
Bicubic	28	47	2.5	5.6	57.1	64.1	423	514	544	649	0.95	1.15	213	265	4.4	47.8	52.2
SRfS [14]	39	60	6.5	15.9	78.4	81.2	707	722	612	661	1.47	1.55	308	337	0.0	0.0	0.0
EG [54]	36		3.1		61.9		481		720		0.94		231		0.0	3.7	3.7
PDN [39]	151	215	99.3	99.2	65.2	70.2	488	532	669	709	0.89	1.01	237	275	4.4	10.3	14.7
DG [13]	31	49	2.4	5.5	61.9	63.9	503	506	678	700	1.08	1.13	232	272	0.7	3.7	4.4
DIP [47]	30	37	4.0	4.7	80.4	79.5	802	766	630	612	2.18	1.82	362	341	0.0	0.0	0.0
MSG [22]	21	29	1.2	2.2	63.7	67.9	495	570	475	507	0.97	1.12	203	243	2.2	5.1	7.4
DIP-v	22	32	2.3	3.2	70.1	70.3	567	564	386	501	1.07	1.12	210	239	2.9	21.3	24.3
MSG-v	17	34	0.9	1.8	51.4	58.0	354	410	532	607	0.67	0.77	142	170	85.3	8.1	93.4
	Sofa
Bicubic	38	58	1.8	3.6	75.4	77.0	566	616	704	764	2.12	2.33	212	250	3.7	15.4	19.1
SRfS [14]	39	58	2.0	3.5	82.3	88.1	715	832	631	743	2.97	3.45	310	405	0.0	0.0	0.0
EG [54]	42		2.5		79.0		598		767		2.28		213		0.0	8.8	8.8
PDN [39]	86	91	71.0	70.8	83.3	83.0	641	658	784	763	2.40	2.50	260	264	0.7	3.7	4.4
DG [13]	41	63	3.2	4.4	77.7	77.9	624	632	823	855	2.30	2.33	255	263	0.0	5.1	5.1
DIP [47]	45	57	7.1	12.7	93.1	94.0	928	946	758	738	3.91	3.99	518	560	0.0	0.0	0.0
MSG [22]	27	36	1.2	2.3	80.6	85.7	718	791	606	610	2.71	3.22	254	316	0.0	0.7	0.7
DIP-v	27	43	0.9	2.0	79.1	82.5	645	718	414	585	2.67	3.07	215	266	19.1	47.8	66.9
MSG-v	35	44	0.7	1.6	74.0	75.7	537	585	710	759	1.96	2.10	165	196	76.5	18.4	94.9
	Plant
Bicubic	38	58	3.7	6.4	75.9	79.9	562	610	688	763	1.58	1.79	249	290	1.5	22.1	23.5
SRfS [14]	46	65	5.8	9.5	82.9	85.0	658	692	632	649	1.96	2.13	280	309	0.0	0.0	0.0
EG [54]	43		4.5		82.2		568		677		1.64		255		0.0	0.7	0.7
PDN [39]	88	89	94.5	37.8	79.5	82.5	574	612	659	699	1.46	1.60	269	305	4.4	7.4	11.8
DG [13]	40	63	3.9	6.7	79.5	81.1	611	622	745	785	1.67	1.70	268	291	2.2	11.0	13.2
DIP [47]	38	47	6.9	6.1	93.9	92.8	919	880	764	723	4.33	3.95	490	437	0.0	0.7	0.7
MSG [22]	31	44	2.3	3.7	78.0	81.8	571	645	582	495	1.62	1.84	234	285	0.0	11.8	11.8
DIP-v	31	40	4.7	4.8	83.5	84.1	694	707	463	555	2.25	2.21	262	276	11.0	33.1	44.1
MSG-v	27	44	1.8	3.9	74.3	77.8	524	575	639	720	1.31	1.47	194	236	80.9	13.2	94.1

Table 8. Table 8: Quantitative evaluation on RGBD frames from ICL-NUIM “Office Room” sequence. The best result is in bold, the second best is underlined.

	Office
	${RMSE}_{d}$		${BadPix}_{d} (5 c m)$		${BadPix}_{v} (5)$		${DSSIM}_{v}$		${LPIPS}_{v}$		${Bumpiness}_{d}$		${RMSE}_{v}$		User, 1st	User, 2nd	Top 2
	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x4	x4
Bicubic	47	80	4.0	7.8	24.4	34.1	216	285	412	594	0.81	0.95	208	254	19.9	44.1	64.0
SRfS [14]	49	89	5.8	14.4	53.4	54.4	595	593	690	636	1.71	1.66	298	302	0.0	0.0	0.0
PDN [39]	185	185	99.3	90.5	36.5	50.2	250	294	457	518	0.76	0.92	234	272	0.7	3.7	4.4
DG [13]	49	85	9.0	11.9	36.5	37.6	319	330	534	571	1.03	1.05	240	266	0.0	0.7	0.7
DIP [47]	76	109	30.3	48.2	72.1	73.9	726	819	690	797	2.45	2.70	372	408	1.5	1.5	2.9
MSG [22]	35	48	2.4	6.8	35.4	44.5	263	360	415	543	0.83	0.95	199	247	2.2	3.7	5.9
DIP-v	40	65	3.8	7.4	45.4	47.9	311	352	414	504	1.08	1.18	205	235	17.6	25.0	42.6
MSG-v	32	65	1.9	5.3	19.3	29.6	157	224	313	432	0.59	0.72	151	198	58.1	21.3	79.4
	Coat rack
Bicubic	13	20	1.5	3.0	73.1	75.3	507	539	537	651	0.54	0.60	171	196	0.0	19.1	19.1
SRfS [14]	24	28	3.8	5.4	82.3	80.5	672	556	650	612	0.83	0.57	237	203	0.0	0.0	0.0
EG [54]	13		1.2		77.8		541		550		0.55		186		0.7	7.4	8.1
PDN [39]	140	191	99.6	99.9	77.1	78.0	544	557	621	631	0.48	0.50	178	193	5.1	28.7	33.8
DG [13]	13	20	1.4	3.2	74.4	75.6	530	532	593	621	0.54	0.58	166	201	0.0	9.6	9.6
DIP [47]	15	24	1.9	3.5	85.5	85.4	766	701	625	624	1.15	0.97	256	246	4.4	2.2	6.6
MSG [22]	11	17	0.9	1.6	73.3	76.1	522	546	523	554	0.51	0.55	165	189	2.2	16.2	18.4
DIP-v	13	17	1.8	2.0	75.2	75.5	543	542	422	463	0.62	0.60	171	181	0.7	12.5	13.2
MSG-v	11	18	0.8	2.2	71.4	74.2	482	502	516	563	0.42	0.48	136	161	86.8	4.4	91.2
	Displays
Bicubic	41	63	3.2	6.4	49.9	54.9	315	374	460	585	0.92	1.08	208	256	0.7	21.3	22.1
SRfS [14]	53	75	9.0	17.3	61.9	67.3	500	591	599	659	1.35	1.60	288	328	0.0	0.0	0.0
EG [54]	46		5.9		66.7		388		587		0.94		216		0.0	2.9	2.9
PDN [39]	159	220	99.2	99.0	55.4	57.2	381	403	547	580	0.85	0.95	242	275	0.0	9.6	9.6
DG [13]	43	66	5.8	6.7	56.5	56.7	395	406	606	601	1.06	1.10	243	265	0.7	2.9	3.7
DIP [47]	52	60	13.4	9.7	76.9	74.6	732	724	672	645	2.36	2.06	365	344	0.7	0.7	1.5
MSG [22]	26	42	1.7	4.4	53.9	58.0	367	430	461	493	0.97	1.08	204	251	0.0	5.9	5.9
DIP-v	32	45	2.4	4.0	53.7	57.6	336	407	344	409	1.00	1.18	191	221	5.9	51.5	57.4
MSG-v	23	43	1.4	3.5	47.2	51.0	271	324	451	531	0.69	0.80	152	190	91.9	5.1	97.1

Table 9. Table 9: Quantitative evaluation on samples with small number of missing measurements from Middlebury dataset. The best result is in bold, the second best is underlined.

	Vintage
	${RMSE}_{d}$		${BadPix}_{d} (5 c m)$		${BadPix}_{v} (5)$		${DSSIM}_{v}$		${LPIPS}_{v}$		${Bumpiness}_{d}$		${RMSE}_{v}$		User, 1st	User, 2nd	Top 2
	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x4	x4
Bicubic	67	98	4.6	9.0	72.8	77.3	558	649	602	729	1.51	1.64	258	302	5.9	28.7	34.6
SRfS [14]	101	145	16.8	32.3	83.7	87.2	721	749	631	634	1.64	1.68	346	382	0.0	0.0	0.0
PDN [39]	140	174	67.6	79.0	82.3	85.7	663	714	706	700	1.51	1.57	319	350	0.0	0.0	0.0
DG [13]	72	107	7.1	10.4	79.4	80.1	666	669	796	840	1.50	1.52	290	300	0.0	0.7	0.7
DIP [47]	74	117	24.8	46.9	93.6	94.2	953	965	910	872	4.01	4.16	656	687	0.7	0.7	1.5
MSG [22]	41	59	3.2	6.8	80.6	84.6	708	785	510	610	1.62	1.85	292	364	0.0	9.6	9.6
DIP-v	42	67	2.7	5.9	85.2	88.8	804	884	579	674	1.94	2.48	343	435	25.7	44.1	69.9
MSG-v	33	65	2.5	5.9	71.4	77.6	536	643	670	702	1.29	1.43	211	268	67.6	16.2	83.8
	Recycle
Bicubic	587	880	9.2	16.6	70.6	78.6	575	721	474	576	1.23	1.17	329	398	0.0	11.0	11.0
SRfS [14]	47	72	10.2	22.1	86.1	88.8	715	772	610	623	1.68	1.81	376	410	0.0	0.0	0.0
PDN [39]	95	128	90.5	79.8	84.0	85.7	635	701	523	589	1.66	2.18	364	457	0.0	6.6	6.6
DG [13]	39	82	3.4	11.7	81.6	83.6	696	719	602	617	1.75	1.99	328	383	2.9	65.4	68.4
DIP [47]	29	45	3.9	9.3	91.0	91.8	871	923	576	605	2.95	3.31	434	500	1.5	5.9	7.4
MSG [22]	106	1182	5.8	11.9	82.8	89.6	741	869	624	661	2.60	3.01	485	550	0.7	0.0	0.7
DIP-v	20	34	1.5	4.2	78.9	85.0	575	735	388	485	1.56	1.86	273	332	94.9	3.7	98.5
MSG-v	51	76	3.9	7.9	73.9	82.1	603	737	520	564	1.66	2.02	368	473	0.0	7.4	7.4

Table 10. Table 10: Quantitative evaluation on samples with small number of missing measurements from Middlebury dataset. The best result is in bold, the second best is underlined.

	Umbrella
	${RMSE}_{d}$		${BadPix}_{d} (5 c m)$		${BadPix}_{v} (5)$		${DSSIM}_{v}$		${LPIPS}_{v}$		${Bumpiness}_{d}$		${RMSE}_{v}$		User, 1st	User, 2nd	Top 2
	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x4	x4
Bicubic	1013	1507	6.9	12.1	77.7	80.9	749	837	747	886	0.60	0.60	323	380	5.9	35.3	41.2
SRfS [14]	148	217	19.4	35.5	87.5	90.8	843	853	797	831	0.71	0.78	397	443	0.0	0.0	0.0
PDN [39]	220	287	94.9	89.1	86.6	88.1	799	828	847	882	0.79	1.13	367	452	3.7	22.8	26.5
DG [13]	365	507	9.1	20.3	84.6	87.3	846	878	781	856	0.92	1.36	399	457	0.0	0.7	0.7
DIP [47]	138	145	48.5	21.6	90.5	93.2	915	953	737	722	1.19	1.65	467	528	2.9	16.2	19.1
MSG [22]	292	555	7.4	12.4	84.3	88.1	834	896	678	787	1.27	1.47	442	496	0.0	0.7	0.7
DIP-v	91	129	3.4	5.7	83.4	85.3	796	854	604	598	0.67	0.79	318	352	82.4	8.1	90.4
MSG-v	129	218	5.2	9.7	79.1	82.3	778	842	800	890	0.72	0.89	348	427	5.1	16.2	21.3
	Classroom1
Bicubic	966	1371	6.7	9.0	75.8	78.3	636	728	581	784	0.41	0.30	268	295	12.5	37.5	50.0
SRfS [14]	135	202	18.5	28.5	82.6	85.7	761	781	718	756	0.62	0.62	332	363	0.0	0.0	0.0
PDN [39]	239	324	96.0	91.0	81.5	82.9	739	759	751	807	0.62	0.76	279	342	16.9	26.5	43.4
DG [13]	307	503	8.8	16.8	82.0	82.7	743	762	766	812	0.74	0.87	313	337	0.0	1.5	1.5
DIP [47]	96	145	17.0	22.4	94.4	94.6	956	952	789	751	1.94	2.12	540	557	0.0	1.5	1.5
MSG [22]	297	408	7.3	10.0	81.2	83.8	723	810	626	604	0.90	1.01	351	391	0.0	0.7	0.7
DIP-v	69	117	4.1	9.3	81.0	86.0	700	789	516	537	0.64	0.86	266	327	64.0	18.4	82.4
MSG-v	127	203	5.4	8.4	76.9	79.4	678	735	739	803	0.60	0.64	283	330	2.2	11.8	14.0

Table 11. Table 11: Quantitative evaluation on samples with large number of missing measurements from Middlebury dataset. The best result is in bold, the second best is underlined.

	Playroom
	${RMSE}_{d}$		${BadPix}_{d} (5 c m)$		${BadPix}_{v} (5)$		${DSSIM}_{v}$		${LPIPS}_{v}$		${Bumpiness}_{d}$		${RMSE}_{v}$		User, 1st	User, 2nd	Top 2
	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x8	x4	x4	x4
Bicubic	1263	1744	14.4	20.4	72.0	76.9	684	783	509	675	0.80	0.52	386	441	0.0	2.2	2.2
SRfS [14]	97	151	26.9	42.1	88.1	91.2	802	829	663	715	1.24	1.08	493	540	0.0	0.0	0.0
PDN [39]	181	253	85.0	69.7	86.3	89.2	820	862	583	656	1.54	1.88	472	543	25.7	61.8	87.5
DG [13]	425	133	22.4	25.0	85.4	86.0	845	826	779	691	1.96	1.63	519	469	1.5	2.2	3.7
DIP [47]	58	91	18.4	20.0	93.0	93.2	941	937	647	612	3.09	2.86	602	592	1.5	2.9	4.4
MSG [22]	433	349	16.1	22.3	85.8	89.9	855	911	685	705	2.51	2.74	576	616	0.0	0.0	0.0
DIP-v	49	83	5.4	12.2	83.8	88.5	728	847	459	530	1.29	1.52	357	433	70.6	27.2	97.8
MSG-v	112	166	9.4	15.5	75.2	80.1	721	810	565	615	1.46	1.67	453	510	0.0	3.7	3.7
	Backpack
Bicubic	985	1078	14.3	11.5	62.7	69.4	639	730	564	692	0.60	0.45	392	424	2.2	34.6	36.8
SRfS [14]	69	83	18.9	25.5	89.9	89.9	831	847	630	651	1.37	1.26	500	505	0.0	0.0	0.0
PDN [39]	173	207	81.6	65.2	80.4	85.4	770	820	609	719	1.59	1.96	519	553	3.7	37.5	41.2
DG [13]	325	465	26.5	39.9	77.0	82.0	765	808	650	696	1.62	2.07	529	545	0.7	2.9	3.7
DIP [47]	41	67	8.2	17.9	93.2	94.5	943	984	766	692	3.36	2.95	639	645	1.5	11.8	13.2
MSG [22]	211	170	15.1	10.4	76.5	86.9	762	856	671	723	2.11	2.31	577	609	0.7	0.0	0.7
DIP-v	38	62	6.2	12.9	82.5	88.7	677	768	457	496	1.31	1.43	409	448	90.4	5.1	95.6
MSG-v	113	89	10.8	5.9	65.3	72.5	663	752	577	635	1.07	1.15	462	480	0.0	5.9	5.9
	Jadeplant
Bicubic	1017	1297	19.3	18.4	68.8	75.7	695	788	545	696	0.97	0.62	449	464	2.3	27.7	30.0
SRfS [14]	105	143	39.5	48.9	87.2	92.7	787	839	637	719	1.96	1.70	551	583	0.0	0.0	0.0
PDN [39]	161	205	81.8	62.0	82.4	88.0	778	849	551	625	1.95	2.20	512	572	19.1	41.4	60.5
DG [13]	326	512	27.8	47.6	82.6	86.8	791	823	718	670	2.28	2.66	567	601	0.0	0.5	0.5
DIP [47]	70	121	16.7	32.8	91.2	92.3	913	911	735	764	3.19	3.21	615	638	0.0	0.5	0.5
MSG [22]	216	263	21.1	17.8	81.5	87.6	796	880	751	783	2.73	2.90	614	649	0.0	0.0	0.0
DIP-v	84	121	21.8	24.0	86.6	89.5	820	870	542	654	1.92	1.99	503	535	78.2	20.5	98.6
MSG-v	109	117	13.9	11.2	71.0	79.0	688	781	605	622	1.61	1.66	507	529	0.5	8.2	8.6

Equations20

ℓ = \frac{2 μ _{1} μ _{2}}{μ _{1}^{2} + μ _{2}^{2}}, c = \frac{2 σ _{1} σ _{2}}{σ _{1}^{2} + σ _{2}^{2}}, s = \frac{σ _{12}}{σ _{1} σ _{2}}, SSI M_{v} (I_{1}, I_{2}) = \frac{1}{N} ij \sum ℓ_{ij} \cdot c_{ij} \cdot s_{ij},

ℓ = \frac{2 μ _{1} μ _{2}}{μ _{1}^{2} + μ _{2}^{2}}, c = \frac{2 σ _{1} σ _{2}}{σ _{1}^{2} + σ _{2}^{2}}, s = \frac{σ _{12}}{σ _{1} σ _{2}}, SSI M_{v} (I_{1}, I_{2}) = \frac{1}{N} ij \sum ℓ_{ij} \cdot c_{ij} \cdot s_{ij},

N N_{v} (I_{1}, I_{2}) = ℓ \sum \frac{1}{H _{ℓ} W _{ℓ}} ij \sum ∥ x_{1 ℓ, ij} - x_{2 ℓ, ij} ∥_{2}^{2} .

N N_{v} (I_{1}, I_{2}) = ℓ \sum \frac{1}{H _{ℓ} W _{ℓ}} ij \sum ∥ x_{1 ℓ, ij} - x_{2 ℓ, ij} ∥_{2}^{2} .

RMS E_{v} (d_{1}, d_{2}) = MS E_{v} (d_{1}, d_{2}), MS E_{v} (d_{1}, d_{2}) = \frac{1}{3 N} ij, m \sum ∥ e_{m} \cdot n_{1, ij} - e_{m} \cdot n_{2, ij} ∥_{2}^{2},

RMS E_{v} (d_{1}, d_{2}) = MS E_{v} (d_{1}, d_{2}), MS E_{v} (d_{1}, d_{2}) = \frac{1}{3 N} ij, m \sum ∥ e_{m} \cdot n_{1, ij} - e_{m} \cdot n_{2, ij} ∥_{2}^{2},

L (d_{1}, d_{2}) = La p_{1} (d_{1}, d_{2}) + w \cdot MS E_{v} (d_{1}, d_{2}) .

L (d_{1}, d_{2}) = La p_{1} (d_{1}, d_{2}) + w \cdot MS E_{v} (d_{1}, d_{2}) .

d_{θ^{*}}^{SR} = CNN_{θ^{*}}, θ^{*} = arg min_{θ} MS E_{d} (D d_{θ}^{SR}, d^{LR}),

d_{θ^{*}}^{SR} = CNN_{θ^{*}}, θ^{*} = arg min_{θ} MS E_{d} (D d_{θ}^{SR}, d^{LR}),

d_{θ^{*}}^{SR} = CNN_{θ^{*}}^{(1)}, I_{θ} = CNN_{θ}^{(2)}, θ^{*} = arg min_{θ} MS E_{d} (D d_{θ}^{SR}, d^{LR}) + w_{I} \cdot La p_{1} (I_{θ}, I^{HR}),

d_{θ^{*}}^{SR} = CNN_{θ^{*}}^{(1)}, I_{θ} = CNN_{θ}^{(2)}, θ^{*} = arg min_{θ} MS E_{d} (D d_{θ}^{SR}, d^{LR}) + w_{I} \cdot La p_{1} (I_{θ}, I^{HR}),

\mathrm{BadPix_{d}}(\tau|d_{1},d_{2})=\frac{1}{N}\big{|}\big{\{}ij:\big{|}d_{1,ij}-d_{2,ij}\big{|}>\tau\big{\}}\big{|},

\mathrm{BadPix_{d}}(\tau|d_{1},d_{2})=\frac{1}{N}\big{|}\big{\{}ij:\big{|}d_{1,ij}-d_{2,ij}\big{|}>\tau\big{\}}\big{|},

\mathrm{BadPix_{d}}(\tau{\scriptstyle\%}|d_{1},d_{2})=\frac{1}{N}\big{|}\big{\{}ij:\big{|}\frac{d_{1,ij}-d_{2,ij}}{d_{2,ij}}\big{|}>\frac{\tau}{100}\big{\}}\big{|},

\mathrm{BadPix_{d}}(\tau{\scriptstyle\%}|d_{1},d_{2})=\frac{1}{N}\big{|}\big{\{}ij:\big{|}\frac{d_{1,ij}-d_{2,ij}}{d_{2,ij}}\big{|}>\frac{\tau}{100}\big{\}}\big{|},

Bumpines s_{d} (d_{1}, d_{2}) = \frac{1}{N} ij \sum min (0.05, ∥ H_{d_{1} - d_{2}} (i, j) ∥_{F}) \cdot 100,

Bumpines s_{d} (d_{1}, d_{2}) = \frac{1}{N} ij \sum min (0.05, ∥ H_{d_{1} - d_{2}} (i, j) ∥_{F}) \cdot 100,

RMS E_{v}^{1} (d_{1}, d_{2}) = \frac{1}{N} i, j \sum ∥ e \cdot n_{1, ij} - e \cdot n_{2, ij} ∥_{2}^{2} .

RMS E_{v}^{1} (d_{1}, d_{2}) = \frac{1}{N} i, j \sum ∥ e \cdot n_{1, ij} - e \cdot n_{2, ij} ∥_{2}^{2} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

voyleg/perceptual-depth-sr
pytorchOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Perceptual Deep Depth Super-Resolution

Oleg Voynov1, Alexey Artemov1, Vage Egiazarian1, Alexander Notchenko1,

Gleb Bobrovskikh1,2, Denis Zorin3,1, Evgeny Burnaev1

1Skolkovo Institute of Science and Technology, 2Higher School of Economics,

3New York University

{oleg.voinov, a.artemov, vage.egiazarian, alexandr.notchenko}@skoltech.ru,

[email protected], [email protected], [email protected]

Abstract

RGBD images, combining high-resolution color and lower-resolution depth from various types of depth sensors, are increasingly common. One can significantly improve the resolution of depth maps by taking advantage of color information; deep learning methods make combining color and depth information particularly easy.

However, fusing these two sources of data may lead to a variety of artifacts. If depth maps are used to reconstruct 3D shapes, e.g., for virtual reality applications, the visual quality of upsampled images is particularly important.

The main idea of our approach is to measure the quality of depth map upsampling using renderings of resulting 3D surfaces. We demonstrate that a simple visual appearance-based loss, when used with either a trained CNN or simply a deep prior, yields significantly improved 3D shapes, as measured by a number of existing perceptual metrics. We compare this approach with a number of existing optimization and learning-based techniques.

1 Introduction

RGBD images are increasingly common as sensor technology becomes more widely available and affordable. They can be used for reconstruction of the 3D shapes of objects and their surface appearance. The better the quality of the depth component, the more reliable the reconstruction.

Unfortunately, for most methods of depth acquisition the resolution and quality of the depth component is insufficient for accurate surface reconstruction. As the resolution of the RGB component is usually several times higher and there is a high correlation between structural features of the color image and the depth map (e.g., object edges) it is natural to use the color image for depth map super-resolution, i.e. upsampling of the depth map. Convolutional neural networks are a natural fit for this problem as they can easily fuse heterogeneous information.

A critical aspect of any upsampling method is the measure of quality it optimizes (i.e., the loss function), whether the technique is data-driven or not. In this paper we focus on applications that require reconstruction of 3D geometry visible to the user, like acquisition of realistic 3D scenes for virtual or augmented reality and computer graphics. In these applications the visual appearance of the resulting 3D shape, i.e., how the surface looks when observed under various lighting conditions, is of particular importance.

Most existing research on depth super-resolution is dominated by simple measures based on pointwise deviation of depth values. However, direct pointwise difference of the depth maps do not capture the visual difference between the corresponding 3D shapes: for example, low-amplitude high-frequency variations of depth may correspond to significant difference in appearance, while conversely, relatively large smooth changes in depth may be perceptually less relevant, as illustrated in Figure 1.

Hence, we propose to compare the rendered images of the surface instead of the depth values directly. In this paper we explore depth map super-resolution using a simple loss function based on visual differences. Our loss function can be computed efficiently and is shown to be highly correlated with more elaborate perceptual metrics. We demonstrate that this simple idea used with two deep learning-based RGBD super-resolution algorithms results in a dramatic improvement of visual quality according to perceptual metrics and an informal perceptual study. We compare our results with six state-of-the-art methods of depth super-resolution that are based on distinct principles and use several types of loss functions.

In summary, our contributions are as follows: (1) we demonstrate that a simple and efficient visual difference-based metric for depth map comparison can be, on the one hand, easily combined with neural network-based whole-image upsampling techniques, and, on the other hand, is correlated with established proxies for human perception, validated with respect to experimental measurements; (2) we demonstrate with extensive comparisons that with the use of this metric two methods of depth map super-resolution, one based on a trainable CNN and the other based on the deep prior, yield high-quality results as measured by multiple perceptual metrics. To the best of our knowledge, our paper is the first to systematically study the performance of visual difference-based depth super-resolution across a variety of datasets, methods, and quality measures, including a basic human evaluation.

Throughout the paper we use the term depth map to refer to the depth component of an RGBD image, and the term normal map to refer to the map of the same resolution with the 3D surface normal direction computed from the depth map at each pixel. Finally, the rendering of a depth map refers to the grayscale image obtained by constructing a 3D triangulation of the height field represented by the depth map, via computing the normal map from this triangulation, and rendering it using fixed material properties and a choice of lighting. This is distinct from a commonly used depth map visualization with grayscale values obtained from the depth values by simple scaling. We describe this in more detail in Section 3.

2 Related work

2.1 Image quality measures

Quality measures play two important roles in image super-resolution: on the one hand, they are used to formulate an optimization functional or a loss function, on the other hand, they are used to evaluate the quality of the results. Ideally, the same function should serve both purposes, however, in some instances it may be optimal to choose different functions for evaluation and optimization. While in the former case the top priority is to capture the needs of the application, in the latter case the efficiency of evaluation and differentiability are significant considerations.

In most works on depth map reconstruction and upsampling a limited number of simple metrics are used, both for optimization and final evaluation. Typically these are scaled $L_{2}$ or $L_{1}$ norms of depth deviations (see e.g. [9]).

Another set of measures introduced in [19, 20] and primarily used for evaluation, not optimization or learning, consists of heuristic measures of various aspects of the depth map geometry: foreground flattening/thinning, fuzziness, bumpiness, etc. Most of them require a very specific segmentation of the image for detection of flat areas and depth discontinuities.

Visual similarity measures, well-established in the area of photo-processing, aim to be consistent with human judgment, in the sense of similarity ordering (which of the two images is more similar to the ground truth?). The examples include (1) the metrics based on simple vision models of structural similarity SSIM [52], FSIM [57], MSSIM [53], (2) based on a sophisticated model of low-level visual processing [35], or (3) on convolutional neural networks (see [58] for a detailed overview). The latter use a simple distance measure on deep features learned for an image understanding task, e.g. $L_{2}$ distance on the features learned for image classification, and have been demonstrated to outperform statistical measures such as SSIM.

2.2 Depth super-resolution

Depth super-resolution is closely related to a number of depth processing tasks, such as denoising, enhancement, inpainting, and densification (e.g., [5, 6, 8, 21, 33, 34, 45, 46, 55]). We directly focus on the problem of super-resolution, or more specifically, estimation of high-resolution depth map from a single low-resolution depth map and a high-resolution RGB image.

Convolutional neural networks

have achieved most impressive performance among learning-based methods in high-level computer vision tasks and recently have been applied to depth super-resolution [22, 30, 39, 43]. One approach [22] is to resolve ambiguity in the depth map upsampling by explicitly adding high-frequency features from high-resolution RGB data. Another, hybrid approach [39, 43] is to add a subsequent optimization stage to a CNN to produce sharper results. Different approaches to CNN-based photo-guided depth super-resolution include linear filtering with CNN-derived kernels [26], deep fusion of time-of-flight depth and stereo images [1], and generative adversarial networks [62].

These techniques use either $L_{2}$ or $L_{1}$ norm of the depth differences as the basis of their loss functions, often combined with regularizers of different types. The recent approach of [62] is the closest to ours: it uses the difference of gradients as one of the loss terms to capture some of the visual information. For evaluation, these works report root mean square error (RMSE), mean absolute error (MAE), peak signal-to-noise ratio (PSNR), all applied directly to depth maps, and, rarely [4, 43, 44, 62], perceptual SSIM also applied directly to depth maps. In contrast, we propose to measure the perceptual quality of depth map renderings.

Dictionary learning

has also been investigated for depth super-resolution [11, 13, 29], however, compared to CNNs, it is typically restricted to smaller dimensions and as a result to structurally simpler depth maps.

Variational approach

aims to combine RGB and depth information explicitly by carefully designing an optimization functional, without relying on learning. Most relevant examples employ shape-from-shading problem statement for single-image [14] or multiple-image [38] depth super-resolution. These works include visual difference-related terms in the optimized functional and report normal deviation, capturing visual similarity. While showing impressive results in many cases, they typically require prior segmentation of foreground objects and depend heavily on the quality of such segmentation.

Another strategy to tackle ambiguities in super-resolution is to design sophisticated regularizers to balance the data-fidelity terms against a structural image prior [15, 24, 56]. In contrast to this approach, which requires custom hand-crafted regularized objectives and optimization procedures, we focus on the standard training strategy (i.e., gradient-based optimization of a CNN) while using a loss function that captures visual similarity.

Yet another approach is to choose a carefully-designed model such as [63] featuring a sophisticated metric defined in a space of minimum spanning trees and including an explicit edge inconsistency model. In contrast to ours, such model requires manual tuning of multiple hyperparameters.

2.3 Perceptual photo super-resolution

Perceptual metrics have been considered more broadly in the context of photo processing. While convolutional neural networks for photo super-resolution trained with simple mean square or mean absolute color deviation keep demonstrating impressive results [16, 18, 59, 60], it has been widely recognized that pixelwise difference of color image data is not well correlated with perceptual image difference. For this reason, relying on a pixelwise color error may lead to suboptimal performance.

One solution is to instead use the loss function represented by the deviation of the features from a neural network trained for an image understanding task [25]. This idea can be further combined with an adversarial training procedure to push the super-resolution result to the natural image manifold [28]. Another extension to this idea is to train the neural network to generate images with natural distribution of statistical features [12, 36, 50, 51]. To balance between the perceptual quality and pixelwise color deviation, generative adversarial networks can be used [7, 31, 49].

Another solution is to learn a quality measure from perceptual scores, collected from a human subject study, and use this quality measure as the loss function. Such quality measure may capture similarity of two images [58] or an absolute naturalness of the image [32].

3 Metrics

In this section, we discuss visually-based metrics and how they can be used to evaluate the quality of depth map super-resolution and as loss functions. The general principle we follow is to apply comparison metrics to renderings of the depth maps to obtain a measure of their difference instead of considering depth maps directly. The difficulty with this approach is that there are infinitely many possible renderings depending on lighting conditions, material properties and camera position. However, we demonstrate that even a very simple rendering procedure already yields substantially improved results. We label visually-based metrics with subscript “v” and the metrics that compare the depth values directly with subscript “d”.

From depth map to visual representation.

To approximate the appearance of a 3D scene depicted with a certain depth map we use a simple rendering procedure. We illuminate the corresponding 3D surface with monochromatic directional light source and observe it with the same camera that the scene was originally acquired with. We use the diffuse reflection model and do not take visibility into account. For this model, the intensity of a pixel $(i,j)$ of the rendering $I$ is proportional to cosine of the angle between the normal at the point of the surface corresponding to the pixel $\bm{n}_{ij}$ and direction to the light source $\bm{e}$ : $I_{ij}=\bm{e}\cdot\bm{n}_{ij}$ . We calculate the normals from the depth maps using first-order finite-differences. Any number of vectors $\bm{e}$ can be used to generate a collection of renderings representing the depth map, however, any rendering can be obtained as a linear combination of three basis ones corresponding to independent light directions. Renderings for different light directions are presented in Figure 2.

Perceptual metrics.

We briefly describe two representative metrics: a statistics-based DSSIM, and a neural network-based LPIPS. Either of these can be applied to three basis renderings (or a larger sample of renderings) and reduced to obtain the final value. While, in principle, they can also be used as loss functions, the choice of a loss function needs to take stability and efficiency into account, so we opt for a more conservative choice described below.

Structural similarity index measure (SSIM) [52] takes into account the changes in the local structure of an image, captured by statistical quantities computed on a small window around each pixel. For each pair of pixels of the compared images $I_{k}$ , $k=1,2$ the luminance term $\ell$ , the contrast term $c$ and the structural term $s$ , each normalized, are computed using the means $\mu_{k}$ , standard deviations $\sigma_{k}$ and cross-covariance $\sigma_{12}$ of the pixels in the corresponding local windows. The value of SSIM is then computed as pixelwise mean product of these terms

[TABLE]

where $N$ is the number of pixels. Dissimilarity measure can be computed as $\mathrm{DSSIM_{v}}(I_{1},I_{2})=1-\mathrm{SSIM_{v}}(I_{1},I_{2})$ .

Neural net-based metrics rely on the idea of measuring the distance between features extracted from a neural network. Specifically, feature maps $\bm{x}_{k\ell}$ , $\ell=1\ldots L$ with spatial dimensions $H_{\ell}\times W_{\ell}$ are extracted from $L$ layers of the network for each of the compared images. In the simplest case, the metric value is then computed as pixelwise mean square difference of the feature maps, summed over the layers

[TABLE]

Learned perceptual image patch similarity (LPIPS) [58] adds a learned channel-wise weighting to the above formula and uses 5 layers from Alexnet [27] or VGG [41] or the first layer from Squezenet [23] as the CNN of choice.

Our visual difference-based metric.

While the metrics described above are good proxies for human evaluation of difference between depth map renderings, they are lacking as loss functions due to their complex landscapes. Optimization with DSSIM as the loss function may produce the results actually inferior with respect to DSSIM itself compared to a simpler loss function we define below, as illustrated in Figure 3. LPIPS has a complex energy profile typical for neural networks, and having a neural network as the loss function for another may behave unpredictably [61].

The simplest metric capturing the difference between all possible renderings of the depth maps $d_{k}$ can be computed as the average root mean square deviation of three basis renderings $\bm{e}_{m}\cdot\bm{n}_{k}$ in an orthogonal basis $\bm{e}_{1},\bm{e}_{2},\bm{e}_{3}$

[TABLE]

similarly to RMS difference of the normal maps.

We found that this simple metric for depth map comparison is efficient and stable as the loss function and at the same time, as we demonstrate in Section 5, it is well correlated with DSSIM and LPIPS, i.e., situations when the value of one metric is high and the value of another is low are unlikely. Our experiments confirm that optimization of this metric also improves both perceptual metrics.

4 Methods

We selected eight representative state-of-the-art depth processing methods based on different principles: (1) a purely variational method [14], (2) a bilateral filtering method that uses a high-resolution edge map [54], (3) a dictionary learning method [13], (4) a hybrid CNN-variational method [39], (5) a pure CNN [22], (6) a zero-shot CNN [47], (7) a densification [34] and (8) an enhancement [55] CNNs. Our goals were (a) to modify the methods for using with the visual difference-based loss function, and (b) to compare the results of the modified methods with alternatives of different types. In our experiments the last two methods did not perform well compared to others, so we did not consider them further. We found that two neural network-based methods (5) and (6), that we refer to as MSG and DIP, can be easily modified for using with a visual difference-based loss function, as we explain now.

MSG

[22] is a deep learning method that uses different strategies to upsample different spectral components of low-resolution depth map. In the modified version of this method, that we denote by MSG-V, we replaced the original loss function with a combination of our visual difference-based metric and mean absolute deviation of Laplacian pyramid $\mathrm{Lap}_{1}$ [2] as a regularizer

[TABLE]

DIP

[47] is a zero-shot deep learning approach, based on a remarkable observation that, even without any specialized training, the structure of CNN itself may be leveraged for solving inverse problems on images. We note that this approach naturally allows simultaneous super-resolution and inpainting. In this approach, the depth super-resolution problem would be formulated as

[TABLE]

where $d^{\mathrm{LR}}$ and $d_{\theta^{*}}^{\mathrm{SR}}$ are the low-resolution and super-resolved depth maps, $\mathrm{CNN}_{\theta}$ is the output of the deep neural network parametrised by $\theta$ , $\bm{\mathrm{D}}$ is the downsampling operator, and $\mathrm{MSE_{d}}$ is direct mean square difference of the depth maps. To perform photo-guided super-resolution, we added a second output channel for intensity to the network

[TABLE]

where $I^{\mathrm{HR}}$ is the high-resolution photo guidance, and for visually-based version DIP-V we further replaced the direct depth deviation $\mathrm{MSE_{d}}$ with the function from Equation 4.

We used the remaining four methods (1)-(4) for comparison as-is, as modifying them for a different loss function would require substantial changes to the algorithms.

SRfS

[14] is a variational method relying on complimentarity of super-resolution and shape-from-shading problems. It already includes a visual-difference based term (the remaining methods use depth difference metrics).

EG

[54] approaches the problem via prediction of smooth high-resolution depth edges with Markov random field optimization. It does not use a loss directly, therefore cannot be easily adapted.

DG

[13] is a depth map enhancement method based on dictionary learning that uses depth difference-based fidelity term. It makes a number of modeling choices which may not be suitable for a different loss function, and typically does not perform as well as neural network-based methods.

PDN

[39] is a hybrid method featuring two stages: the first is composed of fully-convolutional layers and predicts a rough super-resolved depth map, and the second performs an unrolled variational optimization, aiming to produce a sharp and noise-free result.

5 Experiments

5.1 Data

For evaluation we selected a representative and diverse set of 34 RGBD images featuring synthetic, high-quality real and low-quality real data with different levels of geometric and textural complexity. We employed four datasets, most common in literature on depth super-resolution. ICL-NUIM [17] includes photo-realistic RGB images along with synthetic depth, free from any acquisition noise. Middlebury 2014 [40], captured with a structured light system, provides high-quality ground truth for complex real-world scenes. SUN RGBD [42] contains images captured with four different consumer-level RGBD cameras: Intel RealSense, Asus Xtion, Microsoft Kinect v1 and v2. ToFMark [10] provides challenging real-world time-of-flight and intensity camera acquisitions together with an accurate ground truth from a structured light sensor.

In addition, we constructed a synthetic SimGeo dataset, that consists of 6 geometrically simple scenes with low- and high-frequency texture, and without any, using Blender. The purpose of SimGeo were to reveal artifacts that are not related to the noise or high-frequency geometry in the input data, like false geometric detail caused by color variation on a smooth surface.

We resized and cropped each RGBD image to the resolution of $512\times 512$ and generated low-resolution input depth maps with the scaling factors of 4 and 8, that are most common among the works on depth super-resolution. We focused on two downsampling models: Box, i.e., each low-resolution pixel contains the mean value over the “box” neighbouring high-resolution pixels, and Nearest neighbour, i.e., each low-resolution pixel contains the value of the nearest high-resolution pixel. For additional details on our evaluation data and the results for different downsampling models please refer to supplementary material.

5.2 Evaluation details

To quantify the performance of the methods, we measured direct RMS deviation of the depth maps (denoted by $\mathrm{RMSE_{d}}$ ) and deviation of their renderings with the metrics described in Section 3. For visually-based metrics we calculated their values for three orthogonal light directions, corresponding to the three left-most images in Figure 2, and the value for an additional light direction, corresponding to the right-most image. We then took the worst of the four values. With similar outcomes, we also explored different reducing strategies and a set of different metrics: BadPix and Bumpiness, applied directly to depth values, and BadPix and RMSE applied to separate depth map renderings.

Additionally, we conducted an informal perceptual study using the results on SimGeo, ICL-NUIM and Middlebury datasets, in which subjects were asked to choose the renderings of the upsampled depth maps that look most similar to the ground truth.

5.3 Implementation details

We evaluated publicly available trained models for EG, DG, and MSG and trained PDN using publicly available code; we used the implementation of SRfS provided by the authors; we adapted publicly available implementation of DIP for depth maps, as described in Section 4; we reimplemented MSG-V in PyTorch [37] and trained it according to the original paper using the patches from Middlebury and MPI Sintel [3]. We selected the value of the weighting parameter $w$ in Equation (4) so that both terms of the loss contribute equally with respect to their magnitudes (see supplementary material for more details).

5.4 Comparison of quality measures

To quantify how well different metrics represent the visual quality of a super-resolved depth map, we compared pairwise correlations of these metrics and calculated the corresponding values of Pearson correlation coefficient. Since LPIPS as a neural network-based perceptual metric has been experimentally shown to represent human perception well, we used its value as the reference. We found that the metrics based on direct depth deviation demonstrate weak correlation with perceptual metrics, as illustrated in Figure 4 for $\mathrm{RMSE_{d}}$ , and hence are not suitable for measuring the depth map quality when the visual appearance plays an important role. On the other hand, we found that our $\mathrm{RMSE_{v}}$ correlates well with perceptual metrics, to the same extent they correlate with each other (see Figure 4).

5.5 Comparison of super-resolution methods

In Table 1 and Figure 5 we present the super-resolution results on our SimGeo dataset with the scaling factor of 4; in Table 2 and Figure 6 we present the results on ICL-NUIM and Middlebury datasets with the scaling factors of 4 and 8. We use Box downsampling model in both cases. Please find the additional results in supplementary material or online111mega.nz/#F!yvRXBABI!pucRoBvtnthzHI1oqsxEvA!y6JmCajS.

In general, we found that the methods EG, PDN and DG do not recover fine details of the surface, typically oversmoothing the result in comparison to, e.g., Bicubic upsampling, the methods SRfS and original DIP suffer from false geometry artifacts in case of a smooth textured surface, and original MSG introduces severe noise around the depth edges. As illustrated in Figure 6 and Table 2, all the methods from prior works perform relatively poorly on the images with regions of missing depth measurements (rendered in black), including the ones that inpaint these regions explicitly (SRfS, DG) or implicitly (DIP). The method EG failed to converge on some images.

In contrast, we observed that integration of our visual difference-based loss into DIP and MSG significantly improved the results of both methods qualitatively and quantitatively. The visual difference-based version DIP-V do not suffer from false geometry artifacts as much as the original version. On the challenging images from Middlebury dataset, where it performed simultaneous super-resolution and inpainting, DIP-V mostly outperformed other methods as measured by the perceptual metrics and was preferred by more than 80% of subjects in the perceptual study. The visual difference-based version MSG-V produces significantly less noisy results in comparison to the original version, in some cases almost without any noticeable artifacts. On the data without missing measurements, including hole-filled “Vintage” from Middlebury, MSG-V mostly outperformed other methods as measured by the perceptual metrics and was preferred by more than 80% of subjects. On SimGeo, ICL-NUIM and Middlebury combined, one of our modified versions, DIP-V or MSG-V, was preferred over the other methods by more than 85% of subjects.

For reference, in Figure 5 we include pseudo-color visualizations of the depth maps. Notice that while the upsampled depth maps obtained with different methods are almost indistinguishable in this form of visualization, commonly used in the literature on depth processing for qualitative evaluation, the corresponding renderings and, consequently, the underlying geometry varies dramatically.

6 Conclusion

We have explored depth map super-resolution with a simple visual difference-based metric as the loss function. Via comparison of this metric with a variety of perceptual quality measures, we have demonstrated that it can be considered a reasonable proxy for human perception in the problem of depth super-resolution with the focus on visual quality of the 3D surface. Via an extensive evaluation of several depth-processing methods on a range of synthetic and real data, we have demonstrated that using this metric as the loss function yields significantly improved results in comparison to the common direct pixel-wise deviation of depth values. We have combined our metric with relatively simple and non-specific deep learning architectures and expect that this approach will be beneficial for other related problems.

We have focused on the case of single regularly sampled RGBD images, but a lot of geometric data has less regular form. The future work would be to adapt the developed methodology to a more general sampling of the depth values for the cases of multiple RGBD images or point clouds annotated with a collection of RGB images.

Acknowledgements

The work was supported by The Ministry of Education and Science of Russian Federation, grant No. 14.615.21.0004, grant code: RFMEFI61518X0004.

The authors acknowledge the usage of the Skoltech CDISE HPC cluster Zhores for obtaining the results presented in this paper.

Supplementary material

Appendix A Additional evaluation details

In the literature on range image processing, the term depth is used to denote three different types of range data:

•

disparity, presented in, e.g., Middlebury dataset, i.e., the difference in image location of a feature within two stereo images;

•

orthogonal depth, presented in, e.g., SUN-RGBD dataset, i.e., the distance from a point in the 3D-space to the image plane;

•

perspective depth, presented in, e.g., low-resolution scans in ToFMark dataset, i.e., the distance from a point in the 3D-space to the camera.

We use the term depth map to denote any data of this kind, however, in our experiments we evaluated each super-resolution method on the range type that it was designed for. For evaluation of the disparity processing methods on the datasets that do not provide disparity maps, we calculated virtual disparity images with the baseline of 20 cm.

Here we describe the quality measures that we considered in addition to the ones discussed in the main text. We recall that we label the metrics that compare the depth values directly with subscript “d”, and the visually-based metrics with subscript “v”.

BadPix is the fraction of measurements with absolute deviation larger than a certain threshold $\tau$

[TABLE]

or the fraction of measurements with relative deviation larger than a threshold

[TABLE]

where $d_{1}$ and $d_{2}$ are the compared depth maps, $ij$ represents individual pixels, and $N$ is the number of pixels. We considered BadPix for depth map comparison with absolute thresholds of 1, 5, and 10 cm and relative thresholds of 1, 5, and 10%. We also considered this metric for comparison of depth map renderings with the absolute thresholds of 1, 5, and 10 each divided by 255 (which correspond to deviations by the respective numbers of shades of gray in 8-bit grayscale images).

Bumpiness, introduced in [20] for piece-wise planar regions and generalized in [19] for arbitrary smooth surfaces, is a measure of surface smoothness with respect to a reference. It is calculated as

[TABLE]

where $\lVert\cdot\rVert_{\mathrm{F}}$ is Frobenius norm and $\mathrm{H}_{f}(i,j)$ is the Hessian matrix of the function $f$ , calculated at point $(i,j)$ . We used the original implementation of this metric. Since this metric includes some parameter values, presumably, specific for the original evaluation dataset, we converted the depth maps to disparity using the camera intrinsics of this dataset.

We used the implementation of $\mathrm{SSIM}$ from scikit-image [48] and the original implementation of $\mathrm{LPIPS}$ from [58].

In addition to our $\mathrm{RMSE_{v}}$ we considered RMS difference of two rendered images without averaging over the basis renderings, i.e., calculated for a single lighting condition. We denote this metric as $\mathrm{RMSE_{v}^{1}}$ : for a light direction $\bm{e}$ and a pair of normal maps $\bm{n}_{1},\bm{n}_{2}$ it is calculated as

[TABLE]

Appendix B Comparison of quality measures

In Figures 11-16 we compare the relations between different subsets of quality measures. We present pair-wise correlations of the metrics in the form of scatter plots in the lower half of the figure and Pearson and Spearman correlation coefficients in the upper half of the figure. For reference, on the diagonal of the figure we also include kernel density estimates of metric value distributions for each super-resolution method. The distributions for the modified methods DIP-v and MSG-v are represented with the dashed black and solid black curves respectively.

On the depth maps with missing measurements, the methods that do not inpaint the regions with the missing measurements (including MSG-v) sometimes produced severe outliers around these regions. To minimize the influence of such outliers on the results of the metric comparison, we limited the value of $\mathrm{RMSE_{d}}$ to a maximum of 0.5 meters. Among the collected super-resolved images, 8% exceeded this threshold.

For each metric, applied to rendered images, we gathered the values of this metric for four different light directions, as described in Section 5.2 of the main text. We then calculated two additional values, the worst and the average of these four. We label the respective versions of the metric with suffixes $\bm{e}_{1}$ , $\bm{e}_{2}$ , $\bm{e}_{3}$ , $\bm{e}_{4}$ , $\mathrm{max}$ and $\mathrm{avg}$ . For each metric, we found that these six versions are strongly correlated, as illustrated in Figures 11-13, so we further focused on the worst value of each metric.

We also found that different versions of $\mathrm{RMSE_{v}^{1}}$ produce very similar results to our $\mathrm{RMSE_{v}}$ , as illustrated in Figure 11. It is consistent with the observation that if $\mathrm{RMSE_{v}}$ is bounded by a constant $C$ , then for any choice of the light direction $\bm{e}$ , $\mathrm{RMSE_{v}^{1}}$ is bounded by $C$ , which can be easily seen from the fact that $\mathrm{RMSE_{v}}$ does not depend on the choice of the basis, so we can choose one of the basis light directions to be equal to $\bm{e}$ .

In Figure 14 we compare the metrics of different types: pixel-wise $\mathrm{RMSE_{d}}$ , $\mathrm{BadPix_{d}(5cm)}$ and $\mathrm{BadPix_{d}(5\%)}$ applied to depth directly; “worst” versions of pixel-wise $\mathrm{BadPix_{v}(5)}$ and perceptual $\mathrm{DSSIM_{v}}$ and $\mathrm{LPIPS_{v}}$ , applied to rendered images; geometrical $\mathrm{Bumpiness_{d}}$ and our $\mathrm{RMSE_{v}}$ . We found that all three pixel-wise metrics applied to depth directly demonstrate weak correlation with visual and geometrical metrics. Pixel-wise $\mathrm{BadPix_{v}(5)}$ applied to rendered images, although strongly correlated with perceptual metrics, is inappropriate for gradient-based optimization. Additional comparison of pixel-wise $\mathrm{BadPix_{d}}$ and $\mathrm{BadPix_{v}}$ with different thresholds to perceptual $\mathrm{DSSIM_{v}}$ and $\mathrm{LPIPS_{v}}$ (Figures 15 and 16) leads to the same conclusions. $\mathrm{Bumpiness_{d}}$ is also strongly correlated with perceptual metrics but only measures local curvature deviation, while the visual appearance of 3D surface is determined by its local orientation.

Appendix C Comparison of super-resolution methods

In Tables 4-11 we present the results of quantitative evaluation of super-resolution methods on the datasets SimGeo, ICL-NUIM and Middlebury for Box downsampling model and scaling factors of 4 and 8. In Table 3 we present the average values. $\mathrm{RMSE_{d}}$ is in millimeters, $\mathrm{BadPix}$ is in percents, $\mathrm{DSSIM_{v}}$ , $\mathrm{LPIPS_{v}}$ and $\mathrm{RMSE_{v}}$ are in thousandths. For all visual metrics except $\mathrm{RMSE_{v}}$ the presented value is of the “worst” version. For all metrics the lower value corresponds to the better result. The best results are highlighted in bold and the second best results are underlined.

In addition to metric values, the last three columns of the tables contain the results of the informal perceptual study collected over approximately 250 subjects. In this study, for each scene from SimGeo, ICL-NUIM and Middlebury datasets subjects were shown the renderings of super-resolved depth maps, shuffled randomly, and were asked to choose the renderings, the most and second most similar to ground truth. The renderings calculated with the fourth light direction were used. The values in the columns “User, 1st”, “User, 2nd”, and “Top 2” represent the percentages of the subjects who chose the rendering of the super-resolved depth map, produced by the method in the corresponding method, as the most similar, second most similar, or one of the two most similar to the ground truth respectively. We found that our $\mathrm{RMSE_{v}}$ is mostly consistent with human judgements.

Appendix D Training with $\mathrm{MSE_{v}}$

Since optimization of $\mathrm{MSE_{v}}$ alone is an ill-posed problem, we used a regularization term that penalizes absolute depth deviation. We found that among different regularizers, including $\mathrm{MSE_{d}}$ , $\mathrm{Lap_{1}}$ produces the best results. In general, we found that optimization leads to the best results if the terms are weighted in such way that geometrically corresponding depth error and angular normal error result in the same magnitudes of terms. The corresponding value of the weighing parameter $w$ in Equation 4 of the main text is determined by the properties of the training data, such as depth map scaling or field of view of the camera.

Appendix E Noisy depth measurements in the input

SimGeo, ICL-NUIM and Middlebury datasets were our primary evaluation sets, yielding the most pronounced outcomes, however, these datasets contain only noise-free scenes. As we were interested in evaluation of our approach on a diverse set of RGBD images, we included twelve scenes from SUN RGBD dataset and three scenes from ToFMark dataset that feature real-world noise patterns in our evaluation data. We observed that increased levels of noise are extremely harmful to all non over-smoothing methods, including those modified with our loss, as they fail to produce reasonable super-resolution results, as illustrated in Figures 9-10. To demonstrate that this is not a limitation of our approach, in Figure 7 we present the super-resolution results produced by modified and unmodified versions of MSG, trained on the data with synthetic multiplicative gaussian noise.

Appendix F Different downsampling models

In Figure 8 we present the results for different downsampling models, used for calculation of low-resolution input. We found that the visual quality remains high when the downsampling model used during training and that of the input match; if this is not the case, the quality deteriorates, as expected.

Bibliography63

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Gianluca Agresti, Ludovico Minto, Giulio Marin, and Pietro Zanuttigh. Deep learning for confidence information in stereo and tof data fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 697–705, 2017.
2[2] Piotr Bojanowski, Armand Joulin, David Lopez-Pas, and Arthur Szlam. Optimizing the latent space of generative networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning , volume 80 of Proceedings of Machine Learning Research , pages 600–609, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
3[3] Daniel J. Butler, Jonas Wulff, Garrett B. Stanley, and Michael J. Black. A naturalistic open source movie for optical flow evaluation. In A. Fitzgibbon et al. (Eds.), editor, European Conf. on Computer Vision (ECCV) , Part IV, LNCS 7577, pages 611–625. Springer-Verlag, Oct. 2012.
4[4] Baoliang Chen and Cheolkon Jung. Single depth image super-resolution using convolutional neural networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages 1473–1477. IEEE, 2018.
5[5] Zhao Chen, Vijay Badrinarayanan, Gilad Drozdov, and Andrew Rabinovich. Estimating depth from rgb and sparse sensing. Co RR , abs/1804.02771, 2018.
6[6] Xinjing Cheng, Peng Wang, and Ruigang Yang. Depth estimation via affinity learned with convolutional spatial propagation network. In European Conference on Computer Vision , pages 108–125. Springer, Cham, 2018.
7[7] Manri Cheon, Jun-Hyuk Kim, Jun-Ho Choi, and Jong-Seok Lee. Generative adversarial network-based image super-resolution using perceptual content losses. In Laura Leal-Taixé and Stefan Roth, editors, ECCV Workshops , pages 51–62, Cham, 2019. Springer International Publishing.
8[8] Nathaniel Chodosh, Chaoyang Wang, and Simon Lucey. Deep convolutional compressed sensing for lidar depth completion. ar Xiv preprint ar Xiv:1803.08949 , 2018.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Perceptual Deep Depth Super-Resolution

Abstract

1 Introduction

2 Related work

2.1 Image quality measures

2.2 Depth super-resolution

Convolutional neural networks

Dictionary learning

Variational approach

2.3 Perceptual photo super-resolution

3 Metrics

From depth map to visual representation.

Perceptual metrics.

Our visual difference-based metric.

4 Methods

MSG

DIP

SRfS

EG

DG

PDN

5 Experiments

5.1 Data

5.2 Evaluation details

5.3 Implementation details

5.4 Comparison of quality measures

5.5 Comparison of super-resolution methods

6 Conclusion

Acknowledgements

Supplementary material

Appendix A Additional evaluation details

Appendix B Comparison of quality measures

Appendix C Comparison of super-resolution methods

Appendix D Training with MSEv\mathrm{MSE_{v}}MSEv​

Appendix E Noisy depth measurements in the input

Appendix F Different downsampling models

Appendix D Training with $\mathrm{MSE_{v}}$