Embodied Spatial Intelligence: from Implicit Scene Modeling to Spatial Reasoning

Jiading Fang

arXiv:2509.00465·cs.RO·September 3, 2025

Embodied Spatial Intelligence: from Implicit Scene Modeling to Spatial Reasoning

Jiading Fang

PDF

Open Access

TL;DR

This thesis advances embodied spatial intelligence by integrating implicit scene modeling and enhanced spatial reasoning in robots, enabling better perception and language-based decision-making in real-world environments.

Contribution

It introduces novel scene representations and spatial reasoning methods that bridge language models and physical embodiment for robotic perception and action.

Findings

01

Developed scalable implicit neural scene representations.

02

Created a new navigation benchmark for spatial reasoning.

03

Enhanced LLMs with grounding and feedback mechanisms.

Abstract

This thesis introduces "Embodied Spatial Intelligence" to address the challenge of creating robots that can perceive and act in the real world based on natural language instructions. To bridge the gap between Large Language Models (LLMs) and physical embodiment, we present contributions on two fronts: scene representation and spatial reasoning. For perception, we develop robust, scalable, and accurate scene representations using implicit neural models, with contributions in self-supervised camera calibration, high-fidelity depth field generation, and large-scale reconstruction. For spatial reasoning, we enhance the spatial capabilities of LLMs by introducing a novel navigation benchmark, a method for grounding language in 3D, and a state-feedback mechanism to improve long-horizon decision-making. This work lays a foundation for robots that can robustly perceive their surroundings and…

Tables19

Table 1. Table 2.1 : Mean reprojection error on EuRoC at 256 × 384 256\times 384 resolution for UCM, EUCM and DS models using (left) AprilTag-based toolbox calibration Basalt [ 2 ] and (right) our self-supervised learned (L) calibration. Note that despite using no ground-truth calibration targets, our self-supervised procedure produces sub-pixel reprojection error.

Method	Mean Reprojection Error
Method	Target-based	Learned
Pinhole	1.950	2.230
UCM [51]	0.145	0.249
EUCM [52]	0.144	0.245
DS [1]	0.144	0.344

Table 2. Table 2.2 : Intrinsic calibration evaluation of different methods on the EuRoC dataset, where B denotes intrinsics obtained from Basalt, and L denotes learned intrinsics.

Method	$f_{x}$	$f_{y}$	$c_{x}$	$c_{y}$	$α$	$β$	$ξ$	$w$
UCM (L)	237.6	247.9	187.9	130.3	0.631	—	—	—
UCM (B)	235.4	245.1	186.5	132.6	0.650	—	—	—
EUCM (L)	237.4	247.7	186.7	129.1	0.598	1.075	—	—
EUCM (B)	235.6	245.4	186.4	132.7	0.597	1.112	—	—
DS (L)	184.8	193.3	187.8	130.2	0.561	—	-0.232	—
DS (B)	181.4	188.9	186.4	132.6	0.571	—	-0.230	—

Table 3. Table 2.3 : EUCM perturbation test results. With perturbed initialization, all intrinsic parameters achieve sub-pixel convergence for mean reprojection error ( MRE ), with only a small offset to the Basalt calibration numbers.

Perturbation	$f_{x}$	$f_{y}$	$c_{x}$	$c_{y}$	$α$	$β$	MRE
$I_{1.10}$ init	242.3	253.6	189.5	130.7	0.5984	1.080	0.409
$I_{1.05}$ init	241.3	252.3	188.5	130.5	0.5981	1.078	0.367
$I_{c}$ init	240.2	251.4	187.9	130.0	0.5971	1.076	0.348
$I_{0.95}$ init	239.5	250.9	187.8	129.2	0.5970	1.076	0.332
$I_{0.90}$ init	238.8	249.6	187.7	129.1	0.5968	1.071	0.298
$I_{c}$	235.6	245.4	186.4	132.7	0.597	1.112	0.144

Table 4. Table 2.4 : Quantitative depth evaluation on the KITTI [ 3 ] dataset , using the standard Eigen split and the Garg crop, for distances up to 80m (with median scaling). K and L( ⋅ \cdot ) denote known and learned intrinsics, respectively. P means pinhole model.

Method	Camera	Abs Rel $↓$	Sq Rel $↓$	RMSE $↓$	$δ_{1.25}$ $↑$
Gordon et al. [4]	K	0.129	0.982	5.23	0.840
Gordon et al. [4]	L(P)	0.128	0.959	5.23	0.845
Vasiljevic et al. [44]	K(NRS)	0.137	0.987	5.33	0.830
Vasiljevic et al. [44]	L(NRS)	0.134	0.952	5.26	0.832
Ours	L(P)	0.129	0.893	4.96	0.846
Ours	L(UCM)	0.126	0.951	4.89	0.858

Table 5. Table 2.5 : Quantitative depth evaluation of different methods on the EuROC [ 3 ] dataset , using the evaluation procedure in [ 4 ] with center cropping. The training data consists of “Machine Room” sequences and the evaluation is on the ”Vicon Room 201” sequence (with median scaling). PN means plum-bob model.

Method	Camera	Abs Rel $↓$	Sq Rel $↓$	RMSE $↓$	$α_{1}$ $↑$
Gordon et al. [4]	PB	0.332	0.389	0.971	0.420
Vasiljevic et al. [44]	NRS	0.303	0.056	0.154	0.556
Ours	UCM	0.282	0.048	0.141	0.591
Ours	EUCM	0.278	0.047	0.135	0.598
Ours	DS	0.278	0.049	0.141	0.584

Table 6. Table 2.6 : Quantitative multi-dataset depth evaluation on EuRoC (without cropping and with median scaling).

Dataset	Abs Rel $↓$	Sq Rel $↓$	RMSE $↓$	$α_{1}$ $↑$	$α_{2}$ $↑$	$α_{3}$ $↑$
EuRoC [4]	0.265	0.042	0.130	0.600	0.882	0.966
EuRoC+KITTI	0.244	0.044	0.117	0.742	0.907	0.961

Table 7. Table 2.7 : Ablation study for ScanNet-Stereo , using different variations.

Variation		Lower is better $↓$			Higher is better $↑$
Variation		Abs. Rel	Sq. Rel	RMSE	$δ_{1.25}$	$δ_{{1.25}^{2}}$	$δ_{{1.25}^{3}}$
1	Depth-Only	0.098	0.046	0.257	0.902	0.972	0.990
2	w/ Conv. RGB encoder [85]	0.114	0.058	0.294	0.866	0.961	0.982
3	w/ 64-dim R18 RGB encoder	0.104	0.049	0.270	0.883	0.966	0.985
4	w/o camera information	0.229	0.157	0.473	0.661	0.874	0.955
5	w/o global rays encoding	0.097	0.047	0.261	0.897	0.962	0.988
6	w/ equal loss weights	0.095	0.047	0.259	0.908	0.968	0.990
7	w/ epipolar cues [17]	0.094	0.048	0.254	0.905	0.972	0.990
8	w/o Augmentations	0.117	0.060	0.291	0.870	0.959	0.981
9	w/o Virtual Cameras	0.104	0.058	0.268	0.891	0.965	0.986
10	w/o Canonical Jittering	0.099	0.046	0.261	0.897	0.970	0.988
11	w/o Canonical Randomization	0.096	0.044	0.253	0.905	0.971	0.989
	DeFiNe	0.093	0.042	0.246	0.911	0.974	0.991

Table 8. Table 2.8 : Depth estimation results on ScanNet and 7-Scenes . DeFiNe is competitive with other state-of-the-art methods on ScanNet, and outperforms all published methods in zero-shot transfer to 7-Scenes by a large margin.

Method	Abs. Rel $↓$	Sq. Rel $↓$	RMSE $↓$	Speed (ms) $↓$
ScanNet test split [125]
DeMoN [109]	0.231	0.520	0.761	110
MiDas-v2 [126]	0.208	0.318	0.742	-
BA-Net [125]	0.091	0.058	0.223	95
CVD [111]	0.073	0.037	0.217	2400
DeepV2D [79]	0.057	0.010	0.168	690
NeuralRecon [114]	0.047	0.024	0.164	30
DeFiNe ( $128 \times 192$ )	0.059	0.022	0.184	49
DeFiNe ( $240 \times 320$ )	0.056	0.019	0.176	78
Zero-shot transfer to 7-Scenes [91]
DeMoN [109]	0.389	0.420	0.855	110
NeuralRGBD [127]	0.176	0.112	0.441	202
DPSNet [104]	0.199	0.142	0.438	322
DeepV2D [79]	0.437	0.553	0.869	347
CNMNet [128]	0.161	0.083	0.361	80
NeuralRecon [114]	0.155	0.104	0.347	30
EST [17]	0.118	0.052	0.298	71
DeFiNe ( $128 \times 192$ )	0.100	0.039	0.253	49

Table 9. Table 2.9 : Blending results on Object-Centric Indoor Scenes. IDW-Sample works the best for all metrics with both ground-truth and estimated transformations. Results with estimated T ^ B A \hat{T}_{BA} are only marginally worse than those with ground-truth T B A T_{BA} , which demonstrates that our proposed NeRF registration is accurate enough for the downstream blending task.

Blending	Ground-truth $T_{B A}$			Estimated ${\hat{T}}_{B A}$
Blending	PSNR $↑$	SSIM $↑$	LPIPS $↓$	PSNR $↑$	SSIM $↑$	LPIPS $↓$
NeRF	20.92	0.716	0.369	20.90	0.714	0.370
Nearest (Block-NeRF)	23.81	0.779	0.283	23.68	0.774	0.287
IDW-2D (Block-NeRF)	24.70	0.795	0.267	24.64	0.792	0.267
IDW-3D (Block-NeRF)	23.48	0.776	0.279	23.45	0.772	0.280
IDW-Sample (Ours)	24.91	0.813	0.228	24.83	0.810	0.229

Table 10. Table 2.10 : Registration results on ScanNet. We compare to point-cloud registration methods on both NeRF-extracted point-cloud and ground-truth RGB-D-fusion point-cloud. Due to the noisy geometry of NeRF reconstructions, registration performance on NeRF-extracted point-clouds is inferior. However, NeRFuser is comparable to the registration performance on RGB-D-fused methods in terms of r err r_{\textrm{err}} and t err t_{\textrm{err}} , while having the highest success rate. Unlike point-cloud baselines, our method also recovers the relative scale. Bold numbers are the best, italic numbers are second best.

Registration	$r_{err} (^{\circ})$	$t_{err}$	$s_{err}$	Success
NeRF-extracted point-cloud
ICP [172]	3.027	0.1151	N/A	0.13
FGR [173]	4.549	0.1844	N/A	0.04
FPFH [174]	2.805	0.0381	N/A	0.17
RGB-D-fused point-cloud
ICP [172]	1.598	0.0816	N/A	0.17
FGR [173]	1.330	0.0372	N/A	0.71
FPFH [174]	0.049	0.0205	N/A	0.79
NeRFuser	0.588	0.0315	0.0211	0.84

Table 11. Table 2.11 : Blending results on Mission Bay dataset. IDW-Sample performs the best for all metrics.

Blending	PSNR $↑$	SSIM $↑$	LPIPS $↓$
NeRF	17.306	0.571	0.502
Nearest	19.070	0.657	0.398
IDW-2D	19.692	0.659	0.413
IDW-3D	18.806	0.636	0.433
IDW-Sample	19.986	0.678	0.388

Table 12. ((a)) Pairwise comparison on easy (lower left) and hard (higher right) DF questions.

Method	RWKV	Llama-2	Claude-1	Claude-2	GPT-3.5	GPT-4	$\bar{HARD} \|$
RWKV	*	0.20 — 0.24	0.19 — 0.41	0.19 — 0.51	0.19 — 0.33	0.19 — 0.62	*
Llama-2	0.43 — 0.20	*	0.24 — 0.41	0.24 — 0.45	0.24 — 0.31	0.24 — 0.66	*
Claude-1	0.74 — 0.19	0.78 — 0.41	*	0.36 — 0.44	0.38 — 0.32	0.36 — 0.57	*
Claude-2	0.82 — 0.19	0.85 — 0.41	0.81 — 0.72	*	0.44 — 0.32	0.44 — 0.58	*
GPT-3.5	0.59 — 0.19	0.61 — 0.42	0.57 — 0.74	0.57 — 0.83	*	0.32 — 0.59	*
GPT-4	0.86 — 0.19	0.90 — 0.42	0.84 — 0.72	0.83 — 0.81	0.86 — 0.57	*	*
— $\underline{EASY}$	*	*	*	*	*	*	*

Table 13. ((a)) Pairwise comparison on easy (lower left) and hard (higher right) DF questions.

Method	RWKV	Llama-2	Claude-1	Claude-2	GPT-3.5	GPT-4	$\bar{HARD} \|$
RWKV	*	0.20 — 0.24	0.19 — 0.41	0.19 — 0.51	0.19 — 0.33	0.19 — 0.62	*
Llama-2	0.43 — 0.20	*	0.24 — 0.41	0.24 — 0.45	0.24 — 0.31	0.24 — 0.66	*
Claude-1	0.74 — 0.19	0.78 — 0.41	*	0.36 — 0.44	0.38 — 0.32	0.36 — 0.57	*
Claude-2	0.82 — 0.19	0.85 — 0.41	0.81 — 0.72	*	0.44 — 0.32	0.44 — 0.58	*
GPT-3.5	0.59 — 0.19	0.61 — 0.42	0.57 — 0.74	0.57 — 0.83	*	0.32 — 0.59	*
GPT-4	0.86 — 0.19	0.90 — 0.42	0.84 — 0.72	0.83 — 0.81	0.86 — 0.57	*	*
— $\underline{EASY}$	*	*	*	*	*	*	*

Table 14. Table 3.2 : Grounding accuracy (%) on Nr3D and Sr3D. † denotes results from the official benchmarks while § denotes results reported in the respective papers. “P”: “with principles”, “NP”: “no principles”. All our models are equipped with interactive code generation. Transcrib3D with GPT-4 and general principles surpasses all baselines by a large margin.

	Nr3D			Sr3D
Method	Overall	Easy	Hard	Overall	Easy	Hard
SAT^† [254]	49.2	56.3	42.4	57.9	61.2	50.0
BUTD-DETR^† [179]	54.6	60.7	48.4	67.0	68.6	63.2
MVT^† [253]	59.5	67.4	52.7	64.5	66.9	58.8
ViL3DRel^§ [260]	64.4	70.2	57.4	72.8	74.9	67.9
3D-VisTA^§ [180]	64.2	72.1	56.7	76.4	78.8	71.3
Transcrib3D (GPT-3.5-NP)	33.8	42.6	25.0	79.3	82.8	70.7
Transcrib3D (GPT-3.5-P)	46.6	56.0	37.1	80.0	80.8	78.1
Transcrib3D (GPT-4-NP)	64.5	71.8	57.1	97.4	98.4	94.7
Transcrib3D (GPT-4-P)	70.2	79.7	60.3	98.4	99.2	96.2

Table 15. Table 3.3 : Grounding accuracy (%) on ScanRefer. “Full”: the full validation set of ScanRefer consisting of 5410 samples; “Part.”: a subset of 500 random samples from the validation set, which is the same for all methods; “Det.”: the 3D object detection module used in the model; “PG” stands for PointGroup [ 5 ] , while “M3D” stands for Mask3D [ 6 ] (where the detection accuracy is 56.7 for PG and 73.7 for M3D on the ScanNet dataset for the [email protected] metric [ 6 ] ), and “GT” for ground-truth bounding boxes. We test our method and re-run 3D-VisTA on the same subset of 500 samples with the Mask3D detector, GT bounding boxes, and an additional “+ Cam” setting, where camera view information provided by the ScanRefer dataset is also included in the scene transcript. Note that ScanRefer allows the use of all provided data modalities and ranks methods on the same benchmark regardless. The zero-shot nature of our method allows ease use of this extra information.

Method	Data	Det.	Overall
Method	Data	Det.	[email protected]	[email protected]
ViL3DRel [260]	Full	PG	47.9	37.7
3D-VisTA [180]	Full	M3D	50.6	45.8
3D-VisTA (re-run)	Full	M3D	50.7	45.9
3D-VisTA (re-run)	Part.	M3D	50.6	44.6
Transcrib3D	Part.	M3D	51.2	44.4
Transcrib3D + Cam	Part.	M3D	51.3	45.5
3D-VisTA (re-run)	Part.	GT	55.6	55.6
Transcrib3D	Part.	GT	62.0	62.0
Transcrib3D + Cam	Part.	GT	64.2	64.2

Table 16. Table 3.4 : Performance of fine-tuning models on NR3D. Fine-tuned GPT-3.5 models demonstrate a significant improvement in performance when compared to zero-shot models, closely approaching the capabilities of GPT-4. The model fine-tuned on self-corrected examples sees an increase in performance compared to that fine-tuned on only correct examples, particularly for hard queries. Notably, the fine-tuned models are not provided with rule-based prompts during both fine-tuning and inference time, suggesting that implicit decision rules are learned from examples.

Models	Total	Easy	Hard
GPT-3.5 (zero-shot)	33.8	42.6	24.0
GPT-3.5 (correct-only fine-tuning)	60.7	52.8	41.9
GPT-3.5 (self-correct fine-tuning)	61.5	54.6	80.0
GPT-4 (zero-shot)	69.4	78.7	60.0

Table 17. Table 3.5 : Episode success rates and individual step success rates (in parentheses) for each sequential task. † indicates that the context limit was often exceeded.

	Pick & Place	Disinfection	Weight
CaP	$0.00$ $(0.54)$	$0.00$ $(0.68)$ ^†	$0.00$ $(0.84)$
CaP+CoT	$0.25$ $(0.76)$	$0.00$ $(0.20)$ ^†	$0.30$ $(0.88)$
Statler (ours)	$0.50$ $(0.88)$	$0.40$ $(0.82)$ ^†	$0.55$ $(0.93)$

Table 18. Table 3.6 : Success rates of Code-as-Policies (CaP) and Statler for non-temporal and temporal queries.

	Non-temporal		Temporal
	CaP	Statler (ours)	CaP	Statler (ours)
Pick & Place	$1.00$ $(62 / 62)$	$1.00$ $(68 / 68)$	$0.31$ $(9 / 29)$	$0.83$ $(𝟒𝟖 / 𝟓𝟖)$
Disinfection	$0.99$ $(148 / 149)$	$0.98$ $(164 / 168)$	$0.05$ $(1 / 20)$	$0.65$ $(𝟏𝟓 / 𝟐𝟑)$
Weight	$1.00$ $(107 / 107)$	$1.00$ $(107 / 107)$	$0.00$ $(0 / 20)$	$0.55$ $(𝟏𝟏 / 𝟐𝟎)$

Table 19. Table 3.7 : Ablation episode (individual step) success rates.

	Pick & Place	Disinfection	Weight
Statler w/o State	$0.00$ $(0.54)$	$0.00$ $(0.68)$	$0.00$ $(0.84)$
Statler-Unified	$0.40$ $(0.85)$	$0.35$ $(0.79)$	$0.50$ $(0.92)$
Statler-Auto	$0.75$ $(0.88)$	$0.45$ $(0.82)$	$0.40$ $(0.90)$
Statler (ours)	$0.50$ $(0.88)$	$0.40$ $(0.82)$	$0.55$ $(0.93)$

Equations27

M : O \to T_{x} R^{3} .

M : O \to T_{x} R^{3} .

M = F \circ V,

M = F \circ V,

V : O \to T,

V : O \to T,

F : T \to A \subseteq T_{x} R^{3} .

F : T \to A \subseteq T_{x} R^{3} .

\hat{p}^{t} = π (\hat{R}^{t \to c} ϕ (p^{t}, \hat{d}^{t}, i) + \hat{t}^{t \to c}, i),

\hat{p}^{t} = π (\hat{R}^{t \to c} ϕ (p^{t}, \hat{d}^{t}, i) + \hat{t}^{t \to c}, i),

π (P, i) = [f_{x} \frac{x}{α d + ( 1 - α ) z} f_{y} \frac{y}{α d + ( 1 - α ) z}] + [c_{x} c_{y}]

π (P, i) = [f_{x} \frac{x}{α d + ( 1 - α ) z} f_{y} \frac{y}{α d + ( 1 - α ) z}] + [c_{x} c_{y}]

ϕ (p, \hat{d}, i) = \hat{d} \frac{ξ + 1 + ( 1 - ξ ^{2} ) r ^{2}}{1 + r ^{2}} m_{x} m_{y} 1 - 00 \hat{d} ζ

ϕ (p, \hat{d}, i) = \hat{d} \frac{ξ + 1 + ( 1 - ξ ^{2} ) r ^{2}}{1 + r ^{2}} m_{x} m_{y} 1 - 00 \hat{d} ζ

m_{x}

m_{x}

r^{2}

\textbf{o}_{j}=-\mathbf{R}_{j}\mathbf{t}_{j}\quad,\quad\textbf{r}_{ij}=\big{(}\mathbf{K}_{j}\mathbf{R}_{j}\big{)}^{-1}\left[u_{ij},v_{ij},1\right]^{T}+\textbf{t}_{j}

\textbf{o}_{j}=-\mathbf{R}_{j}\mathbf{t}_{j}\quad,\quad\textbf{r}_{ij}=\big{(}\mathbf{K}_{j}\mathbf{R}_{j}\big{)}^{-1}\left[u_{ij},v_{ij},1\right]^{T}+\textbf{t}_{j}

x \mapsto [x, sin (f_{1} π x), cos (f_{1} π x), \dots, sin (f_{K} π x), cos (f_{K} π x)]^{⊤}

x \mapsto [x, sin (f_{1} π x), cos (f_{1} π x), \dots, sin (f_{K} π x), cos (f_{K} π x)]^{⊤}

\mathcal{L}=\mathcal{L}_{d}+\lambda_{s}\mathcal{L}_{s}+\lambda_{v}\big{(}\mathcal{L}_{d,v}+\lambda_{s}(\mathcal{L}_{s,v})\big{)}

\mathcal{L}=\mathcal{L}_{d}+\lambda_{s}\mathcal{L}_{s}+\lambda_{v}\big{(}\mathcal{L}_{d,v}+\lambda_{s}(\mathcal{L}_{s,v})\big{)}

q_{d} = \int_{d}^{\infty} T (t) σ (t) d t,

q_{d} = \int_{d}^{\infty} T (t) σ (t) d t,

I^{(j)} = k \sum i \sum w_{i, k} \overset{p}{ˉ}_{i, k} \overset{ˉ}{c}_{i, k}

I^{(j)} = k \sum i \sum w_{i, k} \overset{p}{ˉ}_{i, k} \overset{ˉ}{c}_{i, k}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMultimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Spatial Cognition and Navigation

Full text

Embodied Spatial Intelligence: From 3D Perception to 3D Reasoning

Jiading Fang

Toyota Technological Institute at Chicago (TTIC)

**EMBODIED SPATIAL INTELLIGENCE: FROM IMPLICIT SCENE MODELING TO SPATIAL REASONING

** by

Jiading Fang

A thesis submitted

in partial fulfillment of the requirements for

the degree of

Doctor of Philosophy in Computer Science

at the

TOYOTA TECHNOLOGICAL INSTITUTE AT CHICAGO

Chicago, Illinois

August, 2024

Thesis Committee:

Matthew R. Walter (Thesis Advisor)

Gregory Shakhnarovich

Adrien Gaidon

Mac Schwager

Abstract

The central goal of robotics is to build autonomous agents that can serve alongside humans in the real, three-dimensional world. To be effective, these robots must observe their environment and act to change it according to human instructions, ideally delivered in natural language. This thesis defines the synthesis of these abilities—perceiving and acting within human-inhabited spaces through natural interaction—as ”Embodied Spatial Intelligence.” While recent advances in Large Language Models (LLMs) have brought the prospect of conversational robots closer to reality, significant challenges remain. Specifically, this work addresses two fundamental questions: how to build the right scene representation for environmental understanding, and how to create the right task representation for planning and action.

To develop more efficient and scalable representations for robot perception, Chapter 2 presents contributions that leverage implicit neural modeling. This work enhances the robustness, generalization, and scalability of scene representations through three primary contributions. Section 2.1 introduces a self-supervised learning method for camera self-calibration, providing robustness against disturbances to camera parameters. Section 2.2 details a novel approach for constructing continuous implicit depth fields that achieve state-of-the-art accuracy and extrapolation, enabled by new 3D data augmentation techniques. Finally, Section 2.3 demonstrates how to scale typically small-scale implicit representations to the level of buildings and city blocks using novel registration and blending methods.

To create better task representations that enable robots to plan and act upon observed spatial concepts, Chapter 3 investigates and facilitates the use of LLMs for spatial reasoning tasks. This chapter explores three key areas. Section 3.1 proposes a novel benchmark, generated from text-based games, to evaluate the mapping and navigation capabilities of LLMs. Section 3.2 presents a method that leverages LLMs to understand 3D spatial relationships from natural language referring expressions, using transcribed 3D scene descriptions. Lastly, Section 3.3 introduces a technique to improve long-horizon, LLM-based decision-making by incorporating a ”state” summary of action history, generated by a secondary LLM. Collectively, these contributions advance our understanding of how robots can comprehend spatial relationships, while also identifying key challenges and proposing effective solutions.

Acknowledgments

The journey toward completing this PhD has been the most intellectually challenging yet rewarding period of my life. It would not have been possible without the support, guidance, and encouragement of many individuals, to whom I am deeply grateful.

First and foremost, I would like to express my heartfelt thanks to my advisor, Professor Matthew Walter. You provided me with the invaluable opportunity to explore my academic interests and nurtured my growth as a researcher. Your unwavering support and trust have been the foundation of my development, allowing me the freedom to pursue my curiosity and inspiring my creativity. Your positive outlook on life and dedication to fostering a collaborative and friendly lab environment have made my PhD journey not only fruitful but also enjoyable. Your commitment to robotics education and its societal impact has deeply influenced my perspective on our responsibilities as researchers. I consider myself incredibly fortunate to have had you as my advisor.

I am also profoundly grateful to Professor Greg Shakhnarovich for his insightful discussions and mentorship in the field of computer vision. Your critical feedback has continually pushed my work to new heights, and your support throughout this journey has been invaluable.

A special thank you goes to my lab mates, who have made the past six years truly memorable. I am particularly thankful to Igor Vasiljevic, whose passion for 3D vision ignited my own interests in the field, and who has been a constant collaborator and source of inspiration. My gratitude also extends to Shengjie Lin, whose partnership in numerous projects around 3D perception and reasoning has deepened my understanding of the field. I would also like to thank Falcon Dai, Chip Schaff, Andrea Daniele, and Takuma Yoneda for their collaborative efforts and support in making our shared projects possible.

I would also like to extend my sincere thanks to my collaborators at Toyota Research Institute (TRI), Vitor Guizilini, Rares Ambrus, and Adrien Gaidon. Your insights and the invaluable resources you provided were instrumental in the success of many of my projects. I deeply appreciate your contributions and the opportunity to work alongside you.

I am deeply appreciative of all the TTIC faculty and staff who have provided mentorship and support over the years. You have created a unique, safe, and collaborative environment at TTIC, one that feels like a second home, allowing us to do our best intellectual work.

Last but certainly not least, I want to express my deepest gratitude to my parents, Yunhong Huang and Fengjie Fang, for their unconditional love and unwavering support throughout my life. You are the backbone that has enabled me to pursue my dreams, and for that, I am forever grateful.

List of Figures

List of Tables

Chapter 1 Introduction

Building intelligent robotic agents that are both mechanically capable and can understand their environments to operate reliably alongside humans is a long-standing goal of artificial intelligence. A system that achieves this capability may be described as possessing Embodied Spatial Intelligence. With the recent advent of large language models (LLMs) and their vision-language derivatives (VLMs), human–robot interaction has advanced markedly. However, significant challenges remain in robust spatial understanding, a critical component for safe and effective robot operation [7, 8, 9]. Addressing this gap is a central debate in modern robotics, with proposed solutions ranging from further scaling of 2D vision systems to the deep integration of 3D sensory signals. This thesis contributes to this debate by examining the fundamental role of 3D in the era of modern machine learning.

Robotics is the discipline of creating embodied agents that can intelligently interact with and modify the physical world—a three-dimensional space. At its core, a robotic system learns a mapping from observations to actions that alter the state of that world. Let the observation space be $\mathcal{O}$ (e.g., images, proprioception, language) and let $\mathrm{T}_{x}\mathbb{R}^{3}$ denote the tangent space of $\mathbb{R}^{3}$ at a point $x$ (representing, for instance, instantaneous twists in $\mathrm{SE}(3)$ or velocity fields). The fundamental problem is to learn a mapping:

[TABLE]

Because actions ultimately operate in 3D, effective robotic behavior necessitates 3D understanding. The key research question is therefore where and how such understanding should be embedded within the system.

This thesis decomposes the mapping $\mathcal{M}$ into two consecutive functions, $\mathcal{V}$ and $\mathcal{F}$ :

[TABLE]

where the first, the perception map $\mathcal{V}$ , constructs a 3D representation in a target space $\mathcal{T}\subseteq\mathbb{R}^{3}$ from observations,

[TABLE]

and the second, the action map $\mathcal{F}$ , leverages this representation to produce actions in an action space $\mathcal{A}$ ,

[TABLE]

This decomposition structures our investigation into two parts: first, building the 3D representation, and second, using it to act.

1.1 From 2D to 3D: Building Implicit Scene Representations

Historically, the robotics community has pursued 3D understanding through both hardware and software. On the hardware side, advances in 3D sensing (e.g., structured light, LiDAR, event cameras) have made direct 3D acquisition increasingly accessible. On the software side, strong geometric biases are often encoded into models, from point-based networks like PointNet [10] to volumetric and sensor-fusion methods [11, 12, 13, 14]. In contrast, biological perception relies primarily on two 2D sensors (our eyes) and motion, from which depth is recovered via binocular and monocular cues.

Drawing inspiration from this biological model, this section investigates the perception map $\mathcal{V}$ for the case where observations are 2D images ( $\mathcal{O}\subseteq\mathbb{R}^{2}$ ) and the target representation is 3D ( $\mathcal{T}\subseteq\mathbb{R}^{3}$ ). We examine monocular depth estimation, the learnability of general 3D inductive biases, and the challenge of creating scalable scene representations.

1.1.1 Monocular Depth Estimation: A Canonical 2D–3D Mapping

A canonical instance of $\mathcal{V}$ is monocular depth estimation, where a single RGB image is mapped to a depth image. From the principles of projective geometry, this problem is ill-posed: a single 2D projection can correspond to an infinite family of 3D scenes. In practice, however, the problem is not intractable.

Evidence comes from individuals with monocular vision. Following the loss of stereopsis, the human visual system often adapts through neuroplasticity. While near-field depth perception is initially impaired, it is often recovered over time by learning to exploit monocular cues such as perspective, occlusion, and motion parallax. This natural existence proof motivates machine learning approaches to monocular depth.

The classical objection is that learning-based methods solve an ill-posed problem for which multi-view geometry provides a geometrically guaranteed solution. However, the objective is not to recover all geometrically possible scenes, but rather to learn the statistical regularities of scenes that typically occur in the physical world. This ”typical” set occupies a lower-dimensional manifold within the space of all possible scenes—an observation that underlies many modern learning problems. This principle was powerfully demonstrated by self-supervised methods; for example, the foundational work of Monodepth2 [15] showed that depth and camera pose could be learned jointly from monocular video sequences alone. Inspired by this self-supervised paradigm, we extended this approach to jointly estimate camera parameters alongside monocular depth—an even more under-constrained problem that nonetheless proves effective in practice (Section 2.1).

1.1.2 Are General 3D Inductive Biases Learnable?

While self-supervised monocular depth methods perform well, they are limited by photometric assumptions and the instability of co-training pose networks. Consequently, many 3D vision models have instead relied on strong, hand-engineered inductive biases (e.g., cost volumes). Although geometrically precise and data-efficient, these biases can limit robustness to real-world noise and scenarios where classical geometry is itself ill-posed (e.g., non-overlapping frames).

The recent availability of large, high-quality 3D datasets has made direct supervised training a viable alternative. This raises a fundamental question: are general 3D inductive biases learnable from data? In other words, can a single architecture learn to perform diverse geometric tasks like depth estimation, pose estimation, and novel view synthesis from an arbitrary number of input views?

Our work on the Depth Field Network (DeFiNe) [16] explores this question. Inspired by architectures that replace hard-coded geometric priors with learnable transformers [17], DeFiNe ingests image sequences and learns a latent scene embedding that can be decoded into multiple outputs, including depth, color, and novel views. A follow-up, DeLiRa [18], extended this paradigm to show that co-training on light-field data improves the quality of all decoded fields. Compared to models with strong inductive biases, DeFiNe exhibits superior zero-shot generalization across domains (Section 2.2).

This trend has recently culminated in models like VGGT [19], which demonstrates that a largely vanilla transformer trained on massive 3D datasets can serve as a single backbone for recovering multiple geometric signals. This provides a concrete instance of the “bitter lesson” in the 3D domain: with sufficient scale, much of the useful inductive bias previously thought to require specialized architectures is, in fact, learnable.

1.1.3 Scalable Scene Representations

Robots must operate in environments that range from connected indoor spaces to unbounded outdoor areas. While explicit 3D representations like voxel grids can readily aggregate information into global maps, implicit neural representations are typically capacity-limited and lack established aggregation mechanisms. This limitation affects both generalizable methods like DeFiNe and optimization-based approaches like NeRF [20].

We address this scalability challenge with NeRFuser [21], a framework that aligns partially overlapping, independently trained NeRFs and renders the joint space by blending their contributions. This method offers a practical path toward representing large-scale scenes without requiring a single monolithic model. As detailed in Section 2.3, the distributed nature of NeRFuser also enables applications like asynchronous, privacy-preserving map-building across a fleet of robots [22].

1.2 Acting in 3D: Task-Specific Requirements

The perception map $\mathcal{V}$ yields a 3D representation of the world. The action map $\mathcal{F}$ must translate this representation into physical action. The specific requirements for this map depend on the task. We categorize embodied tasks along two axes—response time and spatial range—which highlight different demands on the underlying 3D system:

•

Short Response, Short Range (e.g., manipulation). Requires precise, local 3D cues for high-frequency control.

•

Short Response, Long Range (e.g., autonomous driving). Demands fast reaction to distant objects, often relying on monocular cues.

•

Long Response, Long Range (e.g., indoor navigation). Favors persistent, global world models for long-horizon planning.

•

Long Response, Short Range (e.g., multi-step task planning). Places high demands on memory, state tracking, and reliable spatial grounding.

These four regimes motivate the technical contributions of this thesis.

Short Response, Short Range (Manipulation).

Here, “short range” refers to the workspace reachable by a manipulator, and “short response” denotes control frequencies above $\sim 10$ Hz. Tasks include tabletop manipulation, where objects are often fragile and grippers may lack tactile sensing, necessitating precise, high-rate control. Such systems commonly employ stereo cameras, often augmented with depth sensors. We find that explicit 3D grounding improves reliability, particularly for object selection and disambiguation in clutter. Chapter 3.2 (Transcrib3D; [23]) shows that transcribing a local 3D scene into structured text enables tool-augmented reasoning for referring-expression resolution, a key precursor to manipulation.

Short Response, Long Range (Autonomous Driving).

High-speed motion requires rapid reactions. At typical highway following distances, perception is dominated by monocular cues, even for binocular systems. This aligns with the observation that humans with monocular vision can judge long distances well [24]. Consequently, modern autonomous driving systems often prioritize 360∘ camera coverage over large stereo overlaps. Nonetheless, precise 3D remains crucial for centimeter-level localization, long-range obstacle avoidance, and robustness in adverse conditions.

Long Response, Long Range (Navigation and Exploration).

Many applications involve long-horizon planning without stringent real-time constraints. In indoor navigation, low-speed motion and reliable local obstacle avoidance shift the primary challenge from control to planning. World representations can range from simple topological graphs to dense 3D maps, depending on task requirements. To probe whether language models can perform such tasks using textual history alone, we developed MANGO [25], a benchmark for navigation in text-based environments. We find that even capable LLMs perform poorly on tasks trivial for humans, suggesting a need for explicit mapping and reasoning capabilities (Section 3.1).

Long Response, Short Range (Multi-step Task Planning).

High-level, multi-stage tasks have traditionally been tackled with discrete optimization, often assuming fully observable state. In practice, specifications are often given in natural language and state estimation is uncertain. While LLM-based planners like SayCan [26] have demonstrated impressive capabilities, they are prone to hallucination over long horizons. To address this, we propose Statler [27], a dual-LLM architecture in which one agent maintains an explicit world state while another proposes short-term plans based on that state (Chapter 3.3).

1.3 Thesis Scope and Roadmap

This thesis develops methods and systems along two primary axes:

Implicit, robust, and scalable 3D scene representations that improve generalization across domains and robustness to calibration error. 2. 2.

Hybrid systems for persistent spatial reasoning that reveal the limitations of current LLMs and introduce new architectures for maintaining state over long horizons.

Chapter 2.1 presents implicit camera self-calibration via self-supervised learning [28]. Chapter 2.2 introduces DeFiNe, a continuous depth field designed for generalization [16]. Chapter 2.3 addresses the scaling of implicit representations via NeRF registration and blending [21].

On the reasoning axis, Chapter 3.1 benchmarks the mapping and navigation abilities of LLMs, highlighting significant gaps relative to human performance [25]. Chapter 3.2 proposes Transcrib3D, which translates 3D scenes into text to enable tool-augmented, iterative reasoning [23]. Chapter 3.3 introduces Statler, a dual-LLM system that maintains explicit world state for robust, long-horizon task execution [27].

Together, these contributions support three central claims: (a) robust 3D understanding is essential for embodied agents; (b) many of the inductive biases required for this understanding are learnable at scale; and (c) a practical path toward embodied spatial intelligence lies in strategically integrating high-quality 3D structure with large-scale 2D foundation models. (A broader discussion of this synthesis appears in the Conclusion.)

Chapter 2 Robotic Scene Representations from Implicit Modeling

Ever since the introduction from Euclid, people has been using geometry to describe the world we see. However, real world scenes are usually too complex to be regular, which means analytical solutions are almost impossible to find. To be able to describe the real world, people started to use sampling methods to create discrete representations, or explicit scene representations. Typical examples include pointcloud and meshes as discrete sets of points and (points and edges), which are also widely used in robotic applications [29, 30, 31].

More recently, the maturity of neural network training has made it possible to directly model the physical scene as a continuous function over the space, or implicit neural representations. Typical examples include NeRF [32] or Neural-SDF [33]. The continuous nature of implicit neural representations forgoes the limitation of resolution, which not only promises better accuracy, but more importantly, enables data-driven training approaches. A good example is the application of self-supervised learning to monocular depth estimation creating orders of magnitude more data for training to achieve SOTA results [15].

This advancement also greatly benefit the area of robotics. Scene representations are products of perception systems, and are prerequisites to spatially intelligent robotic agents. Their jobs are to form structured information from sequences or sets of 2D camera captures to represent the 3D scene, geometrically and semantically. Unlike more traditional methods that work in settings with ideal conditions, scene representations that built for robotic systems have extra sets of requirements to work for complicated conditions that could happen in real-world applications. Such complexity is hard to model with explicit geometric constraints, but easier with implicit modeling with data-driven approaches.

For example, the camera parameters could drift from its calibrated values during driving, thus we need a system that is robust to such perturbations; When executing tasks, robots can be asked to travel long distance or over large scale scenes, thus it requires the scene representation to be scalable; Robots are expected to work in a variety of environments, thus the learning system for building the scene representation should be generalizable, meaning that the performance remain high even when the test environments are different to the training data. In the following section, I will present my paper to addressed the ”Robustness” [28], ”Generalizability” [16], ”Scalability” [21] problems respectively.

2.1 Implicit camera estimation for robust scene understanding: Self-supervised camera self-calibration

Camera calibration is integral to robotics and computer vision algorithms that seek to infer geometric properties of the scene from visual input streams. In practice, calibration is a laborious procedure requiring specialized data collection and careful tuning. This process must be repeated whenever the parameters of the camera change, which can be a frequent occurrence for mobile robots and autonomous vehicles. In contrast, self-supervised depth and ego-motion estimation approaches can bypass explicit calibration by inferring per-frame projection models that optimize a view-synthesis objective. In this paper, we extend this approach to explicitly calibrate a wide range of cameras from raw videos in the wild. We propose a learning algorithm to regress per-sequence calibration parameters using an efficient family of general camera models. Our procedure achieves self-calibration results with sub-pixel reprojection error, outperforming other learning-based methods. We validate our approach on a wide variety of camera geometries, including perspective, fisheye, and catadioptric. Finally, we show that our approach leads to improvements in the downstream task of depth estimation, achieving state-of-the-art results on the EuRoC dataset with greater computational efficiency than contemporary methods.

2.1.1 Introduction

Cameras provide rich information about the scene, while being small, lightweight, inexpensive, and power efficient. Despite their wide availability, camera calibration largely remains a manual, time-consuming process that typically requires collecting images of known targets (e.g., checkerboards) as they are deliberately moved in the scene [34]. While applicable to a wide range of camera models [35, 36, 37], this process is tedious and has to be repeated whenever the camera parameters change. A number of methods perform calibration “in the wild” [38, 39, 40]. However, they rely on strong assumptions about the scene structure, which cannot be met during deployment in unstructured environments. Learning-based methods relax these assumptions, and regress camera parameters directly from images, either by using labelled data for supervision [41] or by extending the framework of self-supervised depth and ego-motion estimation [42, 43] to also learn per-frame camera parameters [4, 44].

While these methods enable learning accurate depth and ego-motion without calibration, they are either over-parameterized [44] or limited to near-pinhole cameras [4]. In contrast, we propose a self-supervised camera calibration algorithm capable of learning expressive models of different camera geometries in a computationally efficient manner. In particular, our approach adopts a family of general camera models [1] that scales to higher resolutions than previously possible, while still being able to model highly complex geometries such as catadioptric lenses. Furthermore, our framework learns camera parameters per-sequence rather than per-frame, resulting in self-calibrations that are more accurate and more stable than those achieved using contemporary learning methods. We evaluate the reprojection error of our approach compared to conventional target-based calibration routines, showing comparable sub-pixel performance despite only using raw videos at training time.

Our contributions can be summarized as follows:

•

We propose to self-calibrate a variety of generic camera models from raw video using self-supervised depth and pose learning as a proxy objective, providing for the first time a calibration evaluation of camera model parameters learned purely from self-supervision.

•

We demonstrate the utility of our framework on challenging and radically different datasets, learning depth and pose on perspective, fisheye, and catadioptric images without architectural changes.

•

We achieve state-of-the-art depth evaluation results on the challenging EuRoC MAV dataset by a large margin, using our proposed self-calibration framework.

2.1.2 Related Work

Camera Calibration. Traditional calibration for a variety of camera models uses targets such as checkerboards or AprilTags to generate 2D-3D correspondences, which are then used in a bundle adjustment framework to recover relative poses as well as intrinsics [34, 45]. Targetless methods typically make strong assumptions about the scene, such as the existence of vanishing points and known (Manhattan world) scene structure [38, 39, 40]. While highly accurate, these techniques require a controlled setting and manual target image capture to re-calibrate. Several models are implemented in OpenCV [46], kalibr [47]. These methods require specialized settings to work, limiting their generalizability.

Camera Models. The pinhole camera model is ubiquitous in robotics and computer vision [48, 49] and is especially common in recent deep learning architectures for depth estimation [43]. There are two main families of models for high-distortion cameras. The first is the “high-order polynomial” distortion family that includes pinhole radial distortion [50], omnidirectional [35], and Kannala-Brandt [36]. The second is the “unified camera model” family that includes the Unified Camera Model (UCM) [51], Extended Unified Camera Model (EUCM) [52], and Double Sphere Camera Model (DS) [1]. Both families are able to achieve low reprojection errors for a variety of different camera geometries [1], however the unprojection operation of the “high-order polynomial” models requires solving for the root of a high-order polynomial, typically using iterative optimization, which is a computationally expensive operation. Further, the process of calculating gradients for these models is non-trivial. In contrast, the “unified camera model” family has an easily computed, closed-form unprojection function. While our framework is applicable to high-order polynomial models, we choose to focus on the unified camera model family in this paper.

Learning Camera Calibration. Work in learning-based camera calibration can be divided into two types: supervised approaches that leverage ground-truth calibration parameters or synthetic data to train single-image calibration regressors; and self-supervised methods that utilize only image sequences. Our proposed method falls in the latter category, and aims to self-calibrate a camera system using only image sequences. Early work on applying CNNs to camera calibration focused on regressing the focal length [53] or horizon lines [54]; synthetic data was used for distortion calibration [55] and fisheye rectification [56]. Using panorama data to generate images with a wide variety of intrinsics, Lopez et al. [57] are able to estimate both extrinsics (tilt and roll) and intrinsics (focal length and radial distortion). DeepCalib [41] takes a similar approach: given a panoramic dataset, generate projections with different focal lengths. Then, they train a CNN to regress from a set of synthetic images $I$ to their (known) focal lengths $f$ . Typically, training images are generated by taking crops of the desired focal lengths from $360$ degree panoramas [58, 59]. While this can be done for any kind of image, and does not require image sequences, it does require access to panoramic images. Furthermore, the warped “synthetic” images are not the true 3D-2D projections. This approach has been extended to pan-tilt-zoom [60] and fisheye [56] cameras. Methods also exist for specialized problems like undistorting portraits [61], monocular 3D reconstruction [62], and rectification [63, 64].

After the publication of this paper, [65] extends our method, and improves the performance by adding a differentiable bundle-adjustment layer. On the other hand, leveraging the same principle, differentiable camera models have also been extended to use in extrinsic calibration [66].

Self-Supervised depth and ego-motion. Self-supervised learning has also been used to learn camera parameters from geometric priors. Gordon et al. [4] learn a pinhole and radial distortion model, while Vasiljevic et al. [44] learn a generalized central camera model applicable to a wider range of camera types, including catadioptric. These methods both learn calibration on a per-frame basis, and do not offer a calibration evaluation of their learned camera model. Furthermore, while Vasiljevic et al. [44] is much more general than Gordon et al. [4], it is limited to fairly low resolutions by the complex and approximate generalized projection operation. In our work, we trade some degree of generality (i.e., a global, central vs. per-pixel model) for a closed-form and efficient projection operation and ease of calibration evaluation.

2.1.3 Methodology

First, we describe the self-supervised monocular depth learning framework that we use as proxy for self-calibration. Then we describe the family of unified camera models we consider and how we learn their parameters end-to-end.

2.1.3.1 Self-Supervised Monocular Depth Estimation

Self-supervised depth and ego-motion architectures consist of a depth network that produces depth maps $\hat{D}_{t}$ for a target image $I_{t}$ , as well as a pose network that predicts the relative rigid-body transformation between target $t$ and context $c$ frames, $\hat{\bm{X}}^{t\to c}=\begin{pmatrix}{\hat{\bm{R}}^{t\to c}}&{\hat{\bm{t}}^{t\to c}}\\ \bm{0}&\bm{1}\end{pmatrix}\in\text{SE(3)}$ . We train the networks jointly by minimizing the photometric reprojection error between the actual target image $I_{t}$ and a synthesized image $\hat{I}_{t}$ generated by projecting pixels from the context image $I_{c}$ (usually preceding or following $I_{t}$ in a sequence) onto the target image $I_{t}$ using the predicted depth map $\hat{D}_{t}$ and ego-motion $\hat{\bm{X}}^{t\to c}$ [43]. See Figure 2.2 for an overview. The general pixel-warping operation is defined as:

[TABLE]

where $\bm{i}$ are camera intrinsic parameters modeling the geometry of the camera, which is required for both projection of 3D points $\bm{P}$ onto image pixels $\bm{p}$ via $\pi(\bm{P},\bm{i})=\bm{p}$ and unprojection via $\phi(\bm{p},\hat{d},\bm{i})=\bm{P}$ assuming an estimated pixel depth of $\hat{d}$ . The camera parameters $\bm{i}$ are generally the standard pinhole model [45] defined by the $3\times 3$ intrinsic matrix $\bm{K}$ , but can include any differentiable model such as the Unified Camera Model family [1] as described next.

2.1.3.2 End-to-End Self-Calibration

UCM [51] is a parametric global central camera model that uses only five parameters to represent a diverse set of camera geometries, including perspective, fisheye, and catadioptric. A 3D point is projected onto a unit sphere and then projected onto the image plane of a pinhole camera, shifted by $\frac{\alpha}{1-\alpha}$ from the center of the sphere (Fig. 2.3). EUCM and DS are two extensions of the UCM model. EUCM replaces the unit sphere with an ellipse as the first projection surface, and DS replaces the one unit sphere with two unit spheres in the projection process. We self-calibrate all three models (in addition to a pinhole baseline) in our experiments. For brevity, we only describe the original UCM and refer the reader to Usenko et al. [1] for details on the EUCM and DS models.

There are multiple parameterizations for UCM [51], and we use the one from Usenko et al. [1] since it has better numerical properties. UCM extends the pinhole camera model $(f_{x},f_{y},c_{x},c_{y})$ with only one additional parameter $\alpha$ . The 3D-to-2D projection of $\bm{P}=(x,y,z)$ is defined as

[TABLE]

where the camera parameters are $\bm{i}=(f_{x},f_{y},c_{x},c_{y},\alpha)$ and $d=\sqrt{x^{2}+y^{2}+z^{2}}$

The unprojection operation of pixel $\bm{p}=(u,v,1)$ at estimated depth $\hat{d}$ is:

[TABLE]

where

[TABLE]

As shown in Equations 2.2 and 2.3, the UCM camera model provides closed-form projection and unprojection functions that are both differentiable. Therefore, the overall architecture is end-to-end differentiable with respect to both neural network parameters (for pose and depth estimation) and camera parameters. This enables learning self-calibration end-to-end from the aforementioned view synthesis objective alone. At the start of self-supervised depth and pose training, rather than pre-calibrating the camera parameters, we initialize the camera with “default” values based on image shape only (for a detailed discussion of the initialization procedure, please see Section 2.1.4.5). Although the projection (2.2) and unprojection (2.3) are initially inaccurate, they quickly converge to highly accurate camera parameters with sub-pixel re-projection error (see Table 2.1).

As we show in our experiments, our method combines flexibility with computational efficiency. Indeed, our approach enables learning from heterogeneous datasets with potentially vastly differing sensors for which separate parameters $\bm{i}$ are learned. As most of the parameters (in the depth and pose networks) are shared thanks to the decoupling of the projection model, this enables scaling up in-the-wild training of depth and pose networks. Furthermore, our method is efficient, with only one extra parameter relative to the pinhole model. This enables learning depth for highly-distorted catadioptric cameras at a much higher resolution than previous over-parametrized models ( $1024\times 1024$ vs. $384\times 384$ for Vasiljevic et al. [44]). Note that, in contrast to prior works [4, 44], we learn intrinsics per-sequence rather than per-frame. This increases stability compared to per-frame methods that exhibit frame-to-frame variability [44], and can be used over sequences of varying sizes.

2.1.4 Experiments

In this section we describe two sets of experimental validations for our architecture: (i) calibration, where we find that the re-projection error of our learned camera parameters compares favorably to target-based traditional calibration toolboxes; and (ii) depth evaluation, where we achieve state-of-the-art results on the challenging EuRoC MAV dataset.

2.1.4.1 Datasets

Self-supervised depth and ego-motion learning uses monocular sequences [43, 67, 4, 68] or rectified stereo pairs [67, 69] from forward-facing cameras [70, 68, 71]. Given that our goal is to learn camera calibration from raw videos in challenging settings, we use the standard KITTI dataset as a baseline, and focus on the more challenging and distorted EuRoC [3] fisheye sequences.

KITTI [70] We use this dataset to show that our self-calibration procedure is able to accurately recover pinhole intrinsics alongside depth and ego-motion. Following related work [43, 67, 4, 68] we use the training protocol of [72], including filtering static images as described by Zhou et al. [43]. The resulting training set contains of $39810$ images, with $697$ images left for evaluation.

EuRoC [3] The dataset consists of a set of indoor MAV sequences with general six-DoF motion. Consistent with recent work [4], we train using center-cropping and down-sample the images to a $384\times 256$ resolution, while training and evaluating on the same split. For calibration evaluation, we follow Usenko et al. [1] and use the calibration sequences from the dataset. We evaluate the UCM, EUCM and DS camera models in terms of re-projection error.

OmniCam [73] A challenging outdoor catadioptric sequence, containing 12000 frames captured by an autonomous car rig. As this dataset does not provide ground-truth depth information, we only provide qualitative results.

2.1.4.2 Training Protocal

We implement the group of unified camera models described in [1] as differentiable PyTorch [74] operations, modifying the self-supervised depth and pose architecture of Godard et al. [67] to jointly learn depth, pose, and the unified camera model intrinsics. We use a learning rate of $2$ e- $4$ for the depth and pose network and $1$ e- $3$ for the camera parameters. We use a StepLR scheduler with $\gamma=0.5$ and a step size of $30$ . All of the experiments are run for $50$ epochs. The images are augmented with random vertical and horizontal flip, as well as color jittering. We train our models on a Titan X GPU with 12 GB of memory, with a batch size of $16$ when training on images with a resolution of $384\times 256$ . We note that our method requires significantly less memory than that of Vasiljevic et al. [44] which learns a generalized camera model parameterized through a per-pixel ray surface.

2.1.4.3 Camera Self-Calibration

We evaluate the results of the proposed self-calibration method on the EuRoC dataset; detailed depth estimation evaluations are provided in Sec. 2.1.4.6. To our knowledge, ours is the first direct calibration evaluation of self-supervised intrinsics learning; although Gordon et al. [4] compare ground-truth calibration to their per-frame model, they do not evaluate the re-projection error for their learned parameters.

Following Usenko et al. [2], we evaluate our self-supervised calibration method on the family of unified camera models: UCM, EUCM, and DS, as well as the perspective (pinhole) model. As a lower bound, we use the Basalt [2] toolbox and compute camera calibration parameters for each unified camera model using the calibration sequences of the EuRoC dataset. We note that unlike Basalt, our method regresses the intrinsic calibration parameters directly from raw videos, without using any of the calibration sequences.

Table 2.1 summarizes our re-projection error results. We use the EuRoC AprilTag [75] calibration sequences with Basalt to measure re-projection error using the full estimation procedure (Table 2.1 — Target-based) and learned intrinsics (Table 2.1 — Learned). For consistency, we optimize for both intrinsics and camera poses for the baselines and only for the camera poses for the learned intrinsics evaluation. Note that with learned intrinsics, UCM, EUCM and DS models all achieve sub-pixel mean projection error despite the camera parameters having been learned from raw video data.

Table 2.2 compares the target-based calibrated parameters to our learned parameters for different camera models trained on the cam0 sequences of the EuRoC dataset. Though the parameter vectors were initialized with no prior knowledge of the camera model and updated purely based on gradients from the reprojection error, they converge to values very close to the output of a procedure that uses bundle adjustment on calibrated image sequences.

2.1.4.4 Camera Rectification

Using our learned camera parameters, we rectify calibration sequences on the EuRoC dataset to demonstrate the quality of the calibration. EuRoC was captured with a fisheye camera and exhibits a high degree of radial distortion that causes the straight edges of the checkerboard grid to be curved. In Figure 2.4, we see that our learned parameters allow for the rectified grid to track closely to the true underlying checkerboard.

2.1.4.5 Camera Re-calibration: Perturbation Experiments

Thus far, we have assumed to have no prior knowledge of the camera calibration. In many real-world robotics settings, however, one may want to re-calibrate a camera based on a potentially incorrect prior calibration. Generally, this requires the capture of new calibration data. Instead, we can initialize our parameter vectors with this initial calibration (in this setting, a perturbation of Basalt calibration of the EUCM model) and see the extent to which self-supervision can nudge the parameters back to their “true value”.

Given Basalt parameters $I_{c}=[f_{x},f_{y},c_{x},c_{y},\alpha,\beta]$ , we perturb them as $I_{1.1}=1.1\times I_{c}$ , $I_{1.05}=1.05\times I_{c}$ , $I_{0.95}=0.95\times I_{c}$ , $I_{0.9}=0.9\times I_{c}$ and initialize the camera parameters at the beginning of training with these values. All runs have warm start, i.e., freezing the gradients for the intrinsics for the first $10$ epochs while we train the depth and pose networks. As Figure 2.5 shows, our method converges to within $3\%$ of the Basalt estimate for each parameter. Table 2.3 provides the values of the converged parameters along with the mean projection error (MRE) for each experiment.

2.1.4.6 Depth Estimation

While we use depth and pose estimation as proxy tasks for camera self-calibration, the unified camera model framework allows us to achieve meaningful results compared to prior camera-learning-based approaches (see Figures 2.6 and 2.7).

KITTI results. Table 2.4 presents the results of our method on the KITTI dataset. We note that our approach is able to model the simple pinhole setting, achieving results that are on par with approaches that are tailored specifically to this camera geometry. Interestingly, we see an increase in performance using the UCM model, which we attribute to the ability to further account for and correct calibration errors.

EuRoC results. Compared to KITTI, EuRoC is a significantly more challenging dataset that involves cluttered indoor sequences with six-DoF motion. Compared to the per-frame distorted camera models of Gordon et al. [4] and Vasiljevic et al. [44], we achieve significantly better absolute relative error, especially with EUCM, where the error is reduced by $16\%$ (see Table 2.5). We also train NRS [44] on this dataset for further comparison, using the official repository.

Combining heterogeneous datasets. One of the strengths of the unified camera model is that it can represent a wide variety of cameras without prior knowledge of their specific geometry. As long as we know which sequences come from which camera, we can learn separate calibration vectors that share the same depth and pose networks. This is particularly useful as a way to improve performance on smaller datasets, since it enables one to take advantage of unlabeled data from other sources. To evaluate this property, we experimented with mixing KITTI and EuRoC. In this experiment, we reshaped the KITTI images to match those in the EuRoC dataset (i.e., $384\times 256$ ). As Table 2.6 shows, our algorithm is able to take advantage of the KITTI images to improve performance on the EuRoC depth evaluation.

2.1.4.7 Computational Cost

Our work is closely related to the learned general camera model (NRS) of Vasiljevic et al. [44] given that in both works the parameters of a central general camera model are learned in a self-supervised way. Being a per-pixel model, NRS is more general than ours and can handle settings where there is local distortion, which a global camera necessarily cannot model. However, the computational requirements of the per-pixel NRS are significantly higher. For example, we train on EuRoC images with a resolution of $384\times 256$ with a batch size of $16$ , which consumes about $6$ GB of GPU memory. Each epoch takes about $15$ minutes.

On the same GPU, NRS uses $16$ GB of GPU memory with a batch size of $1$ to train on the same sequences, running one epoch in about $120$ minutes. This is due to the high-dimensional (yet approximate) projection operation required for a generalized camera. Thus, we trade some degree of generality for significantly higher efficiency than prior work, with higher accuracy on the EuRoC dataset (see Table 2.5).

2.1.5 Conclusion

We proposed a procedure to self-calibrate a family of general camera models using self-supervised depth and pose estimation as a proxy task. We rigorously evaluated the quality of the resulting camera models, demonstrating sub-pixel calibration accuracy comparable to manual target-based toolbox calibration approaches. Our approach generates per-sequence camera parameters, and can be integrated into any learning procedure where calibration is needed and the projection and un-projection operations are interpretable and differentiable. As shown in our experiments, our approach is particularly amenable to online re-calibration, and can be used to combine datasets of different sources, learning independent calibration parameters while sharing the same depth and pose network.

2.2 Implicit generalizable depth field modeling by 3D data augmentation: DeFiNe

Modern 3D computer vision leverages learning to boost geometric reasoning, mapping image data to classical structures such as cost volumes or epipolar constraints to improve matching. These architectures are specialized according to the particular problem, and thus require significant task-specific tuning, often leading to poor domain generalization performance. Recently, generalist Transformer architectures have achieved impressive results in tasks such as optical flow and depth estimation by encoding geometric priors as inputs rather than as enforced constraints. In this paper, we extend this idea and propose to learn an implicit, multi-view consistent scene representation, introducing a series of 3D data augmentation techniques as a geometric inductive prior to increase view diversity. We also show that introducing view synthesis as an auxiliary task further improves depth estimation. Our Depth Field Network (DeFiNe) achieve state-of-the-art results in stereo and video depth estimation without explicit geometric constraints, and improve on zero-shot domain generalization by a wide margin.

2.2.1 Introduction

Estimating 3D structure from a pair of images is a cornerstone problem of computer vision. Traditionally, this is treated as a correspondence problem, whereby one applies a homography to stereo-rectify the images and then matches pixels (or patches) along epipolar lines to obtain disparity estimates. Contemporary approaches to stereo are specialized variants of classical methods, relying on correspondences to compute cost volumes, epipolar losses, bundle adjustment objectives, or projective multi-view constraints, among others. These are either baked directly into the model architecture or enforced as part of the loss function.

Applying the principles of classical vision in this way has given rise to architectures that achieve state-of-the-art results on tasks such as stereo depth estimation [76, 77], optical flow [78], and multi-view depth [79]. However, this success comes at a cost: each architecture is specialized and purpose-built for a single task. Great strides have been made to alleviate the dependence on strong geometric assumptions [4, 44], and two recent trends allow us to decouple the task from the architecture: (i) implicit representations and (ii) generalist networks. Our work draws upon both of these directions.

Implicit representations of geometry and coordinate-based networks have recently achieved incredible popularity in the vision community. This direction is pioneered by work on neural radiance fields (NeRF) [32, 80], where a point- and ray-based parameterization along with a volume rendering objective allow simple MLP-based networks to achieve state-of-the-art view synthesis results. Follow-up works extend this coordinate-based representation to the pixel domain [81], allowing predicted views to be conditioned on image features. The second trend in computer vision has been the use of generalist architectures, pioneered by Vision Transformers [82]. Emerging as an attention-based architecture for NLP, Transformers have been used for a diverse set of tasks, including depth estimation [83, 84], optical flow [85], and image generation [86]. Transformers have also been applied to geometry-free view synthesis [87], demonstrating that attention can learn long-range correspondence between views for 2D-3D tasks. Representation Transformers (SRT) [88] use the Transformer encoder-decoder model to learn scene representations for view synthesis from sparse, high-baseline data with no geometric constraints. However, owing to the quadratic scaling of the self-attention module, experiments are limited to low-resolution images and require very long training periods.

To alleviate the quadratic complexity of self-attention, the Perceiver architecture [89] disentangles the dimensionality of the latent representation from that of the inputs by fixing the size of the latent representation. Perceiver IO [85] extends this architecture to allow for arbitrary outputs, with results on optical flow estimation that outperform traditional cost-volume based methods. Similarly, the recent Input-level Inductive Bias (IIB) architecture [17] uses image features and camera information as input to Perceiver IO to directly regress stereo depth, outperforming baselines that use explicit geometric constraints. Building upon these works, we propose to learn a geometric scene representation for depth synthesis from novel viewpoints, including estimation, interpolation, and extrapolation. We expand the IIB framework to the scene representation setting, taking sequences of images and predicting a consistent multi-view latent representation suitable for different downstream tasks. Taking advantage of the query-based nature of the Perceiver IO architecture, we propose a series of 3D augmentations that increase viewpoint density and diversity during training, thus encouraging (rather than enforcing) multi-view consistency. Furthermore, we show that the introduction of view synthesis as an auxiliary task, decoded from the same latent representation, improves depth estimation without additional ground truth.

We test our model on the popular ScanNet benchmark [90], achieving state-of-the-art real-time results for stereo depth estimation and competitive results for video depth estimation, without relying on memory- or compute-intensive operations such as cost volume aggregation and test-time optimization. We show that our 3D augmentations lead to significant improvements over baselines that are limited to the viewpoint diversity of training data. Furthermore, our zero-shot transfer results from ScanNet to 7-Scenes [91] improve the state-of-the-art by a large margin, demonstrating that our method generalizes better than specialized architectures, which suffer from poor performance on out-of-domain data. Our contributions are summarized as follows:

•

We use a generalist Transformer-based architecture to learn a depth estimator from an arbitrary number of posed images. In this setting, we (i) propose a series of 3D augmentations that improve the geometric consistency of our learned latent representation; and (ii) show that jointly learning view synthesis as an auxiliary task improves depth estimation.

•

Our Depth Field Networks (DeFiNe) not only achieve state-of-the-art stereo depth estimation results on the widely used ScanNet dataset, but also exhibit superior generalization properties with state-of-the-art results on zero-shot transfer to 7-Scenes.

•

DeFiNe also enables depth estimation from arbitrary viewpoints. We evaluate this novel generalization capability in the context of interpolation (between timesteps), and extrapolation (future timesteps).

2.2.2 Related Work

2.2.2.1 Monocular Depth Estimation.

Supervised depth estimation—the task of estimating per-pixel depth given an RGB image and a corresponding ground-truth depth map—dates back to the pioneering work of Saxena et al. [92]. Since then, deep learning-based architectures designed for supervised monocular depth estimation have become increasingly sophisticated [93, 72, 94, 95, 96], generally offering improvements over the standard encoder-decoder convolutional architecture. Self-supervised methods provide an alternative to those that rely on ground-truth depth maps at training time, and are able to take advantage of the new availability of large-scale video datasets. Early self-supervised methods relied on stereo data [97], and then progressed to fully monocular video sequences [43], with increasingly sophisticated losses [98] and architectures [15, 68, 99].

In a follow-up study, ZeroDepth [100] extends our input-level geometry bias framework to metric-aware monocular depth estimation with SOTA results.

2.2.2.2 Multi-view Stereo.

Traditional multi-view stereo approaches have dominated even in the deep learning era. COLMAP [101] remains the standard framework for structure-from-motion, incorporating sophisticated bundle adjustment and keypoint refinement procedures, at the cost of speed. With the goal of producing closer to real-time estimates, multi-view stereo learning approaches adapt traditional cost volume-based approaches to stereo [76, 102] and multi-view [103, 104] depth estimation, often relying on known extrinsics to warp views into the frame of the reference camera. Recently, iterative refinement approaches that employ recurrent neural networks have made impressive strides in optical flow estimation [78]. Follow-on work applies this general recurrent correspondence architecture to stereo depth [77], scene-flow [105], and even SLAM [106]. While their results are impressive, recurrent neural networks can be difficult to train, and test-time optimization increases inference time over a single forward pass.

Recently, Transformer-based architectures [107] have replaced CNNs in many geometric estimation tasks. The Stereo Transformer [83] architecture replaces cost volumes with a correspondence approach inspired by sequence-to-sequence modeling. The Perceiver IO [85] architecture constitutes a large departure from cost volumes and geometric losses. For the task of optical flow, Perceiver IO feeds positionally encoded images through a Transformer [89], rather than using a cost volume for processing. IIB [17] adapts the Perceiver IO architecture to generalized stereo estimation, proposing a novel epipolar parameterization as an additional input-level inductive bias. Building upon this baseline, we propose a series of geometry-preserving 3D data augmentation techniques designed to promote the learning of a geometrically-consistent latent scene representation. We also introduce novel view synthesis as an auxiliary task to depth estimation, decoded from the same latent space. Our video-based representation (aided by our 3D augmentations) allows us to generalize to novel viewpoints, rather than be restricted to the stereo setting.

In a subsequent work, [108] builds a unified flow, stereo and depth estimation, with a similar principle to learn multi-modal feature using cross-attention.

2.2.2.3 Video Depth Estimation.

Video and stereo depth estimation methods generally produce monocular depth estimates at test time. ManyDepth [99] combines a monocular depth framework with multi-view stereo, aggregating predictions in a cost volume and thus enabling multi-frame inference at test-time. Recent methods accumulate predictions at train and test time, either with generalized stereo [109] or with sequence data [110]. DeepV2D [79] incorporates a cost-volume based multi-view stereo approach with an incremental pose estimator to iteratively improve depth and pose estimates at train and test time.

Another line of work draws on the availability of monocular depth networks that perform accurate but multi-view inconsistent estimates at test time [111]. In this setting, additional geometric constraints are enforced to finetune the network and improve multi-view consistency through epipolar constraints. Consistent Video Depth Estimation [111] refines COLMAP [101] results with a monocular depth network constrained to be multi-view consistent. Subsequent work jointly optimizes depth and pose for added robustness to challenging scenarios with poor calibration [112]. A recent framework incorporates many architectural elements of prior work into a Transformer-based architecture that takes video data as input for multi-view depth [113]. NeuralRecon [114] moves beyond depth-based architectures to learn Truncated Signed Distance Field (TSDF) volumes as a way to improve surface consistency.

2.2.2.4 Novel View Synthesis.

Since the emergence of neural radiance fields [32], implicit representations and volume rendering have emerged as the de facto standard for view synthesis. They parameterize viewing rays and points with positional encoding, and need to be re-trained on a scene-by-scene basis. Many recent improvements leverage depth supervision to improve view synthesis in a volume rendering framework [115, 116, 117, 118, 119]. An alternative approach replaces volume rendering with a directly learned light field network [120], predicting color values directly from viewing rays. This is the approach we take when estimating the auxiliary view synthesis loss, due to its computational simplicity. Other works attempt to extend the NeRF approach beyond single scene models by incorporating learned features, enabling few-shot volume rendering [81]. Feature-based methods have also treated view synthesis as a sequence-learning task, such as the Scene Representation Transformer (SRT) architecture [88].

Building upon the same principle, DeLiRa [18] extends the output space to radiance field and light field, showing an improvement of performance when self-supervised across the output fields.

2.2.3 Methodology

Our proposed DeFiNe architecture (Figure 2.9(a)) is designed with flexibility in mind, so data from different sources can be used as input and different output tasks can be estimated from the same latent space. Similar to Yifan et al. [17], we use Perceiver IO [89] as our general-purpose Transformer backbone. During the encoding stage, our model takes RGB images from calibrated cameras, with known intrinsics and relative poses. The architecture processes this information according to the modality into different pixel-wise embeddings that serve as input to our Perceiver IO backbone. This encoded information can be queried using only camera embeddings, producing estimates from arbitrary viewpoints.

2.2.3.1 Perceiver IO

Perceiver IO [85] is a recent extension of the Perceiver [89] architecture. The Perceiver architecture alleviates one of the main weaknesses of Transformer-based methods, namely the quadratic scaling of self-attention with input size. This is achieved by using a fixed-size $N_{l}\times C_{l}$ latent representation $\mathcal{R}$ , and learning to project high-dimensional $N_{e}\times C_{e}$ encoding embeddings onto this latent space using cross-attention layers. Self-attention is performed in the lower-dimensional latent space, producing a conditioned latent representation $\mathcal{R}_{c}$ that can be queried using $N_{d}\times C_{d}$ decoding embeddings to generate estimates, again using cross-attention layers.

2.2.3.2 Input-Level Embeddings

Image Embeddings (Figure 2.9(b)). Input $3\times H\times W$ images are processed using a ResNet18 [121] encoder, producing a list of features maps at increasingly lower resolutions and higher dimensionality. Feature maps at $1/4$ the original resolution are concatenated with lower-resolution feature maps, after upsampling using bilinear interpolation. The resulting image embeddings are of shape $H/4\times W/4\times 960$ , and are used in combination with the camera embeddings from each corresponding pixel (see below) to encode visual information.

Camera Embeddings (Figure 2.9(c)).

These embeddings capture multi-view scene geometry (e.g., camera intrinsics and extrinsics) as additional inputs during the learning process. Let $\textbf{x}_{ij}=(u,v)$ be an image coordinate corresponding to pixel $i$ in camera $j$ , with assumed known pinhole $3\times 3$ intrinsics $\mathbf{K}_{j}$ and $4\times 4$ transformation matrix $T_{j}=\left[\begin{smallmatrix}\mathbf{R}_{j}&\textbf{t}_{j}\\ \textbf{0}&1\end{smallmatrix}\right]$ relative to a canonical camera $T_{0}$ . Its origin $\mathbf{o}_{j}$ and direction $\textbf{r}_{ij}$ are given by:

[TABLE]

Note that this formulation differs from the standard convention [32], which does not consider the camera translation $\textbf{t}_{j}$ when generating viewing rays $\textbf{r}_{ij}$ . We ablate this variation in Table 2.7, showing that it leads to better performance for the task of depth estimation. These two vectors are then Fourier-encoded dimension-wise to produce higher-dimensional vectors, with a mapping of:

[TABLE]

where $K$ is the number of Fourier frequencies used ( $K_{o}$ for the origin and $K_{r}$ for the ray directions), equally spaced in the interval $[1,\frac{\mu}{2}]$ . The resulting camera embedding is of dimensionality $2\big{(}3(K_{o}+1)+3(K_{r}+1)\big{)}=6\left(K_{o}+K_{r}+2\right)$ . During the encoding stage, camera embeddings are produced per-pixel assuming a camera with $1/4$ the original input resolution, resulting in a total of $\frac{HW}{16}$ vectors. During the decoding stage, embeddings from cameras with arbitrary calibration (i.e., intrinsics and extrinsics) can be queried to produce virtual estimates.

2.2.3.3 Geometric 3D Augmentations

Data augmentation is a core component of deep learning pipelines [122] that improves model robustness by applying transformations to the training data consistent with the data distribution in order to introduce desired equivariant properties. In computer vision and depth estimation in particular, standard data augmentation techniques are usually constrained to the 2D space and include color jittering, flipping, rotation, cropping, and resizing [15, 17]. Recent works have started looking into 3D augmentations [88] to improve robustness to errors in scene geometry in terms of camera localization (i.e., extrinsics) and parameters (i.e., intrinsics). Conversely, we are interested in encoding scene geometry at the input-level, so our architecture can learn a multi-view-consistent geometric latent scene representation. Therefore, in this section we propose a series of 3D augmentations to increase the number and diversity of training views while maintaining the spatial relationship between cameras, thus enforcing desired equivariant properties within this setting.

Virtual Camera Projection. One of the key properties of our architecture is that it enables querying from arbitrary viewpoints, since only camera information (viewing rays) is required at the decoding stage. When generating predictions from these novel viewpoints, the network creates virtual information consistent with the implicit structure of the learned latent scene representation, conditioned on information from the encoded views. We evaluate this capability in Section 2.2.4.5, showing superior performance relative to the explicit projection of information from encoded views. Here, we propose to leverage this property at training time as well, generating additional supervision in the form of virtual cameras with corresponding ground-truth RGB images and depth maps obtained by projecting available information onto these new viewpoints (Figure 2.10(a)). This novel augmentation technique forces the learned latent scene representation to be viewpoint-independent. Experiments show that this approach provides benefits in both the (a) stereo setting, with only two viewpoints; and (b) video setting, with similar viewpoints and a dominant camera direction.

From a practical perspective, virtual cameras are generated by adding translation noise $\bm{\epsilon}_{v}=[\epsilon_{x},\epsilon_{y},\epsilon_{z}]_{v}\sim\mathcal{N}(0,\sigma_{v})$ to the pose of a camera $i$ . The viewing angle is set to point towards the center $\textbf{c}_{i}$ of the pointcloud $P_{i}$ generated by unprojecting information from the selected camera, which is also perturbed by $\bm{\epsilon}_{c}=[\epsilon_{x},\epsilon_{y},\epsilon_{z}]_{c}\sim\mathcal{N}(0,\sigma_{v})$ . When generating ground-truth information, we project the combined pointcloud from all available cameras onto these new viewpoints as a way to preserve full scene geometry. Furthermore, because the resulting RGB image and depth map will be sparse, we can improve efficiency by only querying at these specific locations.

Canonical Jittering. When operating in a multi-camera setting, it is standard practice to select one camera to be the reference camera, and position all other cameras relative to it [104]. One drawback of this convention is that one camera will always be at the same location (the origin of its own coordinate system) and will therefore produce the same camera embeddings, leading to overfitting. Intuitively, scene geometry should be invariant to the translation and rotation of the entire sensor suite. To enforce this property on our learned latent scene representation, we propose to inject some amount of noise to the canonical pose itself, so it is not located at the origin of the coordinate system. Note that this is different from methods that inject per-camera noise [123] with the goal of increasing robustness to localization errors. We only inject noise once, on the canonical camera, and propagate it to other cameras, so relative scene geometry is preserved within a translation and rotation offset (Figure 2.10(b)). However, this offset is reflected on the input-level embeddings produced by each camera, and thus forces the latent representation to be invariant to these transformations.

In order to perform canonical jittering, we randomly sample translation $\bm{\epsilon}_{t}=[\epsilon_{x},\epsilon_{y},\epsilon_{z}]^{\top}\sim\mathcal{N}(0,\sigma_{t})$ and rotation $\bm{\epsilon}_{r}=[\epsilon_{\phi},\epsilon_{\theta},\epsilon_{\psi}]^{\top}\sim\mathcal{N}(0,\sigma_{r})$ errors from zero-mean normal distributions with pre-determined standard deviations. Represented as Euler angles, we convert each set of rotation errors to a $3\times 3$ rotation matrix $\mathbf{R}_{r}$ . We then use the rotation matrix and translation error to create a jittered canonical transformation matrix $T_{0}^{\prime}=\left[\begin{smallmatrix}\mathbf{R}_{r}&\bm{\epsilon}_{t}\\ \textbf{0}&1\end{smallmatrix}\right]$ that is then propagated to all other $N$ cameras, such that $T_{i}^{\prime}=T_{0}^{\prime}\cdot T_{i}$ , $\forall i\in\{1,\dots,N-1\}$ .

Canonical Randomization. As an extension to canonical jittering, we also introduce canonical randomization to encourage generalization to different relative camera configurations, while still preserving scene geometry. Assuming $N$ cameras, we randomly select $o\in\{0,\dots,N-1\}$ as the canonical index. We then compute the relative transformation matrix $T_{i}^{\prime}$ given world-frame transformation matrix $T_{i}$ as $T_{i}^{\prime}=T_{i}\cdot T_{o}^{-1}$ $\forall i\in\{0,\dots,N-1\}$ .

2.2.3.4 Decoders

We use task-specific decoders, each consisting of one cross-attention layer between the $N_{d}\times C_{d}$ queries and the $N_{l}\times C_{l}$ conditioned latent representation $\mathcal{R}_{c}$ , followed by a linear layer that creates an output of size $N_{d}\times C_{o}$ , and a sigmoid activation function $\sigma(x)=\frac{1}{1+e^{-x}}$ to produce values in the interval $[0,1]$ . We set $C_{o}^{d}=1$ for depth estimation and $C_{o}^{s}=3$ for view synthesis. Depth estimates are scaled to lie within the range $[d_{\text{min}},d_{\text{max}}]$ . Note that other decoders can be incorporated with DeFiNe without any modification to the underlying architecture, enabling the generation of multi-task estimates from arbitrary viewpoints.

2.2.3.5 Losses

We use an L1-log loss $\mathcal{L}_{d}=\lVert\log(d_{ij})-\log(\hat{d}_{ij})\rVert_{1}$ to supervise depth estimation, where $\hat{d}_{ij}$ and $d_{ij}$ are depth estimates and ground truth, respectively, for pixel $j$ at camera $i$ . For view synthesis, we use an L2 loss $\mathcal{L}_{s}=\lVert\textbf{p}_{ij}-\hat{\textbf{p}}_{ij}\rVert^{2}$ , where $\hat{\textbf{p}}_{ij}$ and $\textbf{p}_{ij}$ are RGB estimates and ground truth, respectively, for pixel $j$ at camera $i$ . We use a weight coefficient $\lambda_{s}$ to balance these two losses, and another $\lambda_{v}$ to balance losses from available and virtual cameras. The final loss is of the form:

[TABLE]

Note that because our architecture enables querying at specific image coordinates, we can improve efficiency at training time by not computing estimates for pixels without ground truth (e.g., sparse depth maps or virtual cameras).

2.2.4 Experiments

2.2.4.1 Datasets

ScanNet [90] We evaluate our DeFiNe for both stereo and video depth estimation using ScanNet, an RGB-D video dataset that contains $2.5$ million views from around $1500$ scenes. For the stereo experiments, we follow the same setting as Kusupati et al. [124]: we downsample scenes by a factor of $20$ and use a custom split to create stereo pairs, resulting in $94212$ training and $7517$ test samples. For the video experiments, we follow the evaluation protocol of Teed et al. [79], with a total of $1405$ scenes for training. For the test set, we use a custom split to select $2000$ samples from $90$ scenes not covered in the training set. Each training sample includes a target frame and a context of $[-3,3]$ frames with stride $3$ . Each test sample includes a pair of frames, with a context of $[-3,3]$ relative to the first frame of the pair with stride $3$ .

7-Scenes [91] We also evaluate on the test split of 7-Scenes to measure zero-shot cross-dataset performance. Collected using KinectFusion [11], the dataset consists of $640\times 480$ images in $7$ settings, with a variable number of scenes in each setting. There are $500$ – $1000$ images in each scene. We follow the evaluation protocol of Sun et al. [114], median-scaling predictions using ground-truth information before evaluation.

2.2.4.2 Stereo Depth Estimation

To test the benefits our proposed geometric 3D augmentation procedures over the IIB [17] baseline, we first evaluate DeFiNe on the task of stereo depth estimation. Here, because each sample provides minimal information about the scene (i.e., only two frames), the introduction of additional virtual supervision should have the largest effect. We report our results in Figure 2.11a and visualize examples of reconstructed pointclouds in Figure 2.12. DeFiNe significantly outperforms other methods on this dataset, including IIB. Our virtual view augmentations lead to a large ( $20\%$ ) relative improvement, showing that DeFiNe benefits from a scene representation that encourages multi-view consistency.

2.2.4.3 Ablation Study

We perform a detailed ablation study to evaluate the effectiveness of each component in our proposed architecture, with results shown in Table 2.7. Firstly, we evaluate performance when (Table 2.7:1) learning only depth estimation, and see that the joint learning of view synthesis as an auxiliary task leads to significant improvements. The claim that depth estimation improves view synthesis has been noted before [116, 118], and is attributed to the well-known fact that multi-view consistency facilitates the generation of images from novel viewpoints. However, our experiments also show the inverse: view synthesis improves depth estimation. Our hypothesis is that appearance is required to learn multi-view consistency since it enables visual correlation across frames. By introducing view synthesis as an additional task, we are also encoding appearance-based information into our latent representation. This leads to improvements in depth estimation even though no explicit feature matching is performed at an architecture or loss level.

We also ablate different variations of our RGB encoder for the generation of image embeddings and show that (Table 2.7:2) our proposed multi-level feature map concatenation (Figure 2.9(b)) leads to the best results relative to the standard single convolutional layer, or (Table 2.7:3) using $1/4$ -resolution ResNet18 64-dimensional feature maps. Similarly, we also ablate some of our design choices, namely (Table 2.7:4) the use of camera embeddings instead of positional encodings; (Table 2.7:5) global viewing rays (Section 2.2.3.2) instead of traditional relative viewing rays; (Table 2.7:6) the use of $\lambda_{s}=1$ in the loss calculation (Equation 2.7) such that both depth and view synthesis tasks have equal weights; and (Table 2.7:7) the use of epipolar cues as additional geometric embeddings, as proposed by IIB [17]. As expected, camera embeddings are crucial for multi-view consistency and global viewing rays improve over the standard relative formulation. Interestingly, using a smaller $\lambda_{s}$ degrades depth estimation performance, providing further evidence that the joint learning of view synthesis is beneficial for multi-view consistency. We did not observe meaningful improvements when incorporating the epipolar cues from IIB [17], indicating that DeFiNe is capable of directly capturing these constraints at an input-level due to the increase in viewpoint diversity. Lastly, we ablate the impact of our various proposed geometric augmentations (Section 2.2.3.3), showing that they are key to our reported state-of-the-art performance.

Lastly, we evaluate depth estimation from virtual cameras, using different noise levels $\sigma_{v}$ at test time. We also train models using different noise levels and report the results in Figure 2.11(b). From these results, we can see that the optimal virtual noise level, when evaluating at the target location, is $\sigma_{v}=0.25$ m (yellow line), relative to the baseline without virtual noise (blue line). However, models trained with higher virtual noise (e.g., the orange line, with $\sigma_{v}=1$ m) are more robust to larger deviations from the target location.

2.2.4.4 Video Depth Estimation

To highlight the flexibility of our proposed architecture, we also experiment using video data from ScanNet following the training protocol of Tang et al. [125]. We evaluate performance on both ScanNet itself, using their evaluation protocol [125], as well as zero-shot transfer (without fine-tuning) to the 7-Scenes dataset. Table 2.8 reports quantitative results, while Figure 2.13 provides qualitative examples. On ScanNet, DeFiNe outperforms most published approaches, significantly improving over DeMoN [109], BA-Net [125], and CVD [111] both in terms of performance and speed. We are competitive with DeepV2D [79] in terms of performance, and roughly $14\times$ faster, owing to the fact that DeFiNe does not require bundle adjustment or any sort of test-time optimization. In fact, our inference time of $49$ ms can be split into $44$ ms for encoding and only $5$ ms for decoding, enabling very efficient generation of depth maps after information has been encoded once. The only method that outperforms DeFiNe in terms of speed is NeuralRecon [114], which uses a sophisticated TSDF integration strategy. Performance-wise, we are also competitive with NeuralRecon, improving over their reported results in one of the three metrics (Sq. Rel).

Next, we evaluate zero-shot transfer from ScanNet to 7-Scenes, which is a popular test of generalization for video depth estimation. In this setting, DeFiNe significantly improves over all other methods, including DeepV2D (which fails to generalize) and NeuralRecon ( $\sim$$40\%$ improvement). We attribute this large gain to the highly intricate and specialized nature of these other architectures. In contrast, our method has no specialized module and instead learns a geometrically-consistent multi-view latent representation.

In summary, we achieve competitive results on ScanNet and significantly improve the state-of-the-art for video depth generalization, as evidenced by the large gap between DeFiNe and the best-performing methods on 7-Scenes.

2.2.4.5 Depth from Novel Viewpoints

We previously discussed the strong performance that DeFiNe achieves on traditional depth estimation benchmarks, and showed how it improves out-of-domain generalization by a wide margin. Here, we explore another aspect of generalization that our architecture enables: viewpoint generalization. This is possible because, in addition to traditional depth estimation from RGB images, DeFiNe can also generate depth maps from arbitrary viewpoints since it only requires camera embeddings to decode estimates. We explore this capability in two different ways: interpolation and extrapolation. When interpolating, we encode frames at $\{t-5,t+5\}$ , and decode virtual depth maps at locations $\{t-4,\dots,t+4\}$ . When extrapolating, we encode frames at $\{t-5,\dots,t-1\}$ , and decode virtual depth maps at locations $\{t,\dots,t+8\}$ . We use the same training and test splits as in our stereo experiments, with a downsampling factor of $20$ to encourage smaller overlaps between frames. As baselines for comparison, we consider the explicit projection of 3D information from encoded frames onto these new viewpoints. We evaluate both standard depth estimation networks [15, 68, 96] as well as DeFiNe itself, that can be used to either explicitly project information from encoded frames onto new viewpoints (projection), or query from the latent representation at that same location (query).

Figure 2.14 reports results in terms of root mean squared error (RMSE) considering only valid projected pixels. Of particular note, our multi-frame depth estimation architecture significantly outperforms other single-frame baselines. However, and most importantly, results obtained by implicit querying consistently outperform those obtained via explicit projection. This indicates that our model is able to improve upon available encoded information via the learned latent representation. Furthermore, we also generate geometrically consistent estimates for areas without valid explicit projections (Figure 2.14(c)). As the camera tilts to the right, the floor is smoothly recreated in unobserved areas, as well as the partially observed bench. Interestingly, the chair at the extreme right was not recreated, which could be seen as a failure case. However, because the chair was never observed in the first place, it is reasonable for the model to assume the area is empty, and recreate it as a continuation of the floor.

2.2.5 Conclusion

We introduced Depth Field Networks (DeFiNe), a generalist framework for training multi-view consistent depth estimators. Rather than explicitly enforcing geometric constraints at an architecture or loss level, we use geometric embeddings to condition network inputs, alongside visual information. To learn a geometrically-consistent latent representation, we propose a series of 3D augmentations designed to promote viewpoint, rotation, and translation equivariance. We also show that the introduction of view synthesis as an auxiliary task improves depth estimation without requiring additional ground truth. We achieve state-of-the-art results on the popular ScanNet stereo benchmark, and competitive results on the ScanNet video benchmark with no iterative refinement or explicit geometry modeling. We also demonstrate strong generalization properties by achieving state-of-the-art results on zero-shot transfer from ScanNet to 7-Scenes. The general nature of our framework enables many exciting avenues for future work, including additional tasks such as optical flow, extension to dynamic scenes, spatio-temporal representations, and uncertainty estimation.

2.3 Implicit representation as modular maps for scalable scene modelling: NeRFuser

We present NeRFuser , a novel framework that extends the representational capacity of neural radiance fields (NeRFs) to produce high-fidelity representations of large-scale scenes. Integral to our approach is the decomposition of spatially extended environments into a collection of small-scale scenes, each represented by an individual NeRF model (i.e., a sub-map) produced by one or more agents. Critically, we only assume access to these individual NeRFs and not any of the original training images or the poses with which they were trained, nor the relative transformations between the NeRFs. Towards this goal, NeRFuser/space adopts a two-stage procedure that consists of NeRF registration and NeRF blending. For registration, we propose registration from re-rendering, a technique that infers the transformation between NeRFs based on images synthesized from individual NeRFs, with an effective image quality measure distant accumulation for pose filtering. For blending, we propose sample-based inverse distance weighting to blend visual information at the ray sample level. We evaluate NeRFuser/space on an indoor dataset and two public benchmarks, showing state-of-the-art results on test views including those that are otherwise challenging to render for individual source NeRFs.

2.3.1 Introduction

In order to effectively carry out complex tasks, robots require rich representations of their environments. As such, the robotics community has paid significant attention to the problem of scene reconstruction, particularly with regard to scaling to spatially extended environments [129]. Neural radiance fields (NeRFs) [130, 20] provide a powerful means of generating high-fidelity, memory-efficient 3D scene representations from commodity (i.e., low-cost) cameras, though they are typically applied to small table-top scenes. In this paper, we seek to extend the representational power of NeRFs to efficiently model large-scale environments. While recent methods have shown some success at modelling large scenes with a single NeRF [131, 132], its representation power is fundamentally limited by model capacity when the entire environment is represented by one network, which limits scalability. Instead, our approach achieves scalable NeRF representations by decomposing large environments into a collection of small scenes, each represented by their own individually trained NeRF. Such a decomposition not only improves quality and computational efficiency [133], but also allows mapping to be distributed among a team of robots and/or human-operated cameras.

Existing methods that seek to improve the scalability of NeRF models (e.g., Block-NeRF [133]) require access to the full set of original training images over which they perform registration to estimate their global poses using a large-scale structure-from-motion (SfM) method (e.g., COLMAP [29]). Other methods [131, 132] then perform a lengthy training process to produce a joint NeRF model. Instead, we consider a setting in which a team of people and/or robots can individually capture and produce NeRF (or similar) representations of small-scale scenes, and then transfer these models. In addition to their fidelity, NeRFs provide representations that are highly memory efficient. For example, a typical scene may be captured by $300$ images, each about $0.6$ MB in size. In contrast, NeRF provides an implicit representation of the scene that takes up approximately $60$ MB, a $3\times$ reduction from the set of original images. Targeting such use cases, we propose NeRFuser (Fig. 2.8), a NeRF fusion framework for the registration and blending of pre-trained NeRFs. NeRFuser requires access only to the originally constructed NeRFs as black boxes, but not the original training images or the training poses, nor the relative transformation between the individual NeRFs.111We note that training poses are not informative for registration as they are defined with respect to individual NeRF-specific reference frames. Further, while NeRFuser can make use of training poses when available for rendering, we empirically find that images rendered at non-training poses often yield better registration results. Nonetheless, NeRFuser can register NeRFs (in terms of both pose and scale) and render images from blended NeRFs with greater quality than can be achieved by rendering according to the individual NeRFs.

NeRFuser fuses NeRFs in two steps: registration and blending. For the first step, we propose registration from re-rendering, a technique that takes advantage of the ability of modern NeRFs to synthesize high-quality views to solve for 6-DoF pose and scale (i.e. $\mathrm{SIM3}$ ) transformation among the individual NeRFs via 2D image matching. A new NeRF image quality measure distant accumulation is proposed for pose filtering. For the second step, we propose a fine-grained sample-based blending technique following inverse-distance-weighting (IDW) principle as proposed in [133]. In summary, we propose

(i) registration from NeRF re-rendering, a registration method that solves for relative scale and pose to align uncalibrated implicit representations with a new NeRF image quality measure for pose filtering; and

(ii) sample-based NeRF blending, a NeRF blending method to composite predictions at the ray sample level, resulting in image quality better than images rendered by any individual NeRF.

Code is made publicly available as a Python package.

2.3.2 Related Work

**Neural Radiance Fields ** A Neural Radiance Field (NeRF) [20] is an implicit representation of a 3D scene. It optimizes a neural network composed of MLPs to encode the scene as density and radiance fields, which can be used to synthesize novel views through volumetric rendering. Since its introduction, many follow-up works [134, 131, 135, 136, 137, 138] have improved over the original implementation. One line of improvement involves the reconstruction of large-scale NeRFs [139, 140, 133, 141, 142, 143, 144, 145, 133]. However, most of these works focus on reconstructing the entire scene with a single model. While progressive training [139, 141] and carefully designed data structures [140, 142, 143] have helped to expand the expressivity of a single model, other works [145, 133] have shown that a collection of many small models can perform better, while maintaining the same number of parameters. Our method provides a novel way to reason over many small models, combining them to improve performance.

**NeRF Registration ** NeRFs are optimized from posed images, with poses usually obtained using a structure-from-motion (SfM) method [29, 30, 146, 147, 148, 149, 150]. Because these methods are scale-agnostic, the resulting coordinate system will have an arbitrary scale specific to each NeRF. Jointly using multiple NeRFs requires NeRF Registration, i.e., solving for the relative transformation among their coordinate systems. Note that the setting is different from “NeRF Inversion” [151, 152] that estimates the 6-DoF camera pose relative to the pre-trained NeRF given an image, a technique that has been used for NeRF-based localization [153, 154, 155]. However, these tasks can potentially be handled by NeRFuser/space if formulated as NeRF-to-NeRF pose estimation problems. LENS [156] proposes using NeRF to render extra images to train a per-scene pose regression network, while NeRFuser assumes neither training poses nor training images and does not require training an extra network for each scene. Also relevant are works that jointly optimize NeRF representations along with the poses and intrinsics [157, 158, 159]. However, NeRFuser/space only uses SfM on re-rendered images, and does not modify the pre-trained NeRFs themselves. Of the few works in this emerging field of NeRF registration, most [160, 161, 162] utilize only geometric cues—extracting surface or density fields from the NeRF representations, and thus do not take full advantage of the rich radiance information. On the other hand, DReg-NeRF [163] learns to predict 3D correspondences given the input NeRFs, which may suffer from a severe performance drop for out-of-distribution scenes. Moreover, DReg-NeRF as well as nerf2nerf [160] are only capable of registering two NeRFs at a time, require known scales, and fail catastrophically for unbounded large-scale scenes. nerf2nerf [160] further requires a reasonable initialization from human annotations. In contrast, our method can register multiple NeRFs simultaneously including scale recovery, is designed to work on large-scale real-world scenes, and does not involve human annotations.

NeRF Blending Given aligned NeRFs from NeRF registration, it remains a problem as how to best merge them, i.e., blending. While image blending is a highly researched topic in computational photography [164, 165], there are a few works that discuss blending in terms of NeRFs. Nerflets [166] performs decomposition of the scene with ellipsoids and blend the NeRFs according to their covariance matrices while each NeRF has access to information from all images. Blended-NeRF [167] focuses on generative metrics rather than the reconstruction setting as in ours. More similar to our setting is Block-NeRF [133], which proposes to blend NeRFs in either image- or pixel-wise manner, similar to traditional 2D image blending techniques. In terms of blending weights, it suggests to use either inverse distance weighting (IDW) or predicted visibility where IDW is eventually adopted because of temporal consistency. Note that the assumption of our paper is that, only the trained NeRFs are provided without their training camera poses, which are required by visibility prediction as they are used to train the visibility network where training poses gives data for calculating ground truth visibility. Thus visibility prediction is not feasible in our setting.

3D Gaussian Splatting Compared to NeRF, 3DGS [168] is a more explicit 3D representation that employs a flexible set of Gaussian blobs to encode the scene geometry and radiance field. While it is out of scope to compare 3DGS with NeRF, the need for registering and jointly rendering multiple 3DGS similarly applies. Though this work targets the use of NeRFs, some of the proposed techniques may shed inspiration on solving the same problem with this scene representation, especially for the registration.

2.3.3 Methodology

In this section, we describe our NeRF registration method registration from re-rendering and our blending technique IDW-Sample. Without loss of generality, we discuss the case where two NeRFs $A$ and $B$ are involved, as extension to three or more NeRFs is straightforward, of which the implementation is included in our public code release.

2.3.3.1 Registration from NeRF Re-rendering

The first part of our framework is to estimate the relative transformations among the input NeRFs. We assume that each NeRF is trained with its own set of images, and that individual NeRFs capture different, yet overlapping, areas of a big scene. Each NeRF may have its own coordinate system that is inconsistent with others. So our goal is to find the transformation $T_{BA}\in\mathrm{SIM}(3)$ that transforms a 3D point $p_{B}$ in NeRF $B$ to its corresponding point $p_{A}$ in NeRF $A$ as $p_{A}=T_{BA}p_{B}$ .

Overview Since NeRFs are known for high quality novel view synthesis, we strategically sample a set of poses and use them as local poses to query each input NeRF to get re-rendered images. We then re-purpose off-the-shelf structure-from-motion methods [148, 147] on the union of re-rendered images. Intuitively, the SfM step tries to recover the poses of all re-rendered images in a shared coordinate system. Note that the pose of each re-rendered image in its source NeRF’s local coordinate system is sampled, hence known, so we can solve the transformation from the shared SfM coordinate system to each NeRF’s local one. The relative transformation between NeRFs can then be inferred. Detailed derivation of how $\mathrm{SIM}(3)$ transformations among coordinate systems can be recovered from $\mathrm{SE}(3)$ poses of individual camera images is presented in the supplementary material.

Sampling strategy of poses for re-rendering Though we do not assume access to the exact poses of images used in generating the input NeRFs, we do assume that these poses are pre-processed in a standardized way [169]:

(i) centered so that the mean translation vector becomes the origin;

(ii) rotated so that the mean up direction is aligned with $z$ axis;

(iii) uniformly scaled so that the largest range of pose locations along an axis is $[-1,1]$ .

Accordingly, our strategy is to uniformly sample poses mainly on the upper hemisphere (due to (ii)) of radius $1$ (due to (iii)) located at the origin (due to (i)). This pose sampling strategy is simple yet effective given no access to NeRFs’ training poses.

Distant Accumulation for Pose Filtering Some of the renderings using the poses sampled from the above procedure can be of poor quality nonetheless (Figure 2.17(a)). Although the SfM procedure can be robust to poor renderings to some extent (through its feature points extraction and matching), it is beneficial to filter them out to reduce processing time and improve accuracy. Computed from NeRF’s volumetric rendering, accumulation can be used to measure NeRF ray uncertainty [170], since low accumulation values usually correspond to under-trained pixels. However, we find that regular accumulation does not discriminate poor renderings (Figure 2.17(b)). While it does spot some under-trained pixels, the mean accumulation over the entire image is too close to $1.0$ , which is too sensitive to noise for filtering. We further observe that,

(i) when a ray is under-trained, the termination probability mass tends to spread evenly;

(ii) cameras inside or very close to an object usually bring poor renderings.

Based on the observations, we propose a new NeRF ray quality measure Distant Accumulation

[TABLE]

where $T(t),\sigma(t)$ denotes transmittance and density respectively, as introduced in [20]. Note that distant accumulation $q_{d}$ is equivalent to regular accumulation when $d=0$ . We use distant accumulation $q_{d}$ averaged over all pixels as the image quality measure, where $d$ is set to be a moderate value within $[0,1]$ (recall that the NeRF’s coordinate system is assumed to be normalized in scale; $d=0.3$ is used in Figure 2.17(c)). The idea is that,

(i) when a ray is under-trained, evenly spread termination probability results in $q_{d}$ being substantially smaller than $1.0$ ;

(ii) when a camera is inside or too close to an object, most rays will have high termination probability at a close distance, also resulting in a small $q_{d}$ .

As shown in Figure 2.17(c), our proposed measure clearly discriminates against poor renderings. We use the mean distant accumulation to filter out poor renderings before applying the SfM procedure.

Robustness to SfM errors As our proposed registration method works on top of SfM, we discuss its robustness to SfM errors, which are manifested in two ways:

(i) failure to recover poses for some images and

(ii) returning erroneous poses for some others.

For (i), to compute the transformation from the shared SfM coordinate system to a NeRF’s local one, our method only requires that the poses of as few as two re-rendered images from that NeRF are correctly recovered by SfM (required when computing the relative scale using distances of pose pairs; see Equation (13) in supplementary material). For (ii), our method recovers a list of candidate results for both scale and transformation (details in supplementary material), so it boils down to solving a constrained optimization problem to obtain the final result. Computing the mean would produce the optimal estimation under Gaussian noise assumption following maximum likelihood estimation principle. In practice, median is computed instead for more robustness to outliers.

2.3.3.2 Sampled-based NeRF Blending

NeRF blending aims to combine predictions from the input NeRFs in pursuit of high-quality novel view synthesis. There are two key questions to consider:

(i)

What to blend: At what granularity should we blend the information? 2. (ii)

How to blend: How to compute the blending weights?

Block-NeRF [133] answers (ii) with inverse-distance-weighting (IDW) and predicted visibility weighting. IDW determines the contribution of each NeRF according to $w_{i}\propto d_{i}^{-\gamma}$ , where $d_{i}$ is some notion of distance between the center of NeRF $i$ to the element in question (answered by What to blend), $\gamma\in\mathbb{R}^{+}$ is a hyper-parameter that modulates the blending rate. On the other hand, visibility weighting tries to weight the element by its visibility from the camera views during NeRF training. However, a visibility prediction network is required to be trained jointly with the NeRF and used during inference, which is a requirement that we can not assume for typical NeRFs. Despite slightly better visual evaluation metrics, due to temporal consistency issues with visibility weighting, Block-NeRF presents their demo with IDW weighting (IDW-2D, explained below).

For question (i), Block-NeRF answers it with image-wise and pixel-wise blending. For image-wise blending, all pixels in an image have the same blending weights. When combined with IDW (IDW-2D), the weights are computed based on the distance between the camera center and NeRF centers. For pixel-wise blending, each pixel has its own blending weight. When combined with IDW, the weights are computed from the distance between the expected point of ray termination for each pixel and NeRF centers. These techniques naturally inherit from image blending methods. However, as NeRF is inherently a 3D representation whose computing happens at the ray sample level, keep using 2D approaches deems sub-optimal.

In this paper, recognizing the fact that the color of a pixel is computed using samples along the ray in NeRF during volumetric rendering, we answer (i) by proposing a novel sample-based blending method. On the other hand, although IDW may not be the best measure of quality, it is nevertheless a good approximation under the assumption of surrounding camera view distribution. Thus we still follow the IDW principle to answer (ii) but adapt it by calculating the distance between the ray sample and NeRF centers, achieving a finer blending result in principle. Moreover, the weighting is irrespective of depth quality, which can be an issue for NeRF and thus the major downside of IDW-3D. Since we use IDW with sample-based blending, we coin our method IDW-Sample. Figure 2.18 provides an illustration of the comparison between different methods.

2.3.3.3 Sample-wise Blending with IDW

During NeRF’s volumetric rendering stage, a pixel’s color is computed using samples along the ray. Recognizing this fact, we propose a sample-wise blending method that calculates the blending weights for each ray sample using IDW. We show that the original volumetric rendering methodology can be easily extended to take advantage of these new sample-wise blending weights, resulting in our proposed IDW-Sample strategy.

Proximity Test The first question to ask before blending is always whether it needs blending, as NeRFs can only render with high quality within their effective range. Using distant NeRFs for rendering, whose quality is poor, can only be harmful. Hence, we introduce a Proximity Test to decide which NeRFs are relevant for the given rendering pose.

As we located each NeRF in a shared coordinate frame from the NeRF registration phase, we can calculate the distances between the given rendering camera position to all NeRFs and sort them in an increasing order $d_{0}<d_{1}<\dots<d_{N-1}$ , assuming there are $N$ NeRFs to blend. We then divide them by the closest distance $d_{0}$ to have a set of distance ratios $(1=\frac{d_{0}}{d_{0}}<\frac{d_{1}}{d_{0}}<\dots<\frac{d_{N-1}}{d_{0}})$ . A threshold $\tau$ is set to filter out NeRFs with distance ratio larger than it (meaning that they are too far away). Note that this threshold $\tau$ can be set in an adaptive way, for example keeping only NeRFs with $k$ -smallest or $k$ -th quantile $\tau$ for blending. So after the filtering, the computational complexity for NeRF blending is $\mathcal{O}(k)$ .

Merge Ray Samples Consider a pixel to be rendered, which gets unprojected into a ray. Since ray samples are separately proposed according to the density field of each source NeRF, we need to merge them into a single set. Given samples $\{(t^{A}_{k},\delta^{A}_{k})\}_{k}$ and $\{(t^{B}_{k},\delta^{B}_{k})\}_{k}$ proposed from NeRF $A$ and NeRF $B$ , respectively, we merge them into a single set of ray samples $\{(\bar{t}_{k},\bar{\delta}_{k})\}_{k}$ by taking the sample location $t$ and length $\delta$ , as illustrated by Figure 1 in the supplementary material. We update the termination probability and color of each new sample in the merged set for each source NeRF. Given a ray sample proposed by a NeRF, we assume its termination probability mass is uniformly distributed over its length, while its color is the same for any point within coverage.

Blending Process We use IDW to compute the blending weight for each sample. Specifically, let $\bm{x}_{i}$ be the origin of $\textrm{NeRF}_{i}$ for $i\in\{A,B\}$ , $\bm{o}$ be the camera’s optical center, $\bm{r}=(\bm{o},\bm{d})$ be the ray corresponding to pixel $j$ to be rendered, and $(\bar{t}_{k},\bar{\delta}_{k})$ be a ray sample from the merged samples set, with updated termination probability mass $\bar{p}_{i,k}$ and color $\bar{\bm{c}}_{i,k}$ for each $\textrm{NeRF}_{i}$ . We compute its blending weight as $w_{i,k}\propto{}{d_{i,k}}^{-\gamma}$ , where $d_{i,k}=\|\bm{x}_{i}-(\bm{o}+\bar{t}_{k}\bm{d})\|_{2}$ . The blended pixel $j$ is

[TABLE]

Weights $w_{i,k}$ are normalized following two steps in order:

(i) $\sum_{i}w_{i,k}=1,\;\forall k$ ; and

(ii) $\sum_{k}\sum_{i}w_{i,k}\bar{p}_{i,k}=1$ .

Step (i) means that our method does not change the relative weighting of samples along a given ray, which is already dictated by the termination probability. Step (ii) ensures that the rendered pixel has a valid color.

2.3.4 Experiments

In this section, we describe our registration and blending experiments on our self-collected object-centric indoor dataset as well as two public benchmarks: ScanNet [171] and MissionBay [133]. While our framework is agnostic to base NeRF implementation, in our experiments, we use NeRFacto [169] as the base NeRF model. All NeRF training hyper-parameters follow the default settings except that camera optimization is turned off to avoid affecting registration evaluation. Detailed settings of the datasets can be found in supplementary results.

2.3.4.1 Datasets

Object-Centric Indoor Scenes

We created a dataset consisting of three indoor scenes, using an iPhone $13$ mini in video mode. Each scene consists of three video clips—we choose two objects in each scene, and collect two (overlapping) video sequences that focus on each object. We collect a third (test) sequence that observes the entire scene as the test set. We test NeRFuser on this dataset and report results of both registration and blending in Section 2.3.4.2.

ScanNet Dataset

The ScanNet dataset provides a total of $1513$ RGB-D scenes with annotated camera poses, from which we use the first $218$ scenes. We downsample the frames so that roughly $200$ posed RGB-D images are kept from each scene. We then split the images into three sets: two for training NeRFs and one for testing. We test NeRF registration on this dataset and compare with point-cloud registration in Section 2.3.4.3.

Mission Bay Dataset

To further test the rendering quality of different blending methods, we run experiments on the Mission Bay dataset from Block-NeRF [133], which features a street scene from a single capture. We report results in Section 2.3.4.4.

2.3.4.2 NeRF Fusion (Registration plus Blending)

We test NeRFuser including both NeRF registration and NeRF blending on Object-Centric Indoor Scenes. We report numbers averaged over test images of all three scenes from our dataset in Table 2.9. Detailed experimental setup can be found in supplementary materials.

2.3.4.3 NeRF Registration

To further test the registration performance on a large-scale dataset, we use the ScanNet dataset [171] as prepared according to Section 2.19. We repeat the same registration procedure as in Section 2.3.4.2, except that $60$ hemispheric poses are sampled instead of $30$ . During experiments, we notice failure cases due to NaN or outlier values. To report more meaningful numbers, we treat cases that meet any of the following conditions as failure:

(i) is NaN or

(ii) $r_{\textrm{err}}>5^{\circ}$ or

(iii) $t_{\textrm{err}}>0.2$ or

(iv) $s_{\textrm{err}}>0.1$ .

We also compare our method against various point-cloud registration (PCR) baselines using both

(i) point-clouds extracted from NeRFs and

(ii) point-clouds fused from ground-truth posed RGB-D images.

We report in Table 2.10 the results of our registration method and various PCR baselines averaged over all successfully registered scenes, as well as the success rate. Detailed experimental results can be found in supplementary results.

2.3.4.4 NeRF Blending

To further test our blending performance, we use the outdoor Mission Bay dataset as described in Section 2.19, with ground-truth transformations. We set the distance test ratio $\tau=1.2$ , and the blending rate $\gamma=10$ . Quantitative results averaged over test images of all scenes are reported in Table 2.11. Qualitative results are visualized in Figure 2.19.

2.3.4.5 Ablation Studies

Ablation of distant accumulation for filtering poses in NeRF registration We ran NeRF registration experiments using different thresholds of the proposed mean distant accumulation for filtering poses. As shown in Figure 2.20, distant accumulation-based filtering removes mostly bad images when compared to the regular accumulation, leading to a significant decrease in registration time, while yielding similar registration accuracy.

Ablation on re-rendering poses for NeRF registration

We study the registration performance on ScanNet dataset w.r.t. the number of sampled poses. To account for the fact that each scene may be of a different scale, we introduce $\rho$ as the ratio of the number of sampled poses over the number of training views. We geometrically sample $\rho\in[0.167,1.3]$ , and generate the hemispheric poses accordingly. We evaluate the performance of NeRF registration w.r.t. $\rho$ averaged over all ScanNet scenes. In addition, we include 2 more settings.

(i) training poses only: instead of hemispheric poses, use NeRF’s training poses for re-rendering;

(ii) hemispheric $+$ training poses: use NeRF’s training poses together with hemispherically sampled ones ( $\rho=0.3$ ) for re-rendering.

Results are reported in Figure 2.21. With more sampled poses, the registration errors go down while success rate improves (green curve). Using additional hemispheric poses besides the training ones also proves helpful (orange line vs. blue line). Interestingly, with a large enough ratio $\rho$ , registration with hemispherically sampled poses outperforms training poses when using the same number or fewer poses in total. It shows that it is beneficial to have a larger spatial span of re-rendering poses for registration.

Ablation of $\gamma$ in IDW-based blending

We study the effect of blending rate $\gamma$ in IDW-based blending on Object-Centric Indoor Scenes. Specifically, we use ground-truth transformations and set distance test ratio $\tau=1.8$ . We geometrically sample $\gamma$ in $[10^{-2},10^{3}]$ . For each sampled $\gamma$ , we blend NeRFs with all IDW-based methods and report the results averaged over test images of all 3 scenes from Object-Centric Indoor Scenes. Since Nearest and NeRF are not affected by $\gamma$ , we draw dotted horizontal lines for comparison. The results are shown in Figure 2.22. For all blending methods, quality initially increases with $\gamma$ , but then decreases as $\gamma$ increases further. In $\gamma\rightarrow{}0$ case, all IDW-based methods become the same as using the mean image. In $\gamma\rightarrow{}\infty$ case, IDW-2D becomes the same as Nearest, while IDW-Sample becomes analogous to KiloNeRF [145]. For details, please see the supplementary. We find the optimal $\gamma$ in between the extremes for any IDW-based method. Moreover, our proposed IDW-Sample almost always performs the best for any given $\gamma$ .

2.3.5 Conclusion

We have introduced NeRFuser/space, a NeRF fusion pipeline that registers and blends arbitrary input NeRFs, extending a standard image-based processing pipeline to treat NeRFs as input data. To address NeRF registration, we propose registration from re-rendering, taking advantage of NeRFs’ ability to synthesize high quality novel views with distant accumulation,including an effective image quality measure distant accumulation for pose filtering. To address NeRF blending, we propose IDW-Sample, leveraging the ray sampling nature in NeRF volumetric rendering. We believe this tool will help towards the proliferation of implicit representations as raw data for future 3D robotic applications.

Chapter 3 Embodied Spatial Reasoning with Hybrid Systems

Once scene representations are constructed, robotic agents must reason about their environment based on natural language instructions to execute appropriate actions. While earlier systems were limited to structured languages [175], the advent of Large Language Models (LLMs) [176] has significantly enhanced the ability to process natural language. This breakthrough has created new opportunities for robots to perform complex spatial reasoning directly from human commands. However, emerging research suggests that LLMs primarily exhibit ”System 1” thinking—a fast, automatic, and intuitive mode of cognition [177]. This stands in contrast to the slow, deliberate, and conscious ”System 2” thinking, which is essential for solving the complex, long-horizon tasks prevalent in robotics, as described in foundational psychology studies [178]. My own work [25] corroborates this limitation. We systematically evaluated the navigation and mapping capabilities of several top-performing LLMs using data from text-based game interactions. Our results revealed a significant performance gap between even the leading model, GPT-4, and human benchmarks. This indicates that direct generation from LLMs reflects ”System 1” thinking, struggling with compositional reasoning tasks such as path-finding or destination identification. To bridge this gap and enable ”System 2” thinking, I have focused on developing hybrid systems that augment LLMs with external tools and structured processes. For the task of 3D referring resolution—which requires understanding spatial relationships from natural language expressions—I proposed Transcrib3D [23]. Recognizing the scarcity of large-scale 3D training data, Transcrib3D diverges from methods that operate directly on 3D representations [179, 180]. Instead, it transcribes 3D scenes into textual descriptions. This transformation allows the LLM to offload complex spatial calculations to external tools, such as code execution, while focusing on high-level decision-making. This hybrid approach achieved state-of-the-art results on 3D referring resolution benchmarks. Long-horizon planning presents another significant challenge for purely generative LLMs, which often struggle with maintaining context and consistency over extended interactions [181]. This capability is critical for robotics, where tasks typically involve executing a long sequence of actions. To address this, my work on Statler [27] introduces the concept of an explicit state for LLM-based reasoning. In this framework, the state serves as a concise summary of the task’s history, which is iteratively updated by one LLM based on the previous state and new observations. A second LLM then uses this state to plan the next action, effectively grounding its decisions in the full context of the task. This two-LLM, state-centric system demonstrably outperforms single-LLM inference, even those enhanced with prompting techniques like Chain-of-Thought [182]. In summary, these works demonstrate a powerful paradigm: using language as a unified interface for both scene representation and task specification. This approach reduces the complexity of raw 3D scenes and enables joint planning directly within the language space, unlocking new possibilities for human-robot interaction. However, this strategy is not without its trade-offs. On the representation side, transcribing 3D scenes into language is a form of lossy compression, potentially discarding geometric details crucial for certain tasks. On the reasoning side, even with architectural improvements, LLMs continue to face limitations in long-term state tracking and compositional logic—skills that are foundational for robust robotic operation. Furthermore, human cognition for embodied tasks relies on more than just abstract language [183, 184, 185, 186, 187]. The path forward likely involves a co-evolution of both representations and model capabilities. I envision a future with more expressive, multi-modal latent representations that retain nuanced information while remaining language-decodable, coupled with advanced models capable of seamlessly integrating both ”System 1” and ”System 2” thinking from multi-modal inputs.

3.1 Benchmarking Mapping and Navigation Capabilities of LLMs: MANGO

Large language models such as ChatGPT and GPT-4 have recently achieved astonishing performance on a variety of natural language processing tasks. In this paper, we propose MANGO, a benchmark to evaluate their ability to perform text-based mapping and navigation. Our benchmark includes $53$ mazes taken from a suite of textgames: each maze is paired with a walkthrough that visits every location but does not cover all possible paths. The task is question-answering: for each maze, a large language model reads the walkthrough and answers hundreds of mapping and navigation questions such as “How should you go to Attic from West of House?” and “Where are we if we go north and east from Cellar?”. Although these questions are easy for humans, it turns out that even GPT-4, the best-to-date language model, performs poorly when answering them. Further, our experiments suggest that a strong mapping and navigation ability would benefit the performance of large language models on relevant downstream tasks, such as playing textgames. Our MANGO benchmark will facilitate future research on methods that improve the mapping and navigation capabilities of LLMs. We host our leaderboard, data, code, and evaluation program at http://mango.ttic.edu/.

3.1.1 Introduction

Mapping and navigation are fundamental abilities of human intelligence [188, 189]. Humans are able to construct maps—in their minds [189] or on physical media like paper—as they explore unknown environments. Following these maps, humans can navigate through complex environments [188, 190, 191], making informed decisions, and interact with their surroundings. Such abilities empower humans to explore, adapt, and thrive in diverse environments. An example is remote (e.g., deep-sea) exploration for which humans have drawn upon their intuition to develop algorithms that enable robots to autonomously navigate and map their surroundings based only on onboard sensing.

Do large language models (LLMs) possess such abilities? In this paper, we investigate this research question by creating a benchmark and evaluating several widely used LLMs. Our MANGO benchmark is the first to measure the mapping and navigation abilities of LLMs. It includes $53$ complex mazes, such as the one visualized in Figure 3.1. It pairs each maze with hundreds of destination-finding questions (e.g., “Where will you be if you go north, north, and then up from Altar?”) and route-finding questions (e.g., “How do you reach Dome Room from Altar?”). For each maze, the language model has to answer these questions after reading a walkthrough of the maze. Many questions involve possible routes that are not traced during the walkthrough, making the benchmark challenging. In our experiments, GPT-4 only correctly answered half of the route-finding questions, performing disastrously on the difficult questions (e.g., those involving long and unseen routes). MANGO will facilitate future research in improving the mapping and navigation abilities of LLMs.

Another contribution of MANGO is to draw a novel connection between natural language processing and robotics. There has been significant interest in employing LLMs to endow intelligent agents (including robots) with complex reasoning [192]. Aligning with this interest, MANGO enables the investigation of the LLMs’ capabilities in simultaneous localization and mapping (SLAM) within text-based worlds. Focusing on this aspect, our work stands out and complements previous SLAM-related research, which predominantly relies on richer sensory inputs (e.g., vision and LiDAR).

3.1.2 Related Work

Over the past few years, the field of natural language processing has experienced remarkable advancements with the emergence of large language models. This progress has spurred a multitude of research endeavors that propose benchmarks challenging the limits of these models. Those benchmarks assess the capacities of LLMs in linguistics [193, 194], reading comprehension [195, 196], commonsense reasoning [197, 198, 199, 200], arithmetic reasoning [201, 202, 203], and knowledge memorization and understanding [204, 205, 206, 207, 208, 209]. Recent models have achieved remarkable performance not only on these benchmarks, but also across a diversity of human-oriented academic and professional exams [210] as well as general tasks [211]. Our benchmark presents a unique challenge to large language models, evaluating their capacity to acquire spatial knowledge about new environments and answering complex navigation questions; it is a dimension orthogonal to the aforementioned reasoning abilities.

The advances of LLMs have sparked a recent wave of endeavors that integrate these models into embodied agents [212, 192, 213, 214]. Generally, they utilize language models as a means to understand human instructions and plan executable actions [215, 216, 217, 26]. This includes instructions related to object manipulation and tool operation [218, 219] as well as localization and navigation [220, 221, 222, 223]. Our MANGO benchmark aligns with the growing trend to deploy LLMs in embodied agents and provides a comprehensive investigation of their capacities in mapping and navigation. Our benchmark operates in text-based environments, distinguishing itself from previous benchmarks [224, 225, 226] that allow agents to utilize visual signals. This “text-only” design enables us to conduct controlled experiments that investigate the capacity of language models to acquire knowledge about environments solely from textual inputs and answer navigation questions based on that knowledge. It complements the existing benchmark and methodological research in vision-language navigation [227, 228, 229, 230, 231, 232]. Our work is related to recent studies that demonstrate the emergence of maps with learned neural representations as a consequence of navigation [233, 234] with the key distinction that our agents are provided with textual descriptions of their environments.

Given our focus on mapping and navigation, it is worth noting the work on simultaneous localization and mapping (SLAM), a classic problem in which a mobile agent (e.g., a robot or hand-held camera) is tasked with mapping an a priori unknown environment while concurrently using its estimated map to localize itself in the environment [235, 129]. Particularly relevant are the methods that maintain spatial-semantic maps of the environments based on natural language descriptions [236, 237], however they rely on non-linguistic observations (e.g., vision) to ground these descriptions.

3.1.3 A Benchmark for Text-Based Mapping and Navigation

Our MANGO benchmark measures the mapping and navigation capabilities of LLMs. It leverages a suite of text-based adventure games that offer expert-designed complex environments but only require simple actions. Figure 3.1 is an example: it was drawn according to the first 70 steps of the walkthrough of Zork-I, which can be found in LABEL:lst:walkthrough. This map is imperfect: the annotator had to draw the only Kitchen twice to avoid a cluttered visualization; the Living Room was incorrectly placed outside the House. However, equipped with this map, one could correctly answer questions about any route in the maze such as “How do you reach Dome Room from Altar?” and “Where will you be if you go north, north, and up from Altar?”. The walkthrough has not traced a route from Altar to Dome Room, but humans possess the remarkable capacity to plan a route by identifying the three individual steps—which the walkthrough has covered—from Dome Room to Altar and retracing those steps. MANGO tests whether a large language model can perform the same kind of reasoning. Particularly, when evaluating a language model, we first let it read a walkthrough like LABEL:lst:walkthrough and then ask it questions like those in LABEL:lst:dfq and LABEL:lst:rfq. A question like LABEL:lst:dfq is a destination-finding (DF) question, and a question like LABEL:lst:rfq is a route-finding (RF) question. Users of MANGO have the flexibility to phrase the DF and RF questions in their own ways: as shown in LABEL:lst:dfqraw and LABEL:lst:rfqraw, we provide the skeletons of these questions, which users can plug into their own templates.

3.1.3.1 Maze Collection: From Game Walkthroughs to Mazes

Our mazes are taken from the textgames in the Jericho game suite [238]. The main release of Jericho includes $57$ popular textgames as well as a program that can generate walkthroughs for 56 of them. The original walkthrough of a game is a list of actions (such as east, north, and open door) that one could execute to efficiently complete the game. We enhanced each walkthrough by executing the sequence of actions and augmenting each step with the new observation (i.e., the text feedback that the game engine provides after the action is executed). Unless explicitly specified, the word “walkthrough” refers to the enhanced, but not original, walkthroughs (such as LABEL:lst:walkthrough) throughout the paper.

In a walkthrough, not every action triggers a location change: it may update the inventory (such as take lamp and drop pen) or time (such as wait). For each game, we read the walkthrough, labeled the actions (such as east and up) that change the locations, and made note of the names of the locations (such as Temple and Altar). This annotation is nontrivial and can not be automated. We had to pay extra attention to appropriately handle the tricky cases including: ① the name of a location may be mentioned in a rich, but distracting context (e.g., the context may have ten paragraphs and hundreds of words with the name briefly mentioned in the middle); ② a location may be visited multiple times, so we need to assign the same name to all its mentions; ③ different locations may be referred to with the same name in the textual feedback, so we need to rename them in a sensible way.

The location name resolution results in a maze for each game. Three of the games have no location change, and so we left them out, resulting in $53$ mazes. We store each maze as a directed graph: each node is a named location (e.g., Altar); each directed edge is a movement (e.g., north); and each node-edge-node combination is a location-changing step that was followed in the walkthrough. Note that a graph may be cyclic since the walkthrough may trace back-and-forth between locations (e.g., Temple and Egyptian Room in Figure 3.1).

3.1.3.2 Generation of Question Skeletons: Traversing Mazes and Imputing Edges

To generate DF and RF skeletons for a maze, a naive approach is to perform brute-force traversal. First, we collect all the possible S-P-D tuples, where S and D are locations and P is a simple path from S to D. A simple path is a directed path that does not visit any location more than once. This “simple” restriction ensures that we will have a finite number of S-P-D tuples. LABEL:lst:route is a simple path of 3 S-A-D edges from Altar to Dome Room. Each unique S-P-D tuple gives a unique DF skeleton: e.g., LABEL:lst:dfqraw is obtained from LABEL:lst:route. Each unique S-P-D tuple gives an RF skeleton as well, such as LABEL:lst:rfqraw obtained from LABEL:lst:route. However, the same RF skeleton may be obtained from other tuples since there may be multiple possible simple paths between the same pair of locations S and D. As a consequence, we may end up with fewer RF questions than DF questions for a given maze.

The particular DF and RF questions in LABEL:lst:dfq and LABEL:lst:rfq are challenging to large language models, since they involve actions—such as going north from Altar to Temple—that are not covered in the walkthrough. Answering such hard questions requires a deeper understanding of the spatial relationships between locations. However, also because these steps are not in the walkthrough, the skeletons in LABEL:lst:dfqraw and LABEL:lst:rfqraw can not be obtained through a naive traversal of the directed graph in Figure 3.1. That is, we have to traverse an extended graph that includes imputed edges. An imputed edge denotes a valid step that is not explicitly mentioned in the walkthrough, such as going north from Altar to Temple (i.e., Altar-north-Temple). Most mentioned edges involve directional moves (e.g., up, east), so reversing them is a straightforward way to impute new edges. We manually examined other edges: for some of them, we proposed intuitive reverses (such as exit for enter); for the others (e.g., pray), no reverse could be found. We then examined the imputed edges through real game play and discarded those failing to cause the expected location changes.

After extending all the mazes in our benchmark, we collected 21046 DF skeletons and 14698 RF skeletons by traversing the extended graphs. Being evaluated on a maze, the LLM may not be able to consume the entire walkthrough in its context window. That is, we may only feed it an appropriate prefix of the walkthrough (e.g., the first $70$ steps for Zork-I as shown in LABEL:lst:walkthrough), leaving some of the DF and RF skeletons unanswerable given that prefix. Therefore, our benchmark provides the ANSWERABLE label (an integer) for each skeleton such that this skeleton is only answerable if the maximum STEP NUM in that prefix (e.g., $70$ in LABEL:lst:walkthrough) is greater than or equal to its ANSWERABLE label. Furthermore, given a walkthrough prefix, an answerable skeleton may be easy or hard, depending on whether it involves edges that are not covered in the prefix. Precisely, a DF skeleton is considered to be easy if all the S-A-D edges in its corresponding simple path are covered in the walkthrough prefix; an RF skeleton is easy if the shortest simple path from its starting location to its destination only involves the S-A-D steps covered in the prefix. When a longer walkthrough prefix is used, more answerable questions tend to become easy. Our benchmark provides the EASY label (also an integer) for each skeleton: a skeleton is easy if the maximum STEP NUM in the walkthrough prefix is no smaller than its EASY label; otherwise, it is a hard skeleton.

3.1.3.3 Evaluation Program

The evaluation program in our benchmark implements a range of evaluation and analysis methods. Reading the model-generated answers, it can return a set of evaluation scores together with rich analysis. In this section, we introduce the most important scores used in our main experiments.

For DF questions, the most straightforward evaluation is the success rate: i.e., the fraction of questions that the language model answers correctly. What answers will be considered to be correct? A strict criteria is that the model answer is correct if and only if it exactly matches the ground-truth location name. However, due to the variability of natural language, a correct answer may not exactly match the ground-truth. For example, the model may yield The House or That House when the ground-truth location name is just House. To account for such cases, we generalize the success rate to allow partial matches. Given a model answer $\hat{A}$ and the ground-truth answer $A$ , we compute their (character-level) edit-distance $d$ and define a correctness score $c\mathrel{\stackrel{{\scriptstyle\textnormal{\tiny def}}}{{=}}}1-d/\ell$ where $\ell$ is the length of the longer answer. The score is $\in[0,1]$ : when the answer exactly matches the ground-truth, we have $c=1$ ; if they have no character overlap at all, then $c=0$ . We then define the success rate to be the sum of the correctness scores over all the questions, divided by the number of questions.

For RF questions, the main metric is still the success rate, but the definition of “success” is different from that for DF questions. Note that an answer to an RF question is a sequence of moves. We consider an answer to be correct if and only if it can reach the destination after our evaluation program executes it in the maze. A correct answer to an RF question may not be a good path: it doesn’t have to be the shortest; it doesn’t even have to be a simple path. It is possible that an LLM-generated move is meaningful but doesn’t exactly match any valid move in the graph: e.g., the LLM may give walk south, which means the same as south. Therefore, when executing a model-generated move, our evaluation program will select the closest (i.e., smallest edit-distance) valid move.

3.1.4 Experiments

In this section, we present the results our evaluation of several widely used LLMs.

3.1.4.1 Experiment Setup

The evaluated models are: GPT-3.5-turbo [239, 240, 241], GPT-4 [210], Claude-instant-1 [242], Claude-2 [243], Llama-2 with 13B parameters [244], and RWKV with 14B parameters [245]. For GPTs and Claudes, we used the prompt templates in LABEL:lst:maptemplate and LABEL:lst:navtemplate, converting the DF and RF skeletons like LABEL:lst:dfqraw and LABEL:lst:rfqraw into LLM-friendly questions like LABEL:lst:dfq and LABEL:lst:rfq. The templates were carefully designed and examined through pilot experiments, in order to ensure that we do not underestimate the models on our benchmark. In our templates, each question starts with a list of legal actions, followed by a list of reachable locations; these lists help mitigate the hallucination of language models. The templates ask the model to spell out the entire trajectory including all the intermediate locations. This design is inspired by Chain-of-Thought prompting [246]: eliciting an LLM to give its entire reasoning process tends to improve its overall performance on downstream tasks. In addition, it allows us to conduct a deeper evaluation and analysis, such as the reasoning accuracies of the models. Note that our templates request the model to form its answer as a list of Python dictionaries with specific key names. We found that this restriction encourages the model to generate structured answers—which are easy to parse and analyze—as well as improves its performance. For Llama-2 and RWKV, we made moderate revisions to the prompts in order to generate well-structured answers as well as optimize for their performance.

For GPT-3.5, we experimented with the 4K version, which can consume 4096 tokens in its context window. This context limit restricts the length of the walkthrough that it can read, and the number of DF and RF questions that it can answer. For GPT-4, Claude-1 and Claude-2, we used the same walkthrough prefixes and questions as GPT-3.5 for a fair comparison. we used the same walkthrough prefixes and questions as GPT-3.5 for a fair comparison. Llama-2 has a $4096$ context window as well. But its tokenizer is different from GPTs’ so we evaluated it on a slightly different set of questions. RWKV is capable of handling infinite context. For each maze, we experimented it with the $70$ -step prefix of the walkthrough so that its set of answerable questions includes all the questions answered by all the other models. We also evaluated Llama-2 and RWKV in a simplified setting, where the observation at each step of the walkthrough only includes the location name but nothing else. For example, at STEP 1 of the simplified LABEL:lst:walkthrough, OBSERVATION only has South of House and everything else (i.e., Your are...) is omitted. We refer to Llama-2 and RWKV with the simplified walkthroughs as Llama-2-S and RWKV-S, respectively.

3.1.4.2 Main Results

Figure 3.2 presents the success rates of all models. For each kind of question (i.e., DF or RF), we show the results on easy and hard questions separately. As we can see, GPT-4 significantly outperforms all the other models on all kinds of questions. However, it only correctly answers half of the RF questions, far worse than what a human could do: in our experiments, humans perfectly answered a randomly sampled set of questions.

Note that each model was evaluated on its specific set of questions determined by the length and format of the walkthrough it read. To be fair, we also compared each pair of models on the intersection of the questions that they answered. The results are presented in Table 3.1: as we can see, GPT-4 and GPT-3.5 consistently outperform the other models and GPT-4 significantly outperforms GPT-3.5.

3.1.4.3 Analysis of GPTs

Now, we focus our analysis on the best model, namely GPT-4. Particularly, we would like to understand the improvements of GPT-4 over GPT-3.5 as well as its current bottlenecks, shedding light on opportunities for future improvements.

By analyzing the errors of GPT-3.5 and GPT-4, we discovered that these models occasionally hallucinate nonexistent locations or edges. Once they made such a mistake at any step of reasoning, they would be misled and deviate from the correct path towards the correct answer. Furthermore, we found that the mazes are not equally difficult for the models. Figure 3.3 displays the success rates of the GPT models broken down into their per-game results. In Figure 3.3, each dot is a maze: the $x$ -axis coefficient is the performance of GPT-3.5 on this maze while the $y$ -axis is that of GPT-4.

As we can see, the success rates of the models vary across different mazes as well as across different kinds of questions. GPT-4 consistently outperforms GPT-3.5 across nearly all the mazes. The only exception is Seastalker: there are too few hard DF questions for this maze, and thus it is a noisy outlier. Apparently, both GPTs tend to work better on easy questions than on hard questions. However, some mazes seem to be particularly challenging to GPT-4, such as Zenon and OMNIQuest.

What makes those mazes challenging? We collected some important statistics about the mazes and analyzed their correlation with the success rates of the models. To understand the success rates on the easy questions, it is interesting to investigate:

•

number of locations (# locations) and number of explicit edges (# exp edges). They directly measure the size of a maze, which may be a key indicator of its difficulty.

•

number of potentially confusingly named locations (# conf locations). Recall from Paragraph 3.1.3.1 that different locations may have similar or related names, which may confuse a language model. To quantify the number of confusingly named locations, we compute a confusion score for each location, and then sum the scores across all the locations. For a location name $A$ , the confusion score is defined to be the maximum word-level edit distance between $A$ and any other location name in the maze, divided by the maximum word-level length of the pair of location names being compared. Technically, it is $\max_{B}\left(\text{edit-distance}(A,B)/\max(\text{len}(A),\text{len}(B))\right)$ , and it is $\in[0,1]$ .

•

average length of the easy simple paths (avg len easy), i.e., the simple paths that do not include any imputed edges. A longer path may tend to be more difficult for models.

•

average number of words in the scene descriptions (avg len scene). The walkthroughs exhibit very diverse styles: for some of them, the text description for each scene is very concise and the name of each location is appropriately highlighted; for others, each description may be verbose (e.g., ten paragraphs and hundreds of words) and the location names are often not obvious from the contexts. It is useful to analyze whether a long scene description poses a challenge for the models.

In order to understand the models’ performance on hard questions, we analyze the effects of the variables above (except avg len easy) as well as the following:

•

number of imputed edges (# imp edges);

•

average length of hard—i.e., involving imputed edges—simple routes (avg len hard);

•

average number of imputed edges in the hard simple routes (avg # imp in hard).

We use regression analysis to understand the effects of these variables on model performance. In particular, for each model on each type of question (DF or RF, easy or hard), we ran single-variable linear regression to understand how the success rate varies with each variable.

•

on easy questions, GPTs are significantly influenced by the size of the maze, the confusion level of location name, and the path length. The $p$ -values are extremely small.

•

on easy questions, the average length of the scene descriptions does not have a significant effect on the performance of GPT-3.5, but interestingly has a significant positive effect on GPT-4’s performance. It is perhaps because GPT-4 possesses a strong capability to understand texts and can leverage the rich contexts in each description. This allows it to better distinguish confusingly named locations and establish a better internal representation of the map. However, this richness seems to confuse GPT-3.5 and impede its ability to create a good internal representation of the maze, possibly due to GPT-3.5’s weaker overall language understanding capabilities.

•

on hard questions, the variables do not significantly affect the performance of GPT-3.5. Note that GPT-3.5 yields very low success rates when answering the hard DF and RF questions. GPT-3.5 seems to struggle when it has to reason about a path with any number of imputed edges, making the effect of other factors less important to its performance.

•

on hard questions, GPT-4 exhibits a stronger ability to handle paths with imputed edges, compared to GPT-3.5. However, it will experience difficulties when the challenge of inferring imputed edges is amplified by other factors such as the size of the maze or the length of the path. As a result, nearly all the variables have significant effects on GPT-4.

The results of our regression analysis are consistent with the plots in Figure 3.3. For example, both Zenon and OMNIQuest stay at the lower-left corners of the hard-question plots in Figure 3.3 since their mazes are particularly challenging to both GPT-3.5 and GPT-4: they both have substantially larger numbers of imputed edges than the other mazes; OMNIQuest also has more locations. Wishbringer and Lost Pig have several imputed edges, but their paths are short, so they fall in the upper-left corners of the hard-question plots in Figure 3.3.

3.1.4.4 Does Mapping and Navigation Ability Matter in Downstream Tasks?

Now we present a case study showing that a strong mapping and navigation ability of an LLM would benefit it in downstream tasks. In particular, we selected 284 minigames in the Jericho game suite, and investigated how the map knowledge may improve the performance of an LLM in playing these minigames. Each minigame is a selected step from one of 53 textgames; the selection criterion is that the best action to take at this step is a movement. Through evaluating an LLM on a minigame, we synthesize a scenario in which the LLM has to figure out the best action to take given its previous actions and observations (i.e., the prefix of walkthrough up to the current step). Note that this task is different and more challenging than answering the DF and RF questions: the LLM is not explicitly given a route (as in DF questions) or a destination (as in RF questions), but has to spontaneously figure out which action may contribute to its long-term goal.

For this task, we evaluated GPT-3.5 and GPT-4. For each model, we tried two settings: the first is to condition the LLM on the walkthrough like LABEL:lst:walkthrough; the second is to include in the prompt the information about the nearby locations, and an example of the full prompt is given in LABEL:lst:playgameprompt. The information about nearby locations is the ground-truth information that the LLM, in principle, should have learned from the walkthrough prefix. If the LLM had a perfect mapping and navigation ability, it would be able to perfectly spell it out and use that information to guide its decision making. Figure 3.4 presents the results of this experiment. GPT-4 significantly outperforms GPT-3.5 in playing these minigames, consistent with their relative performance when answering the DF and RF questions of our MANGO benchmark. For each of the GPT models, having access to nearby location information significantly improves its performance, demonstrating that a strong mapping and navigation ability is essential to succeeding at relevant downstream tasks.

3.1.5 Conclusion

We present MANGO, a benchmark that evaluates the mapping and navigation abilities of large language models. Our benchmark covers a diversity of mazes as well as a variety of evaluation and analysis programs, offering a comprehensive testbed in a great breadth and depth. In our experiments, the current best model still performs poorly on the benchmark, with a sharp degradation on the more difficult questions. We release our benchmark—along with the source code for data generation and evaluation—to track the advances of the mapping and navigation capabilities of future LLMs as well as to facilitate future research in related areas.

3.2 3D Referring Expression Resolution through LLMs with iterative reasoning: Transcrib3D

If robots are to work effectively alongside people, they must be able to interpret natural language references to objects in their 3D environment. Understanding 3D referring expressions is challenging—it requires the ability to both parse the 3D structure of the scene and correctly ground free-form language in the presence of distraction and clutter. We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models (LLMs). Transcrib3D uses text as the unifying medium, which allows us to sidestep the need to learn shared representations connecting multi-modal inputs, which would require massive amounts of annotated 3D data. As a demonstration of its effectiveness, Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks, with a great leap in performance from previous multi-modality baselines. To improve upon zero-shot performance and facilitate local deployment on edge computers and robots, we propose self-correction for fine-tuning that trains smaller models, resulting in performance close to that of large models. We show that our method enables a real robot to perform pick-and-place tasks given queries that contain challenging referring expressions. Code is available at https://ripl.github.io/Transcrib3D.

3.2.1 Introduction

Comprehending a natural language expression that mentions an object within a given environment is a routine activity for humans. It often occurs as part of a question (e.g., “Whose jacket is hanging on the black chair?”) or an instruction (e.g., “Pass me the smaller yellow mug.”). This capability is important for embodied agents that work with humans to accomplish tasks. While humans excel in this task with over 90% accuracy on existing benchmarks [247], contemporary methods only achieve mediocre accuracy. They typically rely on supervised learning, e.g., training a Transformer module to obtain contextualized embeddings of various modalities (i.e., text, image, and point-cloud), from which the final prediction is made by a small decoder.

However, bridging different modalities in a latent space is challenging. For instance, state-of-the-art 2D vision-language understanding models like CLIP [248] require billions of image-text pairs for training, yet still exhibiting a limited grasp of compositional and relational concepts [249]. The challenge intensifies in 3D domains, where annotated data is much more scarce [250]. As a result, the capacity of existing models to perform 3D reference resolution is limited.

On the contrary, philosophers such as Ludwig Wittgenstein argue that our understanding of reality is confined by the language we use, who famously stated, “The limits of my language mean the limits of my world.” This concept underpins our approach, wherein we propose to employ text as the unifying medium to bridge the gap between 3D scene parsing and referential reasoning. This approach is grounded in realizing that the challenge of resolving 3D referring expressions can be fundamentally divided into two components: detection (identifying objects in the scene) and reasoning (associating one of the candidates with the referring expression). By harnessing text as a cohesive bridge, we can capitalize on the recent advancements in 3D detection and the enhanced natural language reasoning abilities offered by large language models (LLMs) [251].

Specifically, from the results of an off-the-shelf 3D detector [6], Transcrib3D first converts the detected spatial and semantic 3D scene information—the category, location, spatial extent, and color of objects—into texts, thereby creating an object-centric 3D scene transcript (hence the name of our method, Transcrib3D). We then filter out non-relevant objects in regard to the query (e.g., information about a trash can is not relevant to the expression “the white pillow on top of the chair and next to a blue pillow”). Subsequently, we compose a prompt that incorporates the filtered 3D transcript and the referring expression, and process it through an LLM-based reasoning mechanism. The reasoning module incorporates three key elements to make LLMs more effective and generalizable for the task:

(1) Iterative code generation and reasoning;

(2) Principle-guided zero-shot prompting; and

(3) Fine-tuning from self-reasoned correction.

We evaluate our method on standard 3D referring expression resolution benchmarks, ReferIt3D [247] and ScanRefer [252], achieving state-of-the-art performance on both. We also perform real robot experiments (Fig. 3.6) that task a robot manipulator with following natural language commands that require sophisticated 3D spatial-semantic reasoning, demonstrating the practicality of the method.

3.2.2 Related Work

The problem of resolving 3D referring expressions has garnered significant attention of late, in large part due to the introduction of the ReferIt3D [247] and ScanRefer [252] benchmarks. ReferIt3D contains two subsets: SR3D, which consists of template-based utterances, and NR3D, which consists of human-sourced free-form utterances. Contemporary methods [253, 179, 254, 255] perform 3D referring expression resolution by aggregating different input modalities into contextualized embeddings using Transformer architectures [256] in an end-to-end fashion. MVT [253] projects 3D information into 2D to achieve better feature encoding. BUTD-DETR [179] fine-tunes detected 3D bounding boxes within the Transformer. SAT [254] uses 2D semantics during training to learn a mapping from the query to its 3D grounding. HAM [255] presents a hierarchical alignment model that learns multi-granularity visual and linguistic representations. Different from these methods, NS3D [257] proposes a neuro-symbolic framework that utilizes a language-to-code model to generate programs, where each module is represented by neural networks. D3Net [258] and 3DJCG [259] jointly learn 3D captioning and grounding together, where D3Net proposes self-critical training while 3DJCG proposes task-agnostic shared modules and separate task-specific heads. ViL3DRel [260] designs a spatial self-attention layer to account for relative distances and orientations between 3D objects. 3D-VisTA [180] performs pre-training on a dataset of $278$ k 3D scene-text pairs, and fine-tunes the model on specific tasks.

In contrast, we propose to connect 3D detection and the LLM reasoning module via a textual representation, which spares us from learning the joint representation of different input modalities from limited 3D annotated data.

3.2.3 Grounding Large Language Models

LLMs trained on Internet-scale text data have shown dominant performance across various NLP tasks [251]. However, LLMs have to be grounded to answer questions or execute actions in the physical world. VisProg [261] uses in-context learning for LLMs to generate code for 2D image processing tasks. SayCan [262] and SayPlan [263] instead modulate the LLM outputs with a model of the perceived environment (i.e., “affordances” for SayCan and scene graphs for SayPlan).

When it comes to explicitly reason over 3D inputs, one branch of works train multi-modal models that directly incorporate 3D representations into the token library. 3D-LLM [264] incorporates distilled 3D features [265] with language tokens. 3D-VisTA [180] performs multi-modal fusion of language and PointNet++ features [10] with self-supervised masked encoding training.

Instead of multi-modal approaches, studies have investigated the use of textual 3D information to enhance downstream tasks. For instance, Feng et al. [266] employ textual 3D data in conjunction with LLMs to generate 3D indoor scenes. Similarly, Yu et al. [267] use LLMs for multi-robot navigation, leveraging extracted 3D scene information. We see this body of work as indicative of the promise of textual 3D scene information, and a motivation for our approach. Concurrent with our work, Yang et al. [268] also use LLMs to reason over textual data, however, their methodology differs in its use of a three-step, task-specific reasoning process as opposed to our general, flexible approach.

3.2.4 LLM Reasoning

There are many techniques proposed to enhance the reasoning capabilities of LLMs [182, 269, 270]. The reasoning module of our method is similar in spirit to the general framework of ReAct [271], where each round of code generation is followed by a round of LLM analysis to proceed with the reasoning or debugging, although our context is 3D-specific. Self-correction is also studied in LLM community. In contrast to Reflexion [272], which uses test-time environmental feedback, we instead collect feedback from training samples, and use the corrected samples to fine-tune models.

3.2.5 Methodology

Figure 3.5 illustrates our proposed Transcrib3D framework. Given the input colored point-cloud, Transcrib3D first applies a 3D object detector to generate an exhaustive list of objects in the scene transcribed as text (Section 3.2.5.1). The list is then filtered to identify objects relevant to the provided referring expression (Section 3.2.5.2). The resulting filtered object information along with the referring expression serve as input to the LLM-based reasoning module for inference. To reach the final answer, the reasoning module includes a “code interpreter” mode in which the LLM iterates between code generation and reasoning over outputs from code execution (Section 3.2.5.3). We consider two options for interfacing with the LLM:

zero-shot with principles-guided prompting, and
fine-tuning from self-reasoned correction

(Section 3.2.5.5).

3.2.5.1 Detect and Transcribe 3D Information

Given a colored point-cloud of the scene, we first perform semantic segmentation of objects using Mask3D [6]. We associate with each detected object its category based on the semantics, center location, spatial extent according to the 3D bounding box, as well as its mean color. The 3D orientation of an object can also be incorporated using PartNet [273]. Transcrib3D compiles the information associated with all detected objects as a list to form an object-centric scene description. The following shows an example of such a scene transcript (exact numeric values and duplicate object categories are replaced by …).

3.2.5.2 Pre-Filtering Relevant Objects for Utterance

The aforementioned procedure results in a representation of every object detected in the scene. However, only a small fraction of these objects will typically be relevant to the given referring expression. Given the utterance, “chair in the corner of the room, between white and yello desks”, the model identifies the following objects as relevant,

Spatial Semantic Scene Description for Relevant Objects

scene0592: … Scene center: […] … objs list:

wall, id=8, ctr=[…], size=[…], rgb=[…];

…

armchair, id=15, ctr=[…], size=[…], rgb=[…];

…

chair, id=19, ctr=[…], size=[…], rgb=[…];

…

and the remainder as irrelevant. Simplifying the object list by filtering out irrelevant items (taking into account synonyms and hypernyms) not only reduces processing time and the number of tokens for LLMs, but also facilitates reasoning by reducing content that may potentially be distracting. Such an approach has been shown to improve the efficiency of language grounding [274].

3.2.5.3 Iterative Code Generation and Reasoning

Compositional reasoning that involves arithmetic calculations, which is crucial for spatial reasoning, is well known to be a weakness of Transformers and LLMs [275]. However, most contemporary approaches to 3D referring expression resolution rely on a single forward pass of a Transformer to reach the final answer, which can be limiting in terms of reasoning power. In order to avoid this weakness, Transcrib3D equips the LLM with a Python interpreter and directs the LLM to generate code whenever quantitative evaluations are necessary. Transcrib3D then locally executes the generated code using a Python interpreter and appends the output to the original conversation. The resulting prompt is sent back to the LLM to generate the next response. If there is any error in executing the generated code, Transcrib3D also feeds information about the error back to the LLM for re-generation. This process continues until the LLM believes that the reasoning is complete. Figure 3.7 provides an illustration of this process.

We want to emphasize that our design does not distinguish between code generation and chat reasoning steps. Such a framework simplifies the design, allows for flexible reasoning, and makes it easy for fine-tuning. But in turn, it requires the LLM to have the ability to do both, so language models that are specifically trained for only coding or only chatting will not perform the best within our framework.

3.2.5.4 Principles-Guided Zero-Shot Prompting

Spatial reasoning can require complex compositional logic that is challenging for LLMs [275]. For example, when handling left/right spatial relations, which should factor in the observer’s viewpoint, LLMs often naively ground them with smaller/larger values in the $x$ -coordinates. In order to overcome these deficiencies, Transcrib3D employs a few general principles to guide LLM reasoning in a zero-shot fashion, including how to 1) use HSL space to match colors; 2) perform vector operations to resolve directional relations; 3) calculate the point-to-plane distance to determine who’s closer to a wall, to name just a few. We find that this set of guiding principles works across 3D referring benchmarks. The full prompt will be included in the released code.

3.2.5.5 Fine-tuning from Self-Reasoned Correction

Rule-based systems [276] are effective for relatively simple domains that involve a limited set of spatial-semantic concepts and structured language, but struggle to generalize, particularly to open-world domains [277]. Motivated by the now well known benefits of data-driven alternatives to rule-based methods, we adopt a novel fine-tuning method for LLMs that enables learning beyond the given set of rules (i.e., the general guiding principles) by enabling the model to learn from its own mistakes. In effect, we seek to endow LLMs with introspection capabilities. We do so via the following procedure:

Use the prompt with general principles on the training set for the LLM to generate an initial set of answers with elaborated reasoning process. 2. 2.

For any incorrect answer, augment the original prompt with the correct object ID and request the LLM to reflect on why the original answer is incorrect (“What went wrong?”). This is followed by a request to output “clean” reasoning for the correct answer. 3. 3.

Gather the reasoning processes of the correct examples, and the re-generated ones of the initially incorrect examples to produce the dataset for LLM fine-tuning.

Prompt 1 shows an example of the self-reasoned correction step. Note that after fine-tuning, we no longer include general principles in the prompt. In this way, the system not only incorporates the guiding principles, but further improves itself by learning from self-reasoned correction.

3.2.6 Experiments

We evaluate the effectiveness of Transcrib3D using the ReferIt3D [247] and ScanRefer [252] benchmarks. ReferIt3D formulates 3D referring expression understanding as the multiple-choice problem: given a set of segmented objects in a 3D scene along with a corresponding referring expression, identify the unique referent object from the set, typically containing several instances of the same fine-grained category. There are five different types of relations in SR3D, namely “horizontal”, “vertical”, “support”, “between” and “allocentric”, which make up approximately $81\%$ , $4\%$ , $2\%$ , $8\%$ , and $5\%$ of the data, respectively. ReferIt3D measures performance in terms of accuracy. Both Sr3D and Nr3D are split by “Easy”/“Hard” and “ViewDep”/“ViewIndep”. ”Easy” samples has one or none distractors in a scene, while ”Hard” samples have two or more. The view-dependent samples contain language descriptions that rely on viewing directions. Unlike ReferIt3D, ScanRefer does not provide object segmentation information, and instead tasks methods with returning the 3D bounding box of the target object, given the query utterance and a colored point-cloud. ScanRefer measures performance in terms of accuracy conditioned on the intersection-over-union (IoU) of the ground-truth 3D bounding box over the predicted one, with thresholds at $25\%$ ([email protected]) or $50\%$ ([email protected]).

3.2.6.1 Grounding Accuracy on ReferIt3D

We test different variations of our method against contemporary baselines on the SR3D and NR3D subsets of ReferIt3D. On NR3D, we evaluate our best model, Transcrib3D (GPT4-P), on the full test set, while other variants of Transcrib3D are evaluated on a subset of $500$ randomly sampled data points. On SR3D, we evaluate all Transcrib3D models on the same subset of $500$ random samples from the test set.111We use a subset of the test set due to the cost of evaluating all variations of Transcrib3D on the full test set, however we believe that the results would be similar on the full test set due to the templated nature of the utterances. Table 3.2 presents the results. Figure 3.8 also provides a qualitative comparison between 3D-VisTA [180] and Transcrib3D that highlights the strength of our method with regards to complex reasoning.

3.2.6.2 Grounding Accuracy on ScanRefer

We test the best variant (GPT4-P) of our method with different detection modules against baselines on the ScanRefer benchmark, which unlike ReferIt3D, does not provide methods with the ground-truth bounding boxes during inference. In addition to comparing to baseline results reported in their respective papers, we also re-run the checkpoints from 3D-VisTA [180], the best baseline method, on the same $500$ random samples from the ScanRefer validation set. Table 3.3 presents the results. We achieve state-of-the-art performance on ScanRefer with both detected or ground-truth bounding boxes. It is worth noting that the performance gain over the baseline is greater with ground-truth compared to detected bounding boxes. This is partly due to the fact that the baseline method is trained with the lower-quality detected bounding boxes, which ironically leads to the ground-truth ones being out of distribution.

3.2.6.3 Effects of Fine-tuning Methods

We study the effects of our proposed approach that involves fine-tuning from self-reasoned corrections. Following Section 3.2.5.5, we run Transcrib3D with GPT-4 on 500 samples from the NR3D training set and collect both correct and incorrect examples. For the incorrect examples, we let the LLM re-generate the self-corrected reasoning. We use the combination of correct and self-corrected examples to fine-tune the smaller model (i.e., gpt-3.5-turbo) for 3 epochs. We then evaluate all models on the same subset of 500 random samples from the NR3D test set. Notably, we remove all rule-based prompts from the fine-tuning data so that the fine-tuned models are not constrained to human-designed rules, but are instead adaptive to new rules learned via self-correction.

Note that inference is typically faster with a smaller model. A fine-tuned small model with comparable performance to that of a large model on this task would be more desirable for local deployment on edge computers or robots, with additional privacy benefits. Table 3.4 presents the results.

3.2.6.4 Referring Expressions for Robot Manipulation

We demonstrate how Transcrib3D supports a robot’s ability to follow natural-language instructions for pick-and-place manipulation, particularly when complicated referring expressions are involved. As a core capability of robot manipulation, language-guided pick-and-place involves (i) breaking down the language instruction into two referring expressions, one each for the “pick” and “place” identities; (ii) resolving the referring expressions in the context of the robot’s surrounding environment; and (iii) performing the pick-and-place actions. We conduct the pick-and-place task with a Universal Robots UR5 arm equipped with a Robotiq 3-Finger Adaptive Robot Gripper placed in a table-top setting. Figure 3.6 visualizes the execution of one utterance using our method.

To parse the given instruction, we employ a language model-generated program (LMP) from few-shot prompting as in Code-as-Policies [278], which is instructed to call the put_first_on_second(arg1, arg2)} function with desired arguments \mintinlinepythonarg1 and arg2}. This approach allows free-form texts as input, which is more flexible and natural than template parsing. The following is an example that is used in the prompt for the LMP: \beginminted[fontsize=]python ’# query: Pick up the orange between the apples and place it in the bowl

with a banana in it.

put_first_on_second(”orange between the apples”, ”bowl with a banana in it”)’

The put_first_on_second(arg1, arg2)} function first composes an exhaustive list of objects in the environment along with their spatial-semantic details. To do so, the function employs MDETR~\citekamath2021mdetr, an open-vocabulary object segmentation method, that takes as input an RGB image and an object category, and outputs the 2D spatial attributes of all objects in the scene of that category. The 2D attributes are then lifted to 3D using depth information from a Realsense RGB-D camera. Repeating the process for all candidate object categories results in a 3D spatial-semantic transcript of all objects in the scene. Transcrib3D then reasons over this transcript along with the referring expression to identify the object in question. Finally, we compute the pick and place poses accordingly to control the robot’s end effector. We provide the pseudo-code below.

⬇

17

18def put_first_on_second(self, arg1, arg2):

19# obtain the objects list in the environment

20objs = self.env.get_objs()

21pick_id = get_obj_id(objs, arg1) # Referring

22place_id = get_obj_id(objs, arg2) # Referring

23pick_pose, place_pose = self.get_obj_pose(objs, pick_id, place_id)

24self.env.step({’pick’: pick_pose, ’place’: place_pose}) # robot manipulation

We apply our method to robot manipulation by integrating Transcrib3D as the perception process in the Code-as-Policies (CaP) [278] framework (“CaP+Transcrib3D”). As a baseline, we compare to the standard implementation of Code-as-Policies, which uses MDETR [279] for perception (“CaP”). When grounding an object in the scene, CaP assumes that the object can be uniquely identified by MDETR according to its associated noun phrase (e.g., “the duckie”). CaP then simply selects the output grounding with the highest score. In order to make the comparison fair, we extend CaP’s prompt so that it is able to use the entire referring expression as the object identifier, which is indeed unique. Otherwise, CaP would perform poorly in our setting, since it has never seen complicated referring expressions in the original prompts.

We evaluate the performance of both methods on five different natural language instructions (each involving two referring expressions). Figure 3.9 shows a qualitative comparison for one of the test cases. Our results reveal that when assessed at both the referring expression and instruction levels, CaP+Transcrib3D significantly outperforms CaP. CaP+Transcrib3D achieves an instruction success rate of 80% (4 out of 5) compared to 20% (1 out of 5) for CaP, and correctly resolves 90% of the referring expressions (9 out of 10) compared to 40% (4 out of 10) for CaP. This underscores a pronounced benefit in employing Transcrib3D for resolving referring expressions within robot pick-and-place tasks.

3.2.7 Discussion

We acknowledge that Transcrib3D is not without limitations. First, our scene transcript is object-centric. Although object-level details such as bounding boxes are sufficient for numerous 3D spatial reasoning tasks, there exist scenarios that necessitate a finer level of object details. We show examples of two such cases in Figure 3.10. Second, our reliance on existing 3D detectors introduces a constraint: the quality of 3D detection itself. In our experiments, we observed that even state-of-the-art 3D detection methods [280, 6] yield sub-optimal results, highlighting the room for improvement in 3D detection. Third, our method involves manual specification of the desired information to be extracted from 3D detections (e.g., each object’s center, size, and orientation), which has its limitations. An adaptive feature selection strategy could potentially yield better results. However, even with those limitations, our method surpasses all current multi-modality baselines. This achievement leads us to propose two critical insights: firstly, the connection module facilitating interaction between different modalities may not be as effective since our experiment can be roughly regarded as a controlled test that replaces a typical multi-modal cross-attention module with just text, and demonstrates that it works better; and secondly, the reasoning module might be too simplistic to ground the complex logic inherent in natural language. We believe that the observed limitations in current multi-modality methods can largely be attributed to the scarcity of annotated 3D data, which is orders-of-magnitude smaller than its 2D counterpart, primarily due to the higher cost in its collection. We call for more efficient data collection pipelines in 3D and robotics, potentially leveraging generative methods.

3.2.8 Conclusion

We introduce Transcrib3D, a simple and effective method for the 3D grounding of natural language referring expressions that requires no training, yet delivers state-of-the-art performance across leading 3D reference resolution benchmarks. The idea of using text as a unifying medium to connect the 3D scene and LLM reasoning not only achieves great results, but also provides critical insights into the bottleneck of current multi-modal methods.

3.3 State-maintaining language model for long-horizon embodied reasoning: Statler

There has been a significant research interest in employing large language models to empower intelligent robots with complex reasoning. Existing work focuses on harnessing their abilities to reason about the histories of their actions and observations. In this paper, we explore a new dimension in which large language models may benefit robotics planning. In particular, we propose Statler, a framework in which large language models are prompted to maintain an estimate of the world state, which are often unobservable, and track its transition as new actions are taken. Our framework then conditions each action on the estimate of the current world state. Despite being conceptually simple, our Statler framework significantly outperforms strong competing methods (e.g., Code-as-Policies) on several robot planning tasks. Additionally, it has the potential advantage of scaling up to more challenging long-horizon planning tasks.

3.3.1 Introduction

Large language models (LLMs) exhibit strong reasoning capabilities that are harnessed to perform a wide range of downstream tasks such as dialogue and code generation [281, 282, 283]. The robotics community has recently seen a significant interest in empowering robots with LLMs, enabling them to understand natural language commands and perform tasks that require sophisticated reasoning [262, 278, 284, 192]. However, existing methods are model-free: they use LLMs as policy functions that generate future actions only conditioned on previous actions and observations. In this paper, we propose a simple yet effective model-based approach. Our framework—named Statler—maintains a running estimate of the world state by prompting large language models and performs multistep embodied reasoning conditioned on the estimated state. Figure 2.9(a) illustrates this framework. In particular, Statler utilizes a pair of prompted LLMs: instructed by a few demonstrations, the world-state reader takes as input the user query, reads the estimated world state, and generates an executable action (e.g, a code snippet); instructed by another set of demonstrations, the world-state writer updates the world state estimate based on the action. This mechanism resembles how a domain-specific formal language tracks a symbolic world state [285], but enjoys greater flexibility since pretrained large language models are known to be domain-agnostic. As we will see soon in 3.3.5, the prompts in our experiments are generic and users of our framework will have minimal design workload. Our Statler framework is primarily inspired by classical model-based reinforcement learning. In a model-based approach, an environment (or world) model learns to capture the dynamics of the environment (e.g., possible outcomes of an action) so that the policy conditioned on the model state will take more informed actions [286]. In our framework, the LLMs have acquired massive amounts of commonsense knowledge from pretraining, and they are elicited—by a few demonstrations—to behave as an environment model, estimating the world state and facilitating decision making. Another motivation of our design is to handle missing data. In robotics tasks, we often have to cope with latent world dynamics that are not directly observable. In such scenarios, explicitly maintaining an estimated world state improves decision making, although the estimates might be imperfect. This is analogous to graphical models with latent variables: spelling out latent variables and imputing their values is often helpful for reasoning about the target variables, although the imputation may not be perfect [287]. The final motivation of our state-maintaining design is its potential to scale up to long-horizon planning tasks. In multistep reasoning and planning, an LLM has to implicitly maintain the world state in its internal representation, which has been demonstrated to be difficult in previous work [288, 289, 290, 291, 292, 293, 294]. By explicitly maintaining an estimated world state, our framework makes it easier to track and consult the world state at any step of reasoning and planning, thus carrying a higher chance of success in long-horizon tasks. In the following sections, we will show that our framework performs as expected: in 3.3.3, we demonstrate the concept with a pedagogical example; in 3.3.4, we introduce the Statler framework; in 3.3.5, we present the experiments, in which our framework significantly outperforms strong competing methods such as Code-as-Policies [278].

3.3.2 Related Work

**Language Understanding for Robotics ** A common approach for language understanding for robotic agents involves symbol grounding [295], whereby phrases are mapped to symbols in the robot’s world model. Early work [296, 297] relies upon hand-engineered rules to perform this mapping. More recent methods replace these rules with statistical models the parameters of which are trained on annotated corpora [298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309]. Other methods use neural network-based architectures to jointly reason over natural language utterances and the agent’s (visual) observations of the scene [228, 229, 230, 231, 232]. **LLMs for Robotics ** Since LLMs are trained with Internet-scale corpora, their infused common sense have shown to help in the domain of robotics in terms of high-level planning from natural language instructions [262, 278, 217] for both object manipulation [218, 219] and navigation tasks [220, 221, 222, 223]. Combining LLMs with expressive visual-language embeddings also enables impressive capabilities [310]. This has led to efforts to push for general multi-modality embodied models [311, 215]. **Code Generation with LLMs ** Code generation has been one of the most successful use cases for LLMs [282, 312, 313, 314, 315, 283]. Since code can connect with executable APIs for tasks including computation, vision and manipulation, a large chunk of work has focused on code generation with different tools [316, 317, 318]. In particular, Code-as-policies [278] is one of the first to use code generation within a robotics context. **State Representation in Reasoning ** The use of state representations have been shown to help in algorithmic reasoning tasks [319, 320]. Instead of using one forward pass to predict the execution result for the entire code snippet, Nye et al. [319] proposes to spell out step-by-step intermediate outputs to help infer the final execution results. Also relevant are research efforts that aim to enhance language modeling by rolling out possible future tokens [321]. **Language Models and Planning ** Recent work shows that vanilla and instruction-tuned LLMs plan poorly [291, 294, 293]. Some works propose using the LLM as an intermediary between natural language and a domain-specific programming language, and then uses a traditional planner [292, 293, 322]. Silver et al. [294] employ Chain-of-Thought and iterative reprompting with feedback on generated plans, but require GPT-4 for good performance. Xiang et al. [323] use parameter-efficient finetuning of LLMs on top of traces from a world-model and show improved performance on related tasks.

3.3.3 Motivating Example

We use three-cups-and-a-ball, a simple shell game, to demonstrate the effectiveness of our state-maintaining idea. In this game, a ball is covered under one of three identical cups and the initial position of the ball is known to the player. In each of $K$ rounds, we randomly swap two cups’ positions. Finally, we ask the player to guess the position of the ball. We present three separate cases of using LLMs to play this game using GPT-3 (precisely, text-davinci-003). Prompt 10 demonstrates the simplest case: We represent the state with Boolean variables with True indicating “ball is here”. We feed the initial state and the $K$ rounds of swaps into GPT-3, instructing it to complete the final state. Prompt 11 is an improved way: after reading $K$ rounds of swaps, GPT-3 is asked to give all the intermediate states over the game. This version is inspired by Chain-of-Thought prompting [324], which improves the performance of an LLM by requesting it to spell out its intermediate reasoning steps. Finally, Prompt 12 is a simple instantiation of our state-maintaining idea: we ask GPT-3 to return the current state immediately after reading each round of swaps, stimulating the model to track and update the state as the game progresses. We evaluate these methods with a range of $K$ ; for each $K$ and each method, we feed $30$ demonstrations with various numbers of swaps to the model, and repeat the experiment $100$ times. 3.12 visualizes the average accuracies. The state-maintaining method significantly outperforms the other methods, with the performance gap increasing with $K$ .

3.3.4 Methodology

As exemplified in 3.3.3, the key to our approach is to allow the LLM to describe the next state while responding to each user query. The motivating example is simple in that the next state is the response. Instead, we now consider more general scenarios where there is a significant burden on the LLM to track the state updates as well as generate responses. (Fig. 3.13). For the general cases, we propose to split the burden across multiple different prompted LLMs. Precisely, we maintain a separate prompt that includes instructions and demonstrations for each subtask (i.e., state-tracking or query-responding) and then use the prompt to elicit an LLM to perform the particular subtask. As we discuss shortly, our framework includes a world-state reader that responds to the user query and a world-state writer that is responsible for updating the state representation. Our framework (Fig. 2.9(a)) does not pose any fundamental limitation on which domain it can be applied to. Our approach can be regarded as a model-based extension of Code-as-Policies (CaP) in the sense that it keeps the core capabilities of CaP (e.g., hierarchical code generation) and incorporates a means to explicitly maintain an estimated world state. It is useful to consider example prompts to understand the operation of the reader and writer models. Prompt 13 is an example of the input passed to the world-state reader. Initially, we initialize a JSON-formatted state with a reference to object-oriented principles. Given a user query “Put the cyan block on the yellow block” (Line 13) and the current state representation (Lines 1–12), the world-state reader should generate the code that responds to the query, taking into account the current state. The expected code to be generated is highlighted in green. After generating the code, our model executes it to complete the query. When the state needs to be updated, the generated code will contain an update_wm function that triggers the world-state writer with the query specified in its argument. In Prompt 14, we show the corresponding example for the world-state writer. Similar to the reader, we prepend the current state representation before the user query and the model generates the updated state representation (highlighted in green).

3.3.5 Experiments

We evaluate the capabilities of Statler alongside state-of-the-art LLM models on three tabletop manipulation domains (Fig. 3.14): pick-and-place, block disinfection, and relative weight reasoning. For each domain, we design in-context examples and consider $20$ evaluation episodes each of which consists of $5$ – $16$ consecutive steps of user queries. Every episode contains at least one query that requires reasoning over the interaction history (i.e., requires “memory” across steps), which makes the task significantly challenging.

3.3.6 Simulated Tabletop Manipulation Domains

The Pick-and-Place domain involves scenarios that require a robot arm to sequentially pick up and place a block onto another block, bowl, or the table. The model needs to remember and reason over the block locations. The example user queries are “Put the green block in the red bowl.”, “What is the color of the block under the pink block?”, and “How many blocks are in the green bowl?”. In the Block Disinfection domain, we consider a scenario in which a block can be either dirty or clean, the state of which is not observable by the robot. When a clean block touches a dirty block (e.g., as a result of stacking one block on another), the clean block becomes dirty. There is a disinfector on the table that cleans any block placed inside it. This scenario emulates a clean-up task in which you might ask a robot to put dirty dishes in a dishwasher or dirty clothes in a washing machine. The user query contains pick-and-place commands similar to those in the pick-and-place domain as well as textual utterances that require reasoning over which blocks are clean and dirty, such as “Put all the clean blocks in the green bowl.” This domain presents a particular challenge as the model must track the cleanliness of each block and accurately capture the state mutations that happen when a dirty block comes into contact with another clean block. Relative Weight Reasoning involves memorizing and reasoning over the relative weights of the blocks. User queries provide information about the weight of blocks (e.g., “The red block is twice the weight of the bronze block.”), which are followed by queries that require reasoning over the weights (e.g., “Put blocks in the purple bowl so that their total weight becomes identical to what is in the gray bowl.”). We compare our proposed approach, Statler, to two strong competing methods: Code-as-Policies [278] (CaP) and CaP with Chain-of-Thought prompting [324] (CaP+CoT). CaP generates code for the current question at each step based on the past actions, but it does not maintain a state. Following the CoT framework, at every step, CaP+CoT deduces the intermediate states based on an initial state and past actions, which are considered as its thoughts, to generate the current code. But it leads to redundant reasoning and increases the length of the prompt, which may then exceed the LLM’s context window size limitations. Furthermore, longer reasoning also demands longer, more intricate demo example prompts, contributing to increased developer effort. We ensure that the demonstrations (i.e., in-context examples) given to each of the models are equivalent. Namely, we use the same sequence of user queries and code snippets, except for necessary differences due to their designs such as state representation.

Table 3.5 reports the episode success rates of each method along with the the success rate for individual steps. An episode is considered to be a failure if a model fails to respond to one of the user queries in the episode. While the CaP baseline correctly processes more than half of the individual steps in each domain, it fails to successfully complete any of the episodes. As we show later, CaP correctly processes most queries that do not require reasoning over previous steps (e.g.,“Put the red block on the blue block.”), but tends to generate incorrect (or no) code in response to queries that require reasoning over the history (e.g., “Put all the dirty blocks in the pink bowl.” and “What is the color of the block under the purple block?”) (see 3.15 (top)). CaP+CoT fares slightly better in the Pick-and-Place and Relative Weight Reasoning, but still fails in most episodes. In contrast, Statler successfully handles the majority of these queries, demonstrating strong improvement over the others. It should be noted we explicitly chose queries that were challenging for LLM-based models, which partially accounts for why our model’s scores show room for improvement.

In order to better understand the behavior of Statler in comparison to Code-as-Policies, we analyze the success rates based on the type of user queries. Specifically, we categorize each query as either temporal or non-temporal depending on whether responding to the query necessitates temporal reasoning. We emphasize that contemporary methods, including the baselines that we consider, use non-temporal queries for evaluation. Table 3.6 summarizes the resulting accuracy. The models often fail at different steps in an episode. We note that, when calculating accuracy we only consider the sequence of steps until the model fails to generate the correct code, which explains the mismatch in the denominators. We see that both models achieve near-perfect performance on commands that do not require temporal reasoning. However, the performance of CaP noticeably decreases for tasks that require reasoning over the past. In contrast, Statler achieves success rates of $83\%$ (vs. $31\%$ for CaP) on Pick-and-Place, $65\%$ (vs. $5\%$ for CaP) on Block Disinfection, and $55\%$ (vs. $0\%$ for CaP) on Relative Weight Reasoning.

Although our method enjoys a better robustness than the baseline methods, it inherits some issues of large language models, which hinders its performance. For example, it hallucinates block conditions (e.g., clean or dirty) or locations when the cleanliness of the block is never explicitly described. Moreover, the model’s reasoning strategy seems to predominantly treat weight as an abstract concept, e.g. light vs. heavy, rather than executing mathematical computations. This weakness is evident when asking the model to accumulate blocks in a bowl until their total weight surpasses that of another bowl, yet the model underfills the bowl. In the disinfection domain, our model struggles to comprehend ambiguous terms like “other” in queries such as “the other blocks are clean.” It can also wrongly infer from the training prompt that a block at the bottom becomes dirty when a block is placed on top of it, irrespective of the latter’s cleanliness.

3.3.7 Real Robot Experiments

In order to validate Statler on a real robot, we implement it on a UR5 arm in a similar tabletop domain as the simulated experiments. We use MDETR [279], an open-vocabulary segmentation model, to obtain segmentation masks for objects within an RGB image captured by a RealSense camera mounted on the gripper. Using the predicted object mask and the depth image, the object point-cloud can be obtained, from which we estimate its center position and bounding box. All of the primitive functions are identical to those used in simulation. In this domain, the robot is asked to stack objects and cover objects with different colored cups. At any point, an object is only permitted to be covered by at most a single object or cover. If the robot is asked to manipulate the bottom object, it must put away the top one. If a new cover or object is to be stacked on it, the existing one must be removed. We evaluate the performance of Statler vs. CaP in the real robot domain using $10$ episodes. Statler achieves episode and step (in parentheses) success rate of 40%

$(70\%)$

, where $67\%$ of the failure cases are due to LLM reasoning while others are caused by either perception or manipulation issues. The success rate for CaP is 20%

$(46\%)$

, where LLM reasoning accounts for $88\%$ of failures. In Figure 3.16, we also provide a short example where the CaP baseline fails. The difficulty is in recognizing that yellow block is hidden under the black cup, which must be removed before picking up the yellow block as Statler correctly spots. Instead, CaP is not aware of this and tries to pick up the yellow block nonetheless, which leads MDETR to incorrectly detect the toy wheel that has yellow color in it as the yellow block.

3.3.8 State-Maintenance Ablations

To better understand Statler’s state-maintenance strategy, we consider three different approaches to tracking the state.

The first (Statler-Unified) employs a single LLM as both the world-state reader and writer using a prompt that interleaves Statler’s reader and writer prompts. At each step, the LLM first generates the action and then predicts the state that results from executing that action. The LLM then uses the resulting state when reasoning over the next query. Using a single LLM is conceptually simple, but it incurs an added burden for reasoning and generalization. Inspired by InstructGPT [325], the second (Statler-Auto) does not receive any in-context state-update examples for the world-state writer. Instead, we provide a natural language description of how the state should be maintained. Prompt 15 shows the relevant portion of the prompt. With an instruction and no in-context state-update examples, we ran our model on the same set of tasks. The third (Statler w/o State) ablates the world-state maintenance components of Statler entirely, resulting in a model that reduces to Code-as-Policies.

Table 3.7 compares the performance of Statler to the three variations in terms of both their full-episode completion rates (using $20$ episodes for each domain) as well their individual step success rates. Without maintaining the world-state, Statler w/o State fails to complete any episodes (recall that an episode is considered to be a failure if the model fails to respond to one of the user queries during the episode) and results in individual step success rates that are significantly lower than Statler. Meanwhile, we see that Statler’s use of separate LLMs for the world-state reader and world-state writer results in consistently higher episode success rates compared with the use of a unified reader and writer (Statler-Unified). The individual step success rates are higher for Pick-and-Place and Block Disinfection, and comparable for Relative Weight Reasoning. With regards to Statler’s use of separate LLMs for the world-state writer and reader, we note that in-context learning has been shown to be sensitive to variations in prompt templates, the order of examples, and the examples used [326, 327]. In light of this, it is plausible that the performance gains that we achieve by dividing our reader and writer may be attributed in part to this sensitivity, allowing the models to, in effect, become specialized at their respective tasks. Interestingly, Statler-Auto performs noticeably better than Statler and Statler-Unified with regards to the episode success rate on the Pick-and-Place and Block Disinfection domains, but comparable to Statler in terms of the individual success rates, and worse for Relative Weight Reasoning.

3.3.9 Conclusion

We presented Statler, a language model that maintains an explicit representation of state to support longer-horizon robot reasoning tasks. Integral to Statler are a world-state reader that responds to a user query taking into account the current internal state, and a world-state writer that maintains the world state. Evaluations on various simulated and real robot manipulation tasks reveal that Statler significantly outperforms contemporary models on non-trivial tasks that require reasoning over the past. Ablations demonstrate the contributions of our world-state reader and writer, and suggest Statler’s flexibility to the state representation.

Chapter 4 Conclusion and Discussion

This thesis investigated two major components required for Embodied Spatial Intelligence:

Robotic scene representation through implicit modeling, and 2. 2.

Embodied spatial reasoning with hybrid systems.

Summary of Contributions

On scene representation, this thesis first showed how self-supervised learning can enable camera self-calibration from image sequences without a calibration target (Section 2.1) [28]. It then introduced DeFiNe, a continuous depth field that achieves state-of-the-art generalization by reducing architectural bias and introducing a novel 3D data augmentation strategy (Section 2.2) [16]. Finally, it presented NeRFuser, a method to scale typically local neural radiance fields to building- and city-scale scenes via NeRF registration and blending (Section 2.3) [21]. Collectively, these contributions advance the creation of high-fidelity, large-scale 3D scene representations from 2D observations. While implicit representations have enabled remarkable progress in novel view synthesis [20] and 3D semantic search [328], their application to robotics tasks with strong classical alternatives, such as visual SLAM [31], remains competitive but not yet universally superior [142, 143]. On spatial reasoning, this thesis introduced MANGO, a benchmark that probes the mapping and navigation competence of LLMs from textual traversals, revealing significant gaps compared to human performance (Section 3.1) [25]. To bridge the gap between 3D perception and language-based reasoning, it proposed Transcrib3D, which translates local 3D geometry into text and uses tool-augmented, iterative reasoning to achieve state-of-the-art 3D referring-expression resolution (Section 3.2) [23]. Lastly, it presented Statler, a dual-LLM architecture that explicitly maintains a symbolic world state to improve long-horizon planning and execution (Section 3.3) [27].

4.1 A Paradigm Shift: The Rise of 2D Large-Scale Models

During the course of this PhD, the research landscape was reshaped by the ascent of large-scale models trained predominantly on 2D data. Vision-language models (VLMs) have markedly improved language-grounded robotics, from navigation [329, 222, 330] to manipulation [331, 332, 333]. This trend has culminated in vision-language-action (VLA) models that are trained end-to-end to output robot actions directly from 2D images and language commands [334, 335, 336, 337, 338]. This synergy extends to generative models. Language-conditioned video generators like Veo3 [339] can synthesize scenes with plausible physics and notable 3D consistency. More impressively, action-conditioned world models like Genie3 [340, 341] can generate minute-long, visually consistent video sequences from actions alone, apparently without an explicit 3D rendering engine. In a sense, these models demonstrate an emergence of 3D understanding from massive 2D datasets.

4.2 A New ”Cloud” on the Horizon: Are 2D Models Sufficient?

This rapid progress prompts a critical question: are 2D large models the final answer to embodied intelligence, with progress bounded only by scale? While this is a compelling possibility, I recall the “two clouds” metaphor from Lord Kelvin 125 years ago. At a time when the physics community believed its theoretical foundations were largely complete, Kelvin highlighted two lingering anomalies. These seemingly small questions ultimately led to the revolutions of general relativity and quantum mechanics. Similarly, a few ”clouds” linger over the apparent dominance of 2D models. The primary cloud concerns their persistent weakness on geometric and arithmetic tasks [342, 343]. Performance on tasks like precise counting, metric depth estimation, and 3D object detection still lags behind specialized, state-of-the-art models. Indeed, even the most capable LLMs today struggle with simple character counting or arithmetic without deferring to external tools. Built upon the same Transformer architecture, it is unsurprising that VLMs inherit these limitations. This deficit is also apparent in generation [344, 345, 346, 347]; while generated videos are visually plausible, close inspection reveals that their implicit 3D structure often violates geometric principles, such as maintaining consistent vanishing points. Thus, the current emergence of 3D from 2D pretraining is, at best, approximate.

4.3 The Role of 3D in the Large-Model Era

Given the abundance of 2D data and the relative scarcity of 3D labels, 3D should not be seen as a competitor to 2D models, but as a complementary, high-leverage component. Its role is threefold:

3D as Ground Truth for Validation and Diagnosis. High-quality 3D signals are indispensable for rigorously evaluating whether models pretrained on 2D data have developed genuine spatial understanding. This applies to both perception (are the model’s beliefs about the world geometrically sound?) and generation (are the model’s simulated worlds physically consistent?). 2. 2.

3D as a Conditioning for Consistent Generation. While large-scale video models demonstrate impressive short-term consistency, maintaining coherence over long horizons remains a fundamental challenge. Because the physical world is governed by time-consistent 3D geometry, an explicit 3D representation can serve as a powerful memory or scaffold to condition video generation, enforcing global consistency. Early results in 3D-conditioned generation are already promising [348, 349, 350]. 3. 3.

3D as a Target for Supervised Fine-Tuning. After broad pretraining on 2D internet data, smaller, high-quality multi-view 3D datasets can be used to efficiently instill robust spatial priors in a model. This targeted fine-tuning can improve downstream spatial reasoning and control, providing a crucial bridge before more expensive reinforcement learning or deployment-specific adaptation.

Closing Remarks

The path forward in robotics will undoubtedly be dominated by scaling models on vast 2D datasets, as this approach has proven tractable and effective for learning general-purpose semantic capabilities. The emergent 3D understanding from this process will only improve with scale. However, for embodied agents that must act with precision, safety, and long-horizon consistency in the physical world, this thesis argues that 3D is indispensable. The research presented here supports a vision where the general semantic power of 2D large models is grounded and refined by the geometric and physical rigor of 3D structure. Explicit 3D serves as a critical tool for inducing, conditioning, and verifying the spatial intelligence of these systems. The strategic fusion of scalable 2D learning with structured 3D knowledge represents a practical and robust path toward the ultimate goal of creating truly capable Embodied Spatial Intelligence.

Bibliography350

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Usenko et al. [2018] V. Usenko, N. Demmel, and D. Cremers, “The double sphere camera model,” in Proceedings of the International Conference on 3D Vision (3DV) , 2018, pp. 552–560.
2Usenko et al. [2020] V. Usenko, N. Demmel, D. Schubert, J. Stueckler, and D. Cremers, “Visual-inertial mapping with non-linear factor recovery,” IEEE Robotics and Automation Letters , vol. 5, no. 2, pp. 422–429, 2020.
3Burri et al. [2016] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The Eu Ro C micro aerial vehicle datasets,” International Journal of Robotics Research , vol. 35, no. 10, pp. 1157–1163, 2016.
4Gordon et al. [2019] A. Gordon, H. Li, R. Jonschkowski, and A. Angelova, “Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2019.
5Jiang et al. [2020] L. Jiang, H. Zhao, S. Shi, S. Liu, C.-W. Fu, and J. Jia, “Point Group: Dual-set point grouping for 3D instance segmentation,” in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR) , 2020.
6Schult et al. [2023] J. Schult, F. Engelmann, A. Hermans, O. Litany, S. Tang, and B. Leibe, “Mask 3D: Mask transformer for 3D semantic instance segmentation,” in Proc. IEEE Int’l Conf. on Robotics and Automation (ICRA) , 2023.
7Kamath et al. [2023] A. Kamath, J. Hessel, and K.-W. Chang, “What’s “up” with vision-language models? investigating their struggle with spatial reasoning,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) , 2023.
8Chen et al. [2024] B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Sadigh, L. Guibas, and F. Xia, “Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2024.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Embodied Spatial Intelligence: From 3D Perception to 3D Reasoning

Abstract

Acknowledgments

Contents

List of Figures

List of Tables

Chapter 1 Introduction

1.1 From 2D to 3D: Building Implicit Scene Representations

1.1.1 Monocular Depth Estimation: A Canonical 2D–3D Mapping

1.1.2 Are General 3D Inductive Biases Learnable?

1.1.3 Scalable Scene Representations

1.2 Acting in 3D: Task-Specific Requirements

Short Response, Short Range (Manipulation).

Short Response, Long Range (Autonomous Driving).

Long Response, Long Range (Navigation and Exploration).

Long Response, Short Range (Multi-step Task Planning).

1.3 Thesis Scope and Roadmap

Chapter 2 Robotic Scene Representations from Implicit Modeling

2.1 Implicit camera estimation for robust scene understanding: Self-supervised camera self-calibration

2.1.1 Introduction

2.1.2 Related Work

2.1.3 Methodology

2.1.3.1 Self-Supervised Monocular Depth Estimation

2.1.3.2 End-to-End Self-Calibration

2.1.4 Experiments

2.1.4.1 Datasets

2.1.4.2 Training Protocal

2.1.4.3 Camera Self-Calibration

2.1.4.4 Camera Rectification

2.1.4.5 Camera Re-calibration: Perturbation Experiments

2.1.4.6 Depth Estimation

2.1.4.7 Computational Cost

2.1.5 Conclusion

2.2 Implicit generalizable depth field modeling by 3D data augmentation: DeFiNe

2.2.1 Introduction

2.2.2 Related Work

2.2.2.1 Monocular Depth Estimation.

2.2.2.2 Multi-view Stereo.

2.2.2.3 Video Depth Estimation.

2.2.2.4 Novel View Synthesis.

2.2.3 Methodology

2.2.3.1 Perceiver IO

2.2.3.2 Input-Level Embeddings

2.2.3.3 Geometric 3D Augmentations

2.2.3.4 Decoders

2.2.3.5 Losses

2.2.4 Experiments

2.2.4.1 Datasets

2.2.4.2 Stereo Depth Estimation

2.2.4.3 Ablation Study

2.2.4.4 Video Depth Estimation

2.2.4.5 Depth from Novel Viewpoints

2.2.5 Conclusion

2.3 Implicit representation as modular maps for scalable scene modelling: NeRFuser

2.3.1 Introduction

2.3.2 Related Work

2.3.3 Methodology

2.3.3.1 Registration from NeRF Re-rendering

2.3.3.2 Sampled-based NeRF Blending

2.3.3.3 Sample-wise Blending with IDW

2.3.4 Experiments

2.3.4.1 Datasets

2.3.4.2 NeRF Fusion (Registration plus Blending)

2.3.4.3 NeRF Registration

2.3.4.4 NeRF Blending

2.3.4.5 Ablation Studies

2.3.5 Conclusion

Chapter 3 Embodied Spatial Reasoning with Hybrid Systems

3.1 Benchmarking Mapping and Navigation Capabilities of LLMs: MANGO

3.1.1 Introduction

3.1.2 Related Work

3.1.3 A Benchmark for Text-Based Mapping and Navigation

3.1.3.1 Maze Collection: From Game Walkthroughs to Mazes