LiTo: Surface Light Field Tokenization
Jen-Hao Rick Chang, Xiaoming Zhao, Dorian Chan, Oncel Tuzel

TL;DR
LiTo introduces a unified 3D latent representation that captures both geometry and view-dependent appearance, enabling realistic rendering of effects like specular highlights and reflections from RGB-depth samples.
Contribution
This work presents a novel surface light field tokenization method that jointly models geometry and view-dependent appearance in a compact latent space.
Findings
Achieves higher visual quality than existing methods.
Successfully reproduces complex view-dependent effects.
Enables consistent appearance generation conditioned on a single image.
Abstract
We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverages the fact that RGB-depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation reproduces view-dependent effects such as specular highlights and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the…
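The abstract's observation that RGB-depth images provide samples of a surface light field can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the pinhole intrinsics `K`, the `cam_to_world` pose convention, and the use of z-depth are all assumptions made for illustration. Each valid pixel is unprojected to a surface point, paired with the unit direction from that point toward the camera and the observed color, and a random subset of these samples is then drawn, as would be done before encoding.

```python
import numpy as np

def rgbd_to_light_field_samples(rgb, depth, K, cam_to_world):
    """Unproject an RGB-D image into (point, view direction, color) samples.

    Assumptions (hypothetical, not from the paper): pinhole intrinsics K (3x3),
    a 4x4 camera-to-world pose, and depth stored as z-depth with 0 = invalid.
    """
    h, w = depth.shape
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Back-project: camera-space point = z * K^{-1} [u, v, 1]^T.
    pts_cam = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    pts_world = pts_cam @ R.T + t
    # View direction: from the surface point toward the camera center t.
    view = t - pts_world
    norm = np.linalg.norm(view, axis=-1, keepdims=True)
    valid = (depth.reshape(-1) > 0) & (norm[:, 0] > 0)
    dirs = view[valid] / norm[valid]
    return pts_world[valid], dirs, rgb.reshape(-1, 3)[valid]

def random_subsample(points, dirs, colors, n, rng):
    """Draw at most n samples without replacement."""
    idx = rng.choice(len(points), size=min(n, len(points)), replace=False)
    return points[idx], dirs[idx], colors[idx]
```

Samples gathered from several views of the same object could be concatenated before subsampling, so that one surface point is observed from multiple directions; how the paper's encoder consumes such samples is not described in this summary.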
Peer Reviews
Decision: ICLR 2026 Poster
1. The writing is clear and easy to follow.
2. The proposed pipeline appears novel; however, the motivation could be made more convincing or better justified.
Unclear motivation: The motivation behind the proposed method remains unclear. Existing approaches typically avoid incorporating view-direction information because doing so simplifies subsequent relighting tasks. If all lighting- and view-related information is modeled jointly, it becomes questionable how the proposed model can perform relighting and be naturally integrated into a scene without introducing inconsistent illumination or lighting variations. The authors should clarify the motivati
- The paper is well written and mostly easy to follow.
- Modeling of view-dependent appearance while retaining performance on geometry modeling is a novel and valuable contribution.
- Experiments and ablations are sufficient to evaluate architecture and performance. LiTo consistently achieves strong empirical results.
- Some architectural details are missing for reproducibility (see questions).
1. Comprehensive experiments. The experimental section is thorough, with well-designed ablation studies that clearly justify key architectural and training choices.
2. High reconstruction fidelity. The method achieves superior fidelity in input images, which is an important aspect of 3D generation quality.
- The view-dependent color entangles the representation with environment lighting. While this improves reconstruction fidelity, it limits the method’s applicability for tasks requiring relighting or lighting-invariant representations.
Taxonomy
Topics: Computer Graphics and Visualization Techniques · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis
