POC-SLT: Partial Object Completion with SDF Latent Transformers
Faezeh Zakeri, Raphael Braun, Lukas Ruppert, Henrik P.A. Lensch

TL;DR
This paper introduces POC-SLT, a transformer-based method operating on latent SDF patches for 3D shape completion from partial data, showing significant improvements over existing methods.
Contribution
It proposes a novel transformer approach on latent SDF patches for 3D shape completion, leveraging a VAE for smooth latent encoding and outperforming state-of-the-art techniques.
Findings
Outperforms baseline methods in shape completion quality
Effective on partial observations from ShapeNet and ABC datasets
Significant quantitative and qualitative improvements
Abstract
3D geometric shape completion hinges on representation learning and a deep understanding of geometric data. Without profound insights into the three-dimensional nature of the data, this task remains unattainable. Our work addresses this challenge of 3D shape completion given partial observations by proposing a transformer operating on the latent space representing Signed Distance Fields (SDFs). Instead of a monolithic volume, the SDF of an object is partitioned into smaller high-resolution patches leading to a sequence of latent codes. The approach relies on a smooth latent space encoding learned via a variational autoencoder (VAE), trained on millions of 3D patches. We employ an efficient masked autoencoder transformer to complete partial sequences into comprehensive shapes in latent space. Our approach is extensively evaluated on partial observations from ShapeNet and the ABC dataset…
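The patch-sequence idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `partition_into_patches` and `mask_sequence`, the toy grid size, and the every-other-patch masking pattern are all assumptions made for the example, and the VAE that would turn each patch into a latent code is omitted.

```python
import numpy as np

def partition_into_patches(sdf: np.ndarray, patch: int) -> np.ndarray:
    """Split a cubic SDF grid of side N into (N/patch)^3 cubic patches.

    Returns shape (num_patches, patch, patch, patch), ordered row-major
    over the patch grid. Hypothetical helper, not from the paper.
    """
    n = sdf.shape[0]
    assert sdf.shape == (n, n, n) and n % patch == 0
    g = n // patch  # patches per axis
    # Split every axis into (patch-grid index, offset inside the patch),
    # then move the three grid indices to the front.
    blocks = sdf.reshape(g, patch, g, patch, g, patch).transpose(0, 2, 4, 1, 3, 5)
    return blocks.reshape(g ** 3, patch, patch, patch)

def mask_sequence(num_tokens: int, observed: np.ndarray) -> np.ndarray:
    """Boolean mask over the token sequence: True where a patch is missing."""
    mask = np.ones(num_tokens, dtype=bool)
    mask[observed] = False
    return mask

# Toy input (assumed sizes): a 16^3 grid holding the SDF of a sphere,
# cut into 4^3-voxel patches, which gives a sequence of 64 tokens.
n, patch = 16, 4
coords = np.stack(np.meshgrid(*[np.arange(n)] * 3, indexing="ij"), axis=-1)
sdf = np.linalg.norm(coords - (n - 1) / 2, axis=-1) - 5.0

patches = partition_into_patches(sdf, patch)
# Pretend every other patch was observed; the rest would be completed
# by the transformer in latent space.
mask = mask_sequence(len(patches), observed=np.arange(0, len(patches), 2))
```

In the setting the abstract describes, each observed patch would be encoded by the VAE into a latent code, and the transformer would predict the codes at the positions where `mask` is true.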
Peer Reviews
Decision: Submitted to ICLR 2025
1. The paper is easy to understand. 2. The experimental results are better than the compared methods.
The novelty of this paper is very limited. 1. The division of high-resolution voxels into smaller patches is reasonable but straightforward, and thus of limited innovation. 2. The P-VAE is almost identical to the original VAE, without any adaptation or improvement for this task. 3. The SDF-Latent-Transformer idea is very similar to the masked autoencoder [1], so it lacks novelty. In short, the method proposed in this paper is somewhat like a combination of different well-known models, ther
1. The authors proposed an efficient architecture that runs much faster than previous diffusion-based methods and auto-regressive-based methods and still demonstrated decent quantitative and qualitative results on shape completion benchmarks. The efficient design and fast running speed are appreciated for possible real-world applications. 2. This paper addressed an interesting problem of 3D shape completion, which could lead to possible applications in 3D reconstruction and robotics. 3. The au
1. Lack of ablative study on the resolutions of the patch size and number of patches. The 32^3 SDF volume seems to be a relatively large SDF with lots of information. It will be good to have an ablative study with smaller patch sizes, such as 16^3 or 8^3. 2. The presentation of the proposed method is vague and confusing. It will be much easier for the reader to understand if the authors point out they only use a transformer with 8x8x8 context length, and each token encodes the information of a 3
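The trade-off behind the requested patch-size ablation can be made concrete with a little arithmetic. The sketch below assumes a total volume side of 256 voxels, inferred from the reviewer's figures (8 patches per axis, 32 voxels per patch) rather than stated in the text above, and assumes standard full self-attention, whose cost grows with the square of the sequence length. The helper `sequence_length` is hypothetical.

```python
def sequence_length(volume_side: int, patch_side: int) -> int:
    """Number of patch tokens for a cubic volume cut into cubic patches."""
    if volume_side % patch_side != 0:
        raise ValueError("patch side must divide the volume side")
    per_axis = volume_side // patch_side
    return per_axis ** 3

# Assumption: 8 patches per axis * 32 voxels per patch = 256 voxels per axis.
VOLUME_SIDE = 256

for patch_side in (32, 16, 8):
    tokens = sequence_length(VOLUME_SIDE, patch_side)
    # Full self-attention compares every token with every other token.
    attention_pairs = tokens ** 2
    print(f"patch {patch_side}^3: {tokens} tokens, {attention_pairs} attention pairs")
```

Under these assumptions, halving the patch side multiplies the token count by 8 and the number of attention pairs by 64, which is one plausible reason a smaller-patch ablation would be expensive to run.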
* Fast inference time: a key advantage over other sequential (e.g., autoregressive) approaches is the use of the MAE decoder.
* Modular Patch-VAE architecture enables generalization by pre-training on small-scale patches, an effective component even if previously used in related works.
* High-quality shape completion, especially in capturing fine-grained details.
* Simple yet effective approach that avoids unnecessary complexity with a straightforward architecture and objective.
- Potential "leakage" issue: Non-masked voxels adjacent to masked patches may encode distances to missing parts, indirectly leaking information about the regions to be completed. Discussing this limitation and potential remedies (e.g., using TSDF instead of SDF) would strengthen the work. Beyond an ablation study of SDF vs. TSDF, it should first be checked qualitatively how much information about the masked patches is in fact encapsulated in the non-masked patches. Masking in
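The leakage concern raised in this review can be illustrated with a one-dimensional toy example. Everything here is an assumption made for illustration (the interval object, the patch size of 4 cells, the truncation distance of 2, and the helper names); it is not the paper's setup or an experiment from the review.

```python
def interval_sdf(x: float, lo: float, hi: float) -> float:
    """Signed distance from x to the 1D solid [lo, hi]; negative inside."""
    return max(lo - x, x - hi)

def truncate(d: float, tau: float) -> float:
    """Clamp a signed distance to [-tau, tau], as a TSDF does."""
    return max(-tau, min(tau, d))

PATCH = 4   # cells per patch (assumed)
TAU = 2.0   # truncation distance (assumed)

# 16 cells; the object occupies [8, 12], so it lies in patches 2 and 3.
sdf = [interval_sdf(x, 8.0, 12.0) for x in range(16)]
tsdf = [truncate(d, TAU) for d in sdf]

# Suppose patches 2 and 3 are masked and patch 0 (cells 0..3) is visible.
visible_sdf = sdf[0:PATCH]
visible_tsdf = tsdf[0:PATCH]

# With a full SDF, each visible value d at cell x pins the nearest surface
# at x + d, even though that surface sits inside a masked patch.
implied_surface = [x + d for x, d in zip(range(PATCH), visible_sdf)]

print("surface implied by visible SDF values:", implied_surface)
print("visible TSDF values in patch 0:", visible_tsdf)
print("TSDF values in patch 1 (cells 4..7):", tsdf[PATCH:2 * PATCH])
```

In this toy case, every full-SDF value in the far-away visible patch points to the same surface location inside the masked region, while the truncated values are constant and carry no such hint. The patch directly next to the masked region still contains one non-saturated value, so truncation narrows the leakage to a band of width `TAU` rather than removing it.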
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification
Methods: Approximate Bayesian Computation
