ObitoNet: Multimodal High-Resolution Point Cloud Reconstruction
Apoorv Thapliyal, Vinay Lanka, Swathi Baskaran

TL;DR
ObitoNet introduces a multimodal transformer-based framework that combines image semantics and geometric details to achieve high-resolution point cloud reconstruction, improving robustness in sparse or noisy data scenarios.
Contribution
It presents a novel integration of Vision Transformers and point cloud tokenization with a transformer decoder for enhanced 3D reconstruction.
Findings
Effective in reconstructing high-resolution point clouds
Robust performance with sparse and noisy data
Combines semantic and geometric features successfully
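The third finding rests on the Cross Attention mechanism named in the abstract. The following is a minimal single-head sketch in NumPy, not the authors' implementation. It assumes the point tokens act as queries and the image tokens supply keys and values; the summary does not state the direction. The token counts, the width `d_model`, and the random projection matrices `w_q`, `w_k`, `w_v` are all hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross attention.

    queries: (Nq, d) tokens of one modality (assumed here: point tokens)
    context: (Nc, d) tokens of the other modality (assumed here: image tokens)
    """
    q = queries @ w_q                      # (Nq, d)
    k = context @ w_k                      # (Nc, d)
    v = context @ w_v                      # (Nc, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (Nq, Nc) similarity scores
    weights = softmax(scores, axis=-1)     # each query's weights sum to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 16                                        # hypothetical feature width
point_tokens = rng.normal(size=(64, d_model))       # stand-in for tokenizer output
image_tokens = rng.normal(size=(196, d_model))      # stand-in for ViT patch features
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

fused, weights = cross_attention(point_tokens, image_tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `fused` is a mixture of image-token values weighted by how strongly that point token attends to each image patch, which is one way semantic features can be attached to geometric tokens.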
Abstract
ObitoNet employs a Cross Attention mechanism to integrate multimodal inputs: Vision Transformers (ViT) extract semantic features from images, and a point cloud tokenizer processes geometric information using Farthest Point Sampling (FPS) and K-Nearest Neighbors (KNN) to capture spatial structure. The learned multimodal features are fed into a transformer-based decoder for high-resolution point cloud reconstruction. This approach leverages the complementary strengths of both modalities (rich image features and precise geometric details), ensuring robust point cloud generation even in challenging conditions such as sparse or noisy data.
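The FPS and KNN tokenization step can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the paper's code: the function names, the group count `n_groups`, the neighborhood size `k`, and the center-relative normalization of each patch are all hypothetical choices, and the input is a random cloud rather than real data.

```python
import numpy as np

def farthest_point_sampling(points, n_centers, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = np.empty(n_centers, dtype=np.int64)
    chosen[0] = start
    # Squared distance from every point to its nearest chosen center so far.
    min_d2 = np.sum((points - points[start]) ** 2, axis=1)
    for i in range(1, n_centers):
        chosen[i] = int(np.argmax(min_d2))
        d2 = np.sum((points - points[chosen[i]]) ** 2, axis=1)
        min_d2 = np.minimum(min_d2, d2)
    return chosen

def knn_group(points, center_idx, k):
    """Indices of the k nearest points to each center (center included)."""
    centers = points[center_idx]                                          # (G, 3)
    d2 = np.sum((centers[:, None, :] - points[None, :, :]) ** 2, axis=2)  # (G, N)
    return np.argsort(d2, axis=1)[:, :k]                                  # (G, k)

def tokenize(points, n_groups, k):
    """Split a cloud into n_groups local patches of k points each."""
    center_idx = farthest_point_sampling(points, n_groups)
    nn_idx = knn_group(points, center_idx, k)
    # Assumption: express each patch relative to its center.
    patches = points[nn_idx] - points[center_idx][:, None, :]             # (G, k, 3)
    return center_idx, points[center_idx], patches

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))            # synthetic stand-in for a point cloud
center_idx, centers, patches = tokenize(cloud, n_groups=64, k=32)
print(centers.shape, patches.shape)
```

FPS spreads the centers evenly over the cloud, while KNN gathers the local neighborhood around each center; in a full pipeline each patch would then be embedded into a token by a small learned encoder, which is omitted here.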
Taxonomy
Topics: Remote Sensing and LiDAR Applications · 3D Surveying and Cultural Heritage · Image Processing and 3D Reconstruction
Methods: Softmax · Attention Is All You Need
