TL;DR
Fase3D introduces an encoder-free, Fourier-based 3D large multimodal model that efficiently processes large-scale point cloud data without heavy visual encoders, achieving competitive performance.
Contribution
The paper presents Fase3D, a novel encoder-free 3D LMM using Fourier transforms for efficient, scalable, and permutation-invariant scene understanding.
Findings
Fase3D achieves performance competitive with encoder-based 3D LMMs.
It does so with significantly lower computational cost and fewer parameters.
The approach effectively models global context in large 3D scenes.
Abstract
Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we…
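To make the tokenizer idea in the abstract concrete, here is a minimal sketch of the two ingredients it names: serializing an unordered point cloud into a sequence, then mixing information across that sequence with an FFT instead of self-attention. Everything specific here is an assumption for illustration and not taken from the paper: the Morton (Z-order) curve as the serialization, the voxel size `grid`, the helper names `morton_code`, `serialize`, `fft`, and `fourier_mix`, and the FNet-style choice of keeping only the real part of the transform. The excerpt does not say which serialization or which FFT layout Fase3D actually uses.

```python
import cmath
import random


def morton_code(ix, iy, iz, bits=10):
    """Interleave the bits of three non-negative integer voxel indices (Z-order)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def serialize(points, grid=0.05, bits=10):
    """Order points along a Z-order curve (assumes non-negative coordinates).

    Sorting by (Morton code, coordinates) gives the same sequence no matter
    how the input list was ordered, which is where permutation invariance
    comes from in this sketch.
    """
    def key(p):
        ix, iy, iz = (int(c / grid) for c in p)
        return (morton_code(ix, iy, iz, bits), p)
    return sorted(points, key=key)


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fourier_mix(tokens):
    """FFT over the token axis for each channel, keeping the real part.

    Every output token depends on every input token, at O(n log n) cost
    rather than the O(n^2) of full self-attention.
    """
    n = len(tokens)
    if n == 0 or n & (n - 1):
        raise ValueError("sequence length must be a power of two")
    channels = zip(*tokens)
    mixed = [[v.real for v in fft([complex(c) for c in ch])] for ch in channels]
    return [tuple(col) for col in zip(*mixed)]


random.seed(0)
points = [(random.random(), random.random(), random.random()) for _ in range(8)]
tokens = fourier_mix(serialize(points))

# Shuffling the input does not change the result, because serialization
# restores a canonical order before the FFT is applied.
shuffled = points[:]
random.shuffle(shuffled)
assert fourier_mix(serialize(shuffled)) == tokens
```

In this sketch the raw xyz coordinates stand in for token features; a real tokenizer would presumably embed richer per-point or per-patch features before mixing.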
