Exploring the Potential of Encoder-free Architectures in 3D LMMs
Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li

TL;DR
This paper introduces ENEL, the first encoder-free 3D large multimodal model, demonstrating its competitive performance and potential to replace traditional encoder-based architectures in 3D understanding tasks.
Contribution
It presents a novel encoder-free architecture for 3D LMMs, including new semantic encoding and hierarchical geometry aggregation strategies, achieving state-of-the-art results.
Findings
ENEL achieves competitive results on classification, captioning, and VQA tasks.
Encoder-free architecture shows high potential for 3D understanding.
The code is publicly available for further research.
Abstract
Encoder-free architectures have been preliminarily explored in the 2D Large Multimodal Models (LMMs), yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to alleviate the challenges of encoder-based 3D LMMs. These long-standing challenges include the failure to adapt to varying point cloud resolutions during inference and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the pre-trained encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses. And we present the Hybrid…
Peer Reviews
Decision: ICLR 2026 Poster
1. It is interesting to consider an encoder-free alternative for 3D LLM solutions. It has the potential to overcome the restricted resolution issue, and to reduce the semantic gaps between encoder and LLM. 2. The proposed approach achieves good performance on 3D benchmarks and is sometimes state of the art.
1. The approach is quite incremental in terms of contributions. 1) For hybrid loss choices, combining masked and reconstruction losses seems less exciting to me. 2) The geometry aggregation is designed specifically for 3D point clouds. In some way, it moves the functionality of a 3D point encoder inside the LLM. Although it might work, this leads to the loss of generality of using a general-purpose LLM. 2. Back to the claims of the paper. Two limitations motivate this work. 1) The variable resol
1. The paper conducts extensive ablation studies comparing various encoder-free strategies, including different self-supervised losses and architectural components. This thorough investigation provides valuable insights and practical guidance for the community when designing 3D LMMs. 2. The experiments cover multiple benchmarks across different tasks, with both GPT-4 evaluation and traditional metrics reported.
1. All core investigations are built on PointLLM (Aug 2023), a 2-year-old model, while conveniently citing ShapeLLM (2024) only in Table 5. This is fundamentally problematic—the rapid evolution of LLMs makes findings from ancient baselines nearly irrelevant. The authors fail to justify why encoder-free strategies weren't explored on ShapeLLM, raising serious questions about whether these "insights" generalize at all or merely exploit weaknesses of an obsolete architecture. 2. Figure 3 is confus
This paper poses the ambitious question of whether 3D LMMs can be "encoder-free." It clearly articulates two key challenges motivating this approach—resolution mismatch and semantic misalignment—using qualitative visualisations. To address these, the authors present a concrete system (ENEL) featuring two main contributions: 1. A thorough empirical exploration of self-supervised objectives (including masked modelling, reconstruction, KD, and contrastive), which culminates in a final "Hybrid" lo
1. “Encoder-free” claim is overstated. The “token embedding module” is explicitly a lightweight Point-PN–style hierarchy: FPS downsampling, k-NN grouping, and learnable layers repeated 2–4 times, followed by a projection. This is a parameter-efficient encoder, not the near-identity/VQ/linear tokenizer style commonly implied by “encoder-free” in 2D works. The paper itself calls it “a lightweight variant of Point-PN.” Please either narrow the claim or compare against true encoder-free tokenizers (
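The last review describes the token embedding module as repeated stages of FPS downsampling, k-NN grouping, and learnable layers. As a rough illustration of the two non-learnable steps only, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the choice of the first point as the FPS seed, and the use of centre-relative coordinates are all assumptions made for this example, and the learnable layers and final projection are omitted.

```python
import numpy as np


def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedy FPS: return indices of `num_samples` mutually distant points.

    `start_index` (the seed point) is an assumption of this sketch.
    """
    selected = np.empty(num_samples, dtype=np.int64)
    selected[0] = start_index
    # Squared distance from every point to its nearest selected centre so far.
    min_sq_dist = np.sum((points - points[start_index]) ** 2, axis=1)
    for i in range(1, num_samples):
        next_index = int(np.argmax(min_sq_dist))
        selected[i] = next_index
        sq_dist = np.sum((points - points[next_index]) ** 2, axis=1)
        min_sq_dist = np.minimum(min_sq_dist, sq_dist)
    return selected


def knn_group(points, centre_indices, k):
    """Return a (num_centres, k) array: indices of each centre's k nearest points."""
    centres = points[centre_indices]
    diff = centres[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return np.argsort(sq_dist, axis=1)[:, :k]


def downsample_stage(points, num_centres, k):
    """One hypothetical stage: FPS centres plus their k-NN neighbourhoods.

    Returns the centre coordinates and each neighbourhood expressed
    relative to its centre. A learnable layer would normally consume
    these groups; that part is left out here.
    """
    centre_idx = farthest_point_sampling(points, num_centres)
    group_idx = knn_group(points, centre_idx, k)
    centres = points[centre_idx]
    local_groups = points[group_idx] - centres[:, None, :]
    return centres, local_groups


rng = np.random.default_rng(0)
cloud = rng.standard_normal((1024, 3))      # synthetic point cloud
centres, groups = downsample_stage(cloud, num_centres=128, k=16)
print(centres.shape, groups.shape)
```

Because both steps operate on however many points are supplied, the same stage can be applied to clouds of different sizes, which is the property the resolution discussion in the reviews turns on. Whether that counts as "encoder-free" is exactly the point the reviewer disputes.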
Taxonomy
Topics: Modular Robots and Swarm Intelligence · 3D IC and TSV technologies · Advanced Materials and Mechanics
