Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding
Yongyuan Liang, Xiyao Wang, Yuanchen Ju, Jianwei Yang, Furong Huang

TL;DR
Lemon is a unified transformer model that processes 3D point cloud patches and language tokens jointly as a single sequence, enabling scalable 3D spatial understanding and reasoning.
Contribution
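Lemon introduces a unified architecture for 3D multimodal understanding that handles point cloud and language inputs in one transformer rather than through modality-specific encoders and cross-modal alignment modules, with the stated aim of improving parameter efficiency and scaling.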
Findings
Achieves state-of-the-art results on 3D understanding tasks
Demonstrates robust scaling with increased model size and data
Supports diverse tasks from object recognition to spatial reasoning
Abstract
Scaling large multimodal models (LMMs) to 3D understanding poses unique challenges: point cloud data is sparse and irregular, existing models rely on fragmented architectures with modality-specific encoders, and training pipelines often suffer from instability and poor scalability. We introduce Lemon, a unified transformer architecture that addresses these challenges by jointly processing 3D point cloud patches and language tokens as a single sequence. Unlike prior work that relies on modality-specific encoders and cross-modal alignment modules, this design enables early spatial-linguistic fusion, eliminates redundant encoders, improves parameter efficiency, and supports more effective model scaling. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context, and a three-stage training curriculum that progressively…
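The idea of feeding point cloud patches and language tokens to one transformer as a single sequence can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the patchification scheme, and the reviews quoted on this page mention only "splits/FPS/standardization" plus a projector. The sketch therefore assumes farthest point sampling (FPS) for patch centres, nearest-neighbour grouping, per-patch standardization, and a single linear projector. All function names, shapes, and parameters (`patchify`, `build_sequence`, `n_patches`, `patch_size`, `d_model`) are hypothetical.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, seed=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    rng = np.random.default_rng(seed)
    chosen = np.empty(n_samples, dtype=np.int64)
    chosen[0] = rng.integers(points.shape[0])
    # Distance from every point to its nearest chosen centre so far.
    min_dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for i in range(1, n_samples):
        chosen[i] = int(np.argmax(min_dist))
        dist = np.linalg.norm(points - points[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return chosen

def patchify(points, n_patches, patch_size):
    """Assumed scheme: FPS centres, each grouped with its nearest neighbours."""
    centres = points[farthest_point_sampling(points, n_patches)]
    # Pairwise centre-to-point distances, shape (n_patches, n_points).
    d = np.linalg.norm(centres[:, None, :] - points[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :patch_size]
    patches = points[idx]                          # (n_patches, patch_size, 3)
    # Standardize: centre each patch on its FPS point, scale to unit radius.
    patches = patches - centres[:, None, :]
    radius = np.linalg.norm(patches, axis=-1).max(axis=1, keepdims=True)
    patches = patches / np.maximum(radius, 1e-8)[..., None]
    return patches, centres

def build_sequence(points, text_embeddings, proj_w, proj_b,
                   n_patches=8, patch_size=32):
    """Project patches to d_model and prepend them to the text embeddings."""
    patches, centres = patchify(points, n_patches, patch_size)
    # Flatten each patch and append its centre so absolute position survives.
    feats = np.concatenate([patches.reshape(n_patches, -1), centres], axis=1)
    point_tokens = feats @ proj_w + proj_b         # (n_patches, d_model)
    # One sequence: point tokens followed by language tokens.
    return np.concatenate([point_tokens, text_embeddings], axis=0)

# Toy usage with random data (hypothetical sizes).
rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))                 # synthetic point cloud
d_model = 64
w = rng.normal(scale=0.02, size=(32 * 3 + 3, d_model))
b = np.zeros(d_model)
text = rng.normal(size=(5, d_model))               # stand-in text embeddings
seq = build_sequence(cloud, text, w, b)
print(seq.shape)  # (13, 64): 8 point tokens + 5 text tokens
```

In this sketch the only 3D-specific learned parameters are the projector weights; the combined sequence would then be passed to a standard transformer, which is where the early spatial-linguistic fusion described in the abstract would take place.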
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
Results across object/scene tasks and qualitative comparisons vs 2D VLMs (e.g., GPT-4V) are extensive; Lemon-7B appears competitive on several benchmarks.
The paper contrasts Lemon with “non-transformer” encoders, but PointLLM and ShapeLLM are also transformer-based. The real distinction is removing the separate 3D encoder block in favor of early fusion inside one transformer—this needs to be framed clearly and compared fairly. Lemon still performs hierarchical geometric patchification (splits/FPS/standardization) plus a projector—effectively a tokenizer with hierarchy. Without compute/param attribution, the claim of eliminating redundant encoder
1. Unlike encoder-based methods, this paper explores an **encoder-free paradigm**, representing a novel design direction. 2. The **3D patch tokenization** strategy is compelling and may flexibly support both scene-level and object-level 3D representation.
1. It remains unclear how performance would change if the tokenization method were not patch-based, but instead adopted a **hierarchical downsampling scheme** similar to PointBERT. 2. The comparison setup may not be fully fair: although re-finetuning balances the training distribution, the **training data size and LLM base models differ** across methods. It would be more convincing to evaluate with **Vicuna-1.1 and PointLLM data/training stages** to demonstrate encoder-free potential. 3. The bas
1. Lemon does not rely on a pre-trained 3D encoder but instead develops a structured patchification and tokenization scheme, which is an interesting design choice. However, the authors do not provide an ablation study to justify the effectiveness of this component. 2. The proposed three-stage training curriculum progressively builds the model’s capability from object-level to scene-level understanding.
1. The paper’s main claimed contribution is the unified architecture for 3D point clouds and language. However, in my view, prior works such as **PointLLM** have already adopted a unified token-based framework for fine-tuning LLMs using both 3D point tokens and text tokens. Lemon mainly replaces the 3D encoder with FPS sampling and a linear projector, without introducing substantial architectural novelty. Therefore, I remain skeptical about the level of innovation. 2. The discussion of related w
1. The paper proposes a strong 3D-LLM with a simple but effective point cloud tokenization strategy. 2. It provides evidence that conventional 3D encoders may limit general 3D understanding performance. 3. The proposed LEMON model achieves state-of-the-art results across diverse 3D understanding benchmarks under the presented training pipeline.
1. **Potential unfair comparison with baselines:** Most baseline models are trained on smaller or different datasets. A controlled fine-tuning experiment (e.g., on ScanQA and SQA in Stage-3) is recommended to isolate the effect of training data size. 2. **Limited evaluations:** The evaluation on 3D-GRAND is primarily used to assess hallucination in 3D-LLMs rather than to rigorously examine their 3D spatial understanding ability. The details of the *100 challenging 3D spatial QA* set (Line 264)
Taxonomy
Topics: Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis · Advanced Neural Network Applications
