TL;DR
LEO-VL introduces an efficient scene representation and training scheme for 3D vision-language models, enabling scalable learning, improved robustness, and state-of-the-art performance across multiple benchmarks.
Contribution
The paper proposes the condensed feature grid (CFG), a compact scene representation, and SceneDPO, a novel post-training objective, advancing 3D VLMs in scalability, robustness, and task diversity.
Findings
LEO-VL achieves state-of-the-art results on 3D-VL benchmarks.
CFG significantly reduces token overhead while maintaining perceptual capacity.
SceneDPO enhances model robustness through contrastive signals.
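As background for the last finding: the summary above does not spell out the SceneDPO objective, so the following is only a sketch of the standard Direct Preference Optimization (DPO) loss for a single preference pair. That SceneDPO builds on this form is an assumption based on its name. The function names, the `beta` default, and the example log-probabilities are illustrative and not taken from the paper.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence-level log-probabilities under the trained policy
    and under a frozen reference model. This is NOT the paper's SceneDPO
    objective, only the generic DPO form.
    """
    # Implicit rewards: how much the policy up-weights each response
    # relative to the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


# Illustrative values: a policy identical to the reference,
# and a policy that has shifted probability toward the chosen answer.
loss_no_preference = dpo_loss(-1.0, -1.0, -1.0, -1.0)
loss_prefers_chosen = dpo_loss(-1.0, -5.0, -3.0, -3.0)
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one. A contrastive variant would presumably differ in how the rejected side of each pair is constructed, but the summary does not say how SceneDPO does this.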
Abstract
Developing vision-language models (VLMs) capable of understanding 3D scenes has been a longstanding research goal. Despite recent progress, 3D VLMs still struggle with spatial reasoning and robustness. We identify three key obstacles hindering their progress: (1) scene representation is constrained by a capacity-efficiency trade-off, which impedes scalable learning; (2) training data lacks a comprehensive scheme, with limited diversity across tasks and scene domains; and (3) models exhibit robustness deficiencies and lack effective post-training. To address these challenges, we first propose condensed feature grid (CFG), an efficient scene representation that significantly reduces token overhead while preserving strong perceptual capacity. Building on CFG, we introduce LEO-VL, a 3D VLM trained on over 700k 3D vision-language (3D-VL) data spanning four real-world indoor domains and five…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The motivation to reduce scene tokens is reasonable and addresses an important problem in the field.
2. LEO-VL achieves state-of-the-art results on several benchmarks.
3. The paper also introduces a new 3D-VL dataset, which adds value to the community.
1. My main concern lies in the fairness of the benchmark comparisons. Although LEO-VL achieves state-of-the-art results on several benchmarks, its VLM backbone and training data differ from those of the compared baselines. Therefore, it is difficult to verify whether the proposed CFG is truly more efficient and effective than previous methods.
2. Regarding the evaluation benchmarks, although it is common practice in the 3D-VL field to use image captioning metrics such as CIDEr and machine transl
1. The Condensed Feature Grid is an elegant and effective contribution. It achieves a ~3x token reduction (33% compression rate) compared to the raw voxel grid, appearing to improve performance by preserving spatial information.
2. The analysis in Section 4.3 demonstrating that scaling with 669k of simplistic, low-quality QA data degrades performance is an important contribution. This focus on "quality over scale" is a useful finding for the field.
3. The model achieves new SOTA performance on
1. While the model is trained on five tasks, the primary evaluation is almost exclusively on QA. The model's proficiency at other tasks (grounding, captioning) is not as rigorously benchmarked.
2. The vertical condensation (pooling) inherently loses information about the vertical distribution of features within a single (x, y) pillar. The paper argues RoPE encoding of height mitigates this, but there is no analysis of failure cases. For example, it's unclear if the model can distinguish a cup on
1. The overall writing of the paper is clear and easy to follow.
2. The proposed token reduction method achieves a good trade-off between performance and efficiency.
3. The authors conduct comprehensive experiments involving various post-training methods and effectively demonstrate the effectiveness of SceneDPO.
1. Recent relevant works, such as VG-LLM [1] and 3DRS [2], have not been cited or compared in the paper. Including a discussion or comparison with these approaches would strengthen the related work section and contextualize the contributions of this work.
2. The CFG method compresses height information, which may cause the model to lose fine-grained details of layered objects (e.g., a pillow on a bed or a pot on a table). This limitation could impact the model's ability to accurately represent
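The vertical-condensation concern raised in the reviews can be made concrete with a toy sketch. It assumes mean pooling over height and a single mean-height value per (x, y) pillar. Both are stand-ins chosen for illustration; the summary does not specify the paper's actual pooling operator or how height is fed to the RoPE encoding. The feature vectors and the `condense_vertically` helper are hypothetical.

```python
from collections import defaultdict


def condense_vertically(voxels):
    """Mean-pool voxel features over height for each (x, y) pillar.

    voxels: dict mapping (x, y, z) -> feature list.
    Returns a dict mapping (x, y) -> (pooled_feature, mean_height).
    Mean pooling and mean height are assumptions, not the paper's method.
    """
    pillars = defaultdict(list)
    for (x, y, z), feat in voxels.items():
        pillars[(x, y)].append((z, feat))

    condensed = {}
    for key, entries in pillars.items():
        n = len(entries)
        dim = len(entries[0][1])
        pooled = [sum(f[i] for _, f in entries) / n for i in range(dim)]
        mean_z = sum(z for z, _ in entries) / n
        condensed[key] = (pooled, mean_z)
    return condensed


# Toy 2-d features standing in for "table" and "cup" voxels.
TABLE, CUP = [1.0, 0.0], [0.0, 1.0]
cup_on_table = {(0, 0, 2): TABLE, (0, 0, 3): CUP}
cup_under_table = {(0, 0, 2): TABLE, (0, 0, 1): CUP}

on = condense_vertically(cup_on_table)[(0, 0)]
under = condense_vertically(cup_under_table)[(0, 0)]
# Two voxels collapse into one pillar token in each scene.
# The pooled features are identical; only the height value differs.
```

In this toy setup the pooled features alone cannot separate the two scenes, and only the height value distinguishes them. Whether a height encoding is enough in real scenes is exactly the failure-case analysis the reviewers ask for.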
