Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding

Yerim Jeon, Miso Lee, WonJun Moon, and Jae-Pil Heo

arXiv:2512.02487 · cs.CV · March 25, 2026

Open Access

TL;DR

This paper introduces 3D-SLIM, a masking strategy for LLM decoders that replaces the causal attention mask with one tailored to the spatial structure of 3D scenes, improving 3D scene-language reasoning without adding parameters.

Contribution

The paper proposes 3D-SLIM, an adaptive attention masking method that replaces the causal mask with spatially aware masks, enabling better 3D reasoning in LLMs without architectural changes.

Findings

01. Substantial performance improvements across multiple benchmarks.

02. Effective spatially aware attention without additional parameters.

03. Validation across diverse 3D scene-language tasks.

Abstract

Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an…
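The contrast the abstract draws between a causal mask and a mask driven by scene geometry can be illustrated with a small sketch. This is not the paper's formulation: the Geometry-adaptive Mask is described as density-based, and its exact rule is not given in the text available here. The example below swaps in a plain fixed-k nearest-neighbour rule purely to show how a mask can depend on object positions instead of token order. The names `causal_mask`, `proximity_mask`, `centers`, and `k` are hypothetical.

```python
import math

def causal_mask(n):
    """Standard decoder mask: token i may attend to token j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def proximity_mask(centers, k):
    """Order-independent mask (hypothetical rule, NOT the paper's definition):
    each object attends to itself and its k nearest objects, measured by
    Euclidean distance between object centers."""
    n = len(centers)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        # Rank the other objects by distance to object i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(centers[i], centers[j]),
        )
        mask[i][i] = True
        for j in others[:k]:
            mask[i][j] = True
    return mask

# Toy scene (assumed data): objects 0, 2, 3 are clustered; object 1 is far away.
centers = [(0.0, 0.0, 0.0), (5.0, 5.0, 0.0), (0.2, 0.1, 0.0), (0.1, 0.4, 0.0)]

print(causal_mask(4)[0])              # [True, False, False, False]
print(proximity_mask(centers, 2)[0])  # [True, False, True, True]
```

Under the causal mask, the first object token sees only itself simply because it happens to come first in the sequence. Under the proximity rule, the same object attends to its two nearest neighbours regardless of where they sit in the token order, and reordering the objects only reorders the rows and columns of the mask.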

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · 3D Shape Modeling and Analysis