Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction
Ruihong Yin, Xuepeng Shi, Oleksandr Bailo, Marco Manfredi, Theo Gevers

TL;DR
Fast SceneScript introduces a multi-token prediction approach with confidence-guided decoding to enable fast and accurate 3D scene understanding using language models, significantly reducing inference time.
Contribution
It proposes a novel multi-token prediction method combined with confidence-guided decoding, improving speed and efficiency in language-based 3D scene understanding.
Findings
Fast SceneScript achieves up to 9 tokens per inference step.
It adds only about 7.5% additional parameters.
The method maintains high accuracy on synthetic and real-world benchmarks.
Abstract
Recent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation and 3D object detection, via unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene understanding. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. While MTP improves speed, unreliable token predictions can significantly reduce accuracy. To filter out unreliable tokens, we adapt self-speculative decoding (SSD) for structured language models and introduce confidence-guided decoding (CGD) with an improved scoring mechanism for token reliability.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
