Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT
Alaa Asfour, Christopher Indris, Leihan Chen, Tejas Vyas, Guanghui Wang

TL;DR
This paper introduces a knowledge distillation method that produces a lightweight 3D vision-language model, retaining much of the teacher's spatial reasoning at substantially lower computational cost.
Contribution
It presents a distillation framework that uses Hidden CoT, a set of learnable latent tokens, to improve reasoning in a compact 3D VLM, retaining 54-72% of teacher performance at 8.7x lower inference latency.
Findings
Achieves 8.7x lower inference latency and a 3x smaller model (2.29B vs. 7B parameters) while retaining 54-72% of teacher performance.
Introduces Hidden CoT, a latent scratchpad for reasoning without chain-of-thought data.
Reaches 68-72% accuracy on spatial reasoning tasks on ScanNet and 3D-FRONT.
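The Hidden CoT finding above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sequence layout (visual, then question, then latent slots, then answer), and the loss mask are all assumptions made for illustration. The idea shown is that the latent slots are learnable vectors placed before the answer and receive no direct supervision, so no chain-of-thought annotations are needed.

```python
import random

def make_latent_tokens(num_latent, dim, seed=0):
    """Hypothetical initializer: small random vectors standing in for learnable latent tokens."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_latent)]

def build_decoder_input(visual_emb, question_emb, latent_tokens):
    """Assumed layout: the latent scratchpad sits after the prompt and before the answer."""
    return list(visual_emb) + list(question_emb) + list(latent_tokens)

def answer_loss_mask(num_visual, num_question, num_latent, num_answer):
    """Mask is 1 only on answer positions; prompt and latent slots carry no target."""
    prefix_len = num_visual + num_question + num_latent
    return [0] * prefix_len + [1] * num_answer

# Example: 3 visual tokens, 2 question tokens, 4 latent slots, 5 answer tokens.
dim = 8
latents = make_latent_tokens(num_latent=4, dim=dim)
prefix = build_decoder_input([[0.0] * dim] * 3, [[1.0] * dim] * 2, latents)
mask = answer_loss_mask(3, 2, 4, 5)
```

In a real model the latent vectors would be trainable parameters, updated only through the gradient of the answer loss that flows back through them.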
Abstract
Large-scale 3D vision-language models (VLMs) like LLaVA-3D offer strong spatial reasoning but are difficult to deploy due to high computational costs. We propose a knowledge distillation framework that transfers spatial reasoning from a 7B teacher to a 2.29B student model. Our approach achieves 8.7x lower inference latency and a 3x reduction in model size while retaining 54-72% of the teacher's performance. The framework utilizes VGGT as the vision encoder and a multi-task distillation pipeline with uncertainty-aware loss weighting. To improve reasoning without chain-of-thought (CoT) data, we introduce "Hidden CoT": learnable latent tokens that serve as an internal scratchpad before answer generation. This is the first use of latent scratchpad reasoning in distilled 3D VLMs. The student model jointly performs spatial description, depth estimation, and object detection. Experiments on…
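The abstract does not state the exact form of the uncertainty-aware loss weighting. One common formulation, assumed here purely for illustration, gives each task a learnable log-variance s and combines the task losses as the sum of exp(-s) * L + s. The task names and the function below are hypothetical.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses using learnable log-variances (assumed formulation).

    Each task contributes exp(-s) * loss + s, where s is that task's log-variance.
    A larger s down-weights the task, while the additive s term penalizes
    inflating the uncertainty without bound.
    """
    if task_losses.keys() != log_vars.keys():
        raise ValueError("task_losses and log_vars must have the same task names")
    total = 0.0
    for name, loss in task_losses.items():
        s = log_vars[name]
        total += math.exp(-s) * loss + s
    return total

# Hypothetical losses for the three student tasks named in the abstract.
losses = {"description": 2.0, "depth": 0.5, "detection": 1.0}
equal = uncertainty_weighted_loss(losses, {k: 0.0 for k in losses})
reweighted = uncertainty_weighted_loss(
    losses, {"description": math.log(2.0), "depth": 0.0, "detection": 0.0}
)
```

With all log-variances at zero, the combined loss reduces to the plain sum of the task losses. Under this assumed form, each term is minimized with respect to s at s = log(L) for a positive loss L, so tasks with larger losses are automatically down-weighted during training.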
