Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

Alaa Asfour; Christopher Indris; Leihan Chen; Tejas Vyas; Guanghui Wang

arXiv:2605.09719·cs.CV·May 12, 2026

TL;DR

This paper introduces a knowledge distillation method that produces a lightweight 3D vision-language model, retaining much of the teacher's spatial reasoning ability while cutting inference latency and model size.

Contribution

The paper presents a distillation framework with Hidden CoT for a compact 3D VLM, retaining 54-72% of the teacher's performance at 8.7x lower inference latency.

Findings

01

Achieves 8.7x lower inference latency and 3x smaller size while retaining 54-72% of teacher performance.

02

Introduces Hidden CoT, a latent scratchpad for reasoning without chain-of-thought data.

03

Reaches 68-72% accuracy on spatial reasoning tasks on ScanNet and 3D-FRONT.
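
Finding 02 describes Hidden CoT as a latent scratchpad. The following is a minimal sketch of how learnable latent tokens might be placed ahead of the answer in the student's input sequence. The token ordering, the function names (init_latent_tokens, build_student_input), the token count, and the initialisation scale are all assumptions made for illustration; the source only states that the latent tokens come before answer generation.

```python
import random

def init_latent_tokens(num_latent, dim, seed=0):
    """Create num_latent embedding vectors (hypothetical helper).

    In a real model these would be trainable parameters updated by
    backpropagation; here they are plain lists of floats.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_latent)]

def build_student_input(visual_tokens, question_tokens, latent_tokens):
    """Concatenate embeddings so the scratchpad directly precedes the answer.

    Assumed ordering: visual tokens, then question tokens, then latent tokens.
    Returns the sequence and the position where answer decoding would begin.
    """
    sequence = visual_tokens + question_tokens + latent_tokens
    answer_start = len(sequence)
    return sequence, answer_start

dim = 4
visual = [[0.0] * dim for _ in range(3)]      # stand-in for vision encoder output
question = [[1.0] * dim for _ in range(2)]    # stand-in for question embeddings
latent = init_latent_tokens(num_latent=8, dim=dim)
sequence, answer_start = build_student_input(visual, question, latent)
```

In this layout the latent positions receive no text supervision of their own, so any benefit would have to come from gradients flowing back from the answer loss.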

Abstract

Large-scale 3D vision-language models (VLMs) like LLaVA-3D offer strong spatial reasoning but are difficult to deploy due to high computational costs. We propose a knowledge distillation framework that transfers spatial reasoning from a 7B teacher to a 2.29B student model. Our approach achieves 8.7x lower inference latency and a 3x reduction in model size while retaining 54-72% of the teacher's performance. The framework utilizes VGGT as the vision encoder and a multi-task distillation pipeline with uncertainty-aware loss weighting. To improve reasoning without chain-of-thought (CoT) data, we introduce "Hidden CoT": learnable latent tokens that serve as an internal scratchpad before answer generation. This is the first use of latent scratchpad reasoning in distilled 3D VLMs. The student model jointly performs spatial description, depth estimation, and object detection. Experiments on…
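
The abstract mentions uncertainty-aware loss weighting across the student's tasks but does not give the formula. One common formulation, assumed here rather than taken from the paper, assigns each task a learnable log-variance s_i and combines losses as the sum of exp(-s_i) * L_i + s_i. The sketch below uses a hypothetical function name and made-up loss values for the three tasks named in the abstract.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses with learned log-variances (assumed formulation).

    Each term is exp(-s_i) * L_i + s_i: a large s_i down-weights a noisy task,
    while the additive s_i keeps the model from inflating every variance.
    """
    if len(task_losses) != len(log_vars):
        raise ValueError("need one log-variance per task loss")
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))

# Illustrative values only: spatial description, depth estimation, object detection.
losses = [1.0, 2.0, 3.0]
equal_weights = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0])
downweighted = uncertainty_weighted_loss([4.0], [math.log(2.0)])
```

With all log-variances at zero the result reduces to a plain sum of the task losses, which makes that a natural initialisation.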
