Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought
Yu Huo, Siyu Zhang, Kun Zeng, Haoyue Liu, Owen Lee, Junlin Chen, Yuquan Lu, Yifu Guo, Yaodong Liang, Xiaoying Tang

TL;DR
Shape-of-Thought introduces a visual chain-of-thought framework enabling progressive shape assembly in multimodal models, improving compositional reasoning without explicit geometric representations.
Contribution
It proposes a novel visual chain-of-thought method with a new dataset and benchmark, enhancing shape assembly and compositional reasoning in multimodal models.
Findings
Achieved 88.4% component numeracy accuracy
Reached 84.8% structural topology accuracy
Outperformed text-only baselines by around 20%
Abstract
Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints-notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework that enables progressive shape assembly via coherent 2D projections without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
Topics3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
