PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning

Dongxu Zhang; Yiding Sun; Pengcheng Li; Yumou Liu; Hongqiang Lin; Haoran Xu; Xiaoxuan Mu; Liang Lin; Wenbiao Yan; Ning Yang; Chaowei Fang; Juanjuan Zhao; Jihua Zhu; Conghui He; Cheng Tan

arXiv:2602.23945·cs.CV·March 2, 2026

Open Access

TL;DR

PointCoT introduces an explicit reasoning framework for 3D point cloud understanding in multimodal models, significantly improving geometric reasoning accuracy by generating structured rationales before final answers.

Contribution

The paper presents PointCoT, a novel multi-modal framework with explicit Chain-of-Thought reasoning and a large-scale benchmark for 3D geometric understanding.

Findings

01 Achieves state-of-the-art results on complex 3D reasoning tasks.

02 Demonstrates the effectiveness of explicit reasoning over implicit methods.

03 Constructs a large-scale benchmark with hierarchical CoT annotations.

Abstract

While Multimodal Large Language Models (MLLMs) demonstrate proficiency in 2D scenes, extending their perceptual intelligence to 3D point cloud understanding remains a significant challenge. Current approaches focus primarily on aligning 3D features with pre-trained models. However, they typically treat geometric reasoning as an implicit mapping process. These methods bypass intermediate logical steps and consequently suffer from geometric hallucinations. They confidently generate plausible responses that fail to ground in precise structural details. To bridge this gap, we present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data. We advocate for a "Look, Think, then Answer" paradigm. In this approach, the model is supervised to generate geometry-grounded rationales before predicting final answers. To facilitate this, we…

Taxonomy

Topics: 3D Shape Modeling and Analysis · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization