LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li, Liang-Yan Gui, Jim Fan, Jan Kautz, Yu-Xiong Wang, Zhiding Yu

TL;DR
LocateAnything3D introduces a vision-language model that performs multi-object 3D detection by predicting sequences of tokens, mimicking human reasoning, and achieves state-of-the-art results on the Omni3D benchmark.
Contribution
The paper presents a novel Chain-of-Sight (CoS) sequence formulation that lets VLMs perform 3D detection without specialized heads, improving open-vocabulary and zero-shot capabilities.
Findings
Achieves 38.90 AP_3D on Omni3D, surpassing the previous best by +13.98.
Generalizes zero-shot to unseen categories while remaining robust.
Uses a sequence-prediction approach that mirrors human reasoning in 3D detection.
Abstract
To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface…
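The ordering described in the abstract can be illustrated with a minimal sketch of how a target sequence might be serialized. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Object3D` class, the marker strings such as `<box2d>` and `<center>`, the two-decimal formatting, the single yaw angle standing in for rotation, the use of Euclidean distance for the near-to-far sort, and the choice to interleave each object's 2D box with its 3D box.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Object3D:
    """Hypothetical container for one annotated object (not the paper's format)."""
    label: str
    box2d: Tuple[int, int, int, int]      # x1, y1, x2, y2 in pixels
    center: Tuple[float, float, float]    # camera-frame x, y, z in meters
    dims: Tuple[float, float, float]      # width, height, length in meters
    yaw: float                            # one rotation angle in radians (simplification)


def camera_distance(obj: Object3D) -> float:
    """Euclidean distance from the camera origin to the object center."""
    return math.sqrt(sum(c * c for c in obj.center))


def fmt(values: Sequence[float]) -> str:
    """Format numbers with two decimals, separated by spaces."""
    return " ".join(f"{v:.2f}" for v in values)


def chain_of_sight(objects: List[Object3D]) -> str:
    """Serialize objects near-to-far; per object: 2D box, center, dims, rotation."""
    ordered = sorted(objects, key=camera_distance)  # across-object curriculum
    chunks = []
    for obj in ordered:
        chunks.append(
            f"<obj> {obj.label} "
            f"<box2d> {' '.join(str(v) for v in obj.box2d)} "  # 2D first
            f"<center> {fmt(obj.center)} "                      # then distance
            f"<dims> {fmt(obj.dims)} "                          # then size
            f"<rot> {fmt([obj.yaw])}"                           # then pose
        )
    return " ".join(chunks)


if __name__ == "__main__":
    scene = [
        Object3D("car", (300, 120, 420, 200), (1.0, 0.0, 10.0), (1.8, 1.5, 4.2), 0.3),
        Object3D("chair", (40, 200, 120, 330), (0.5, 0.0, 2.0), (0.5, 0.9, 0.5), 1.2),
    ]
    print(chain_of_sight(scene))
```

In this sketch the chair is emitted before the car because it is closer to the camera, and each object's fields appear in the 2D box, center, dimensions, rotation order. A real tokenizer would likely quantize coordinates into discrete tokens; that step is omitted here.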
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robot Manipulation and Learning
