LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

Yunze Man; Shihao Wang; Guowen Zhang; Johan Bjorck; Zhiqi Li; Liang-Yan Gui; Jim Fan; Jan Kautz; Yu-Xiong Wang; Zhiding Yu

arXiv:2511.20648 · cs.CV · February 24, 2026

TL;DR

LocateAnything3D introduces a vision-language model that performs multi-object 3D detection as next-token prediction, emitting a token sequence that mirrors how humans reason from images, and achieves state-of-the-art results on the Omni3D benchmark.

Contribution

The paper presents a Chain-of-Sight (CoS) sequence formulation that lets VLMs perform 3D detection without specialized heads, improving open-vocabulary and zero-shot capabilities.

Findings

01

Achieves 38.90 AP_3D on Omni3D, surpassing the previous best by 13.98 AP_3D

02

Generalizes robustly to unseen categories in a zero-shot setting

03

Uses a sequence prediction approach mirroring human reasoning in 3D detection

Abstract

To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface…
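
The ordering described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `Object3D` type, the field names, the step tuples, the single-angle rotation, and the use of Euclidean distance as the near-to-far key are all assumptions made for illustration, as is the choice to interleave each object's 2D box with its 3D attributes. The sketch only shows how a set of annotated objects could be serialized near-to-far, with each object written as 2D box, then center, dimensions, and rotation.

```python
import math
from dataclasses import dataclass


@dataclass
class Object3D:
    """Hypothetical annotation record; field names are illustrative only."""
    label: str
    box2d: tuple   # (x1, y1, x2, y2) in pixels
    center: tuple  # (x, y, z) in the camera frame
    dims: tuple    # (width, height, length)
    yaw: float     # rotation reduced to one angle (an assumption)


def distance_from_camera(obj: Object3D) -> float:
    """Euclidean distance of the 3D center from the camera origin."""
    return math.sqrt(sum(c * c for c in obj.center))


def chain_of_sight(objects: list) -> list:
    """Serialize objects near-to-far as (field, value) steps.

    Within each object the order is: label, 2D box, then the 3D
    factorization center -> dimensions -> rotation.
    """
    steps = []
    for obj in sorted(objects, key=distance_from_camera):
        steps.append(("label", obj.label))
        steps.append(("box2d", obj.box2d))
        steps.append(("center", obj.center))
        steps.append(("dims", obj.dims))
        steps.append(("rot", obj.yaw))
    return steps


if __name__ == "__main__":
    scene = [
        Object3D("car", (400, 200, 520, 280), (2.0, 0.5, 20.0), (1.8, 1.5, 4.2), 0.1),
        Object3D("chair", (100, 300, 220, 460), (-0.5, 0.3, 2.0), (0.5, 0.9, 0.5), 1.2),
    ]
    for step in chain_of_sight(scene):
        print(step)
```

In this sketch the nearer object (the chair) is serialized before the farther one (the car), regardless of input order. A real model would additionally quantize these values into tokens, which is outside what this excerpt specifies.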

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robot Manipulation and Learning