OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

Sheng-Yu Huang; Jaesung Choe; Yu-Chiang Frank Wang; Cheng Sun

arXiv:2601.09575 · cs.CV · January 15, 2026

TL;DR

OpenVoxel is a training-free method that groups and captions sparse voxels in 3D scenes using vision-language models, enabling open-vocabulary scene understanding without additional training and without CLIP/BERT text embeddings.

Contribution

It introduces a novel training-free approach for voxel grouping and captioning in 3D scenes, leveraging existing vision-language models for open-vocabulary tasks.

Findings

01. Outperforms recent methods in complex referring expression segmentation

02. Does not require training or embeddings from CLIP/BERT

03. Effectively builds scene maps with meaningful object groups

Abstract

We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that describe different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel also builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates…
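To make the idea of a captioned scene map with text-to-text search more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Group` record, the `select_voxels` helper, and the `keyword_overlap` scorer are all hypothetical names introduced for illustration, and the word-overlap scorer is only a toy stand-in for the MLLM that the abstract says performs the matching.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Group:
    """One entry of a hypothetical scene map: a voxel group and its caption."""
    group_id: int
    caption: str
    voxel_ids: List[int]


def keyword_overlap(query: str, caption: str) -> float:
    """Toy stand-in for an MLLM relevance judgment (assumption, not the paper's method).

    Returns the fraction of query words that also appear in the caption.
    """
    query_words = set(query.lower().split())
    caption_words = set(caption.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & caption_words) / len(query_words)


def select_voxels(
    scene_map: List[Group],
    query: str,
    scorer: Callable[[str, str], float] = keyword_overlap,
) -> Tuple[int, List[int]]:
    """Pick the group whose caption best matches the text query.

    The comparison is text-to-text: no embedding of the query or the
    captions is computed here, only the scorer's judgment on raw strings.
    """
    best = max(scene_map, key=lambda group: scorer(query, group.caption))
    return best.group_id, best.voxel_ids


# Made-up captions and voxel indices, for illustration only.
scene_map = [
    Group(0, "a red ceramic mug on the wooden table", [3, 4, 5]),
    Group(1, "a black office chair next to the desk", [10, 11]),
]

result = select_voxels(scene_map, "red mug")
print(result)
```

Passing the scorer as a parameter keeps the sketch independent of any particular model: a call to an MLLM that compares the query against each caption could replace `keyword_overlap` without changing `select_voxels`.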

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Hand Gesture Recognition Systems