TSP3D: Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding

Wenxuan Guo; Xiuwei Xu; Ziwei Wang; Jianjiang Feng; Jie Zhou; Jiwen Lu

arXiv:2502.10392·cs.CV·March 12, 2025

TL;DR

TSP3D introduces text-guided sparse voxel pruning for 3D visual grounding, improving both inference speed and accuracy by fusing scene and text features efficiently.

Contribution

The paper proposes text-guided pruning (TGP) and completion-based addition (CBA), two techniques for deep, efficient interaction between 3D scene representations and text features in a sparse convolutional framework.

Findings

01. Achieves top inference speed, doubling FPS over previous methods.

02. Surpasses previous state-of-the-art accuracy on multiple benchmarks.

03. Maintains negligible computational overhead with voxel completion.

Abstract

In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods struggle to meet the requirements of real-time inference because of their two-stage or point-based architectures. Inspired by the success of multi-level fully sparse convolutional architectures in 3D object detection, we aim to build a new 3D visual grounding framework following this technical route. However, because the 3D visual grounding task requires the 3D scene representation to interact deeply with text features, a sparse convolution-based architecture is inefficient for this interaction due to the large number of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA) to deeply fuse the 3D scene representation and text features efficiently through gradual region pruning and target completion. Specifically, TGP iteratively…
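
To make the idea of gradual, text-guided region pruning concrete, here is a toy sketch in plain Python. It is not the paper's implementation: the dictionary layout for voxels, the cosine-similarity scoring rule, the `keep_ratios` schedule, and the function names `prune_voxels` and `gradual_prune` are all assumptions chosen for illustration, and the completion step (CBA) is not modeled.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def prune_voxels(voxels, text_feat, keep_ratio):
    """Keep the fraction of voxels most similar to the text feature.

    voxels: dict mapping (x, y, z) -> feature vector (hypothetical layout).
    The similarity-based score is an assumption, not the paper's rule.
    """
    ranked = sorted(voxels, key=lambda c: cosine(voxels[c], text_feat), reverse=True)
    n_keep = max(1, math.ceil(len(ranked) * keep_ratio))
    return {c: voxels[c] for c in ranked[:n_keep]}

def gradual_prune(voxels, text_feat, keep_ratios):
    """Apply pruning once per level, shrinking the voxel set step by step."""
    for ratio in keep_ratios:
        voxels = prune_voxels(voxels, text_feat, ratio)
    return voxels

# Four occupied voxels with 2-D features and a 2-D text embedding (toy data).
scene = {
    (0, 0, 0): [1.0, 0.0],
    (1, 0, 0): [0.9, 0.1],
    (0, 1, 0): [0.0, 1.0],
    (2, 2, 1): [-1.0, 0.0],
}
text = [1.0, 0.0]
kept = gradual_prune(scene, text, keep_ratios=[0.5, 0.5])
print(sorted(kept))
```

Each level halves the surviving voxels here, so later stages operate on far fewer features than the full scene. That shrinking working set is the efficiency argument the abstract makes for pruning before fusing scene and text features.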

Code & Models

Repositories

gwxuan/tsp3d · PyTorch · Official

Models

🤗 gwx22/TSP3D · model · ♡ 2

Taxonomy

Topics: Hand Gesture Recognition Systems · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization

Methods: Convolution · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pruning