NeuroVoxel-LM: Language-Aligned 3D Perception via Dynamic Voxelization and Meta-Embedding
Shiyu Liu, Lianlei Shan

TL;DR
NeuroVoxel-LM introduces a novel framework combining dynamic voxelization and meta-embedding to enhance language-aligned 3D perception from large-scale point clouds, improving efficiency and semantic accuracy.
Contribution
The paper presents NeuroVoxel-LM, which integrates adaptive voxelization and lightweight meta-embedding to address the slow feature extraction and limited representation accuracy of existing 3D language models on sparse, large-scale point clouds.
Findings
DR-MSV improves feature extraction efficiency and accuracy
TAP-LME enhances semantic representation over max-pooling
Framework outperforms existing methods in 3D perception tasks
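To make the contrast between max-pooling and an adaptive, token-level pooling concrete, here is a minimal sketch. It is not the paper's TAP-LME mechanism, whose details are not given in this summary. It assumes a generic softmax-weighted pooling in which each token gets a scalar score from a hypothetical `score_vector` (learned in practice, fixed here). The function names and the toy token matrix are illustrative assumptions.

```python
import numpy as np


def max_pool(tokens: np.ndarray) -> np.ndarray:
    """Baseline: keep the per-channel maximum over all tokens.

    tokens has shape (num_tokens, dim); the result has shape (dim,).
    """
    return tokens.max(axis=0)


def softmax_weighted_pool(tokens: np.ndarray, score_vector: np.ndarray):
    """Generic adaptive pooling (an assumption, not the paper's exact method).

    Each token is scored by a dot product with score_vector, the scores are
    turned into weights with a softmax, and the tokens are averaged with
    those weights. Returns the pooled vector and the weights.
    """
    scores = tokens @ score_vector            # shape (num_tokens,)
    scores = scores - scores.max()            # stabilise the exponentials
    weights = np.exp(scores)
    weights = weights / weights.sum()         # weights sum to 1
    pooled = weights @ tokens                 # shape (dim,)
    return pooled, weights


# Toy example: three tokens with two feature channels each.
toy_tokens = np.array([[1.0, 0.0],
                       [0.0, 2.0],
                       [3.0, 1.0]])
uniform_pooled, uniform_weights = softmax_weighted_pool(toy_tokens, np.zeros(2))
print("max pool:", max_pool(toy_tokens))
print("weighted pool (zero scores):", uniform_pooled)
```

With an all-zero `score_vector` every token receives the same weight, so the weighted pool reduces to mean pooling. A non-zero score vector shifts weight toward tokens it scores highly, whereas max-pooling keeps only one value per channel and discards the rest.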
Abstract
Recent breakthroughs in Visual Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have significantly advanced 3D scene perception towards language-driven cognition. However, existing 3D language models struggle with sparse, large-scale point clouds due to slow feature extraction and limited representation accuracy. To address these challenges, we propose NeuroVoxel-LM, a novel framework that integrates Neural Radiance Fields (NeRF) with dynamic resolution voxelization and lightweight meta-embedding. Specifically, we introduce a Dynamic Resolution Multiscale Voxelization (DR-MSV) technique that adaptively adjusts voxel granularity based on geometric and structural complexity, reducing computational cost while preserving reconstruction fidelity. In addition, we propose the Token-level Adaptive Pooling for Lightweight Meta-Embedding (TAP-LME) mechanism, which enhances…
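The idea of adjusting voxel granularity to local complexity can be illustrated with a small sketch. This is not the paper's DR-MSV implementation. It assumes an octree-style recursive split and uses a hypothetical complexity proxy: the smallest eigenvalue of the point covariance inside a cell, normalised by the squared cell size. The function name `adaptive_voxelize` and the parameters `min_size`, `complexity_threshold`, and `min_points` are all illustrative assumptions.

```python
import numpy as np


def adaptive_voxelize(points, origin, size, min_size=0.125,
                      complexity_threshold=0.01, min_points=8):
    """Split a cubic cell into octants while its points look geometrically complex.

    Assumed complexity proxy (not taken from the paper): the smallest
    eigenvalue of the 3x3 point covariance, divided by size**2. It is close
    to zero for flat, planar regions and larger for volumetric clutter.
    Returns a list of leaf cells as (origin, size, point_count) tuples.
    """
    if len(points) == 0:
        return []

    should_split = False
    if len(points) >= min_points and size / 2 >= min_size:
        covariance = np.cov(points.T)                  # 3x3 matrix
        smallest = np.linalg.eigvalsh(covariance)[0]   # eigenvalues ascend
        should_split = smallest / size**2 > complexity_threshold

    if not should_split:
        return [(origin, size, len(points))]

    half = size / 2
    # Octant code per point: one bit per axis (x -> 4, y -> 2, z -> 1).
    bits = ((points - origin) >= half).astype(int)
    codes = bits[:, 0] * 4 + bits[:, 1] * 2 + bits[:, 2]

    leaves = []
    for code in range(8):
        mask = codes == code
        if not mask.any():
            continue  # skip empty octants, which keeps sparse clouds cheap
        offset = np.array([(code >> 2) & 1, (code >> 1) & 1, code & 1]) * half
        leaves += adaptive_voxelize(points[mask], origin + offset, half,
                                    min_size, complexity_threshold, min_points)
    return leaves


# Toy comparison: a flat sheet versus a cluttered volume in the unit cube.
rng = np.random.default_rng(0)
flat_cloud = rng.random((2000, 3))
flat_cloud[:, 2] = 0.5                      # all points on the plane z = 0.5
cluttered_cloud = rng.random((2000, 3))     # points fill the whole cube

unit_origin = np.zeros(3)
flat_leaves = adaptive_voxelize(flat_cloud, unit_origin, 1.0)
cluttered_leaves = adaptive_voxelize(cluttered_cloud, unit_origin, 1.0)
print(len(flat_leaves), len(cluttered_leaves))
```

In this sketch the flat sheet stays in a single coarse voxel while the cluttered volume is refined into many small ones, which is the qualitative behaviour the abstract describes: coarse cells where geometry is simple, fine cells where it is not. A real system would likely use a richer complexity measure and a learned feature extractor per cell.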
