# ImVoxelGNet: Image to voxels geometry-aware projection for multi-view RGB-based 3D object detection

**Authors:** Gang Xu, Biao Leng, Zhang Xiong

PMC · DOI: 10.1371/journal.pone.0320589 · 2025-05-19

## TL;DR

ImVoxelGNet improves multi-view RGB-based 3D object detection by capturing geometric relationships across image views more accurately during voxel projection, gaining up to 2.2% mAP on ScanNetV2.

## Contribution

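A framework that strengthens geometric perception in two steps: an expansion operation integrates pixel features into the voxel projection, and an implicit geometric perception structure then refines the resulting spatial features.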

## Key findings

- ImVoxelGNet improves mean average precision (mAP) by up to 2.2% on the ScanNetV2 multi-view 3D object detection dataset.
- The framework's integration of pixel features leads to more accurate spatial geometric learning.
- An implicit geometric perception structure refines spatial features and improves occupancy relationships in voxels.
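
To make the last finding concrete, here is a minimal sketch of one way occupancy could be learned from voxel features and written back into them. It is an illustration under stated assumptions, not the paper's implicit geometric perception structure: the function name `occupancy_gate`, the single linear layer (`w`, `b`), and the multiplicative gating are all hypothetical choices made for this example.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, mapping logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def occupancy_gate(voxel_feats, w, b):
    """Predict a per-voxel occupancy probability and use it to reweight features.

    voxel_feats: (X, Y, Z, C) voxel feature volume.
    w, b:        hypothetical learned parameters of a linear occupancy head,
                 with shapes (C,) and scalar.
    Returns the gated features (X, Y, Z, C) and the occupancy map (X, Y, Z).
    """
    logits = voxel_feats @ w + b          # one logit per voxel
    occ = sigmoid(logits)                 # occupancy probability per voxel
    gated = voxel_feats * occ[..., None]  # damp features of likely-empty voxels
    return gated, occ


# Usage: a 2x2x2 volume with 3 channels and a zero logit everywhere.
feats = np.ones((2, 2, 2, 3))
gated, occ = occupancy_gate(feats, w=np.ones(3), b=-3.0)
```

In a trained network the occupancy head would typically be a small 3D convolutional block rather than a per-voxel linear layer; the linear form is used here only to keep the sketch short.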

## Abstract

3D object detection based solely on image data presents a significant challenge in computer vision, primarily due to the need to integrate geometric perception processes derived from visual inputs. The key to overcoming this challenge lies in effectively capturing the geometric relationships across multiple viewpoints, thereby establishing strong geometric priors. Current methods commonly back-project voxels onto images to align voxel-pixel features, yet during this process, pixel features are insufficiently involved in learning, which reduces geometric perception accuracy and, consequently, detection performance. To address this limitation, we propose a novel network framework called ImVoxelGNet. This framework first integrates features projected onto pixels via an expansion operation, compensating for the pixel information inadequately utilized in traditional back-projection methods, thus enabling more precise learning of spatial geometric features. Additionally, we design an implicit geometric perception structure that further refines the spatial geometric features obtained after integrating image features, learning the occupancy relationships in spatial voxels and updating them within the spatial features. Finally, we generate the final prediction results by combining a detection head with 3D convolutions. Evaluation on the ScanNetV2 multi-view 3D object detection dataset demonstrates that ImVoxelGNet achieves a performance improvement of up to 2.2% in mean average precision (mAP). This improvement demonstrates the efficacy of our method in enhancing 3D object detection performance through improved geometric perception and comprehensive scene understanding. Code and data are available at https://github.com/xug-coder/ImVoxelGNet.
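
The conventional back-projection step that the abstract describes as the baseline can be sketched as follows. This is a minimal NumPy illustration, assuming a pinhole camera model, nearest-neighbour pixel sampling, and mean aggregation over the views in which a voxel is visible. It does not implement ImVoxelGNet's expansion operation, and the function names (`voxel_centers`, `backproject`, `aggregate_views`) are hypothetical rather than taken from the released code.

```python
import numpy as np


def voxel_centers(origin, voxel_size, grid_shape):
    """Return the (N, 3) world-space centers of a regular voxel grid."""
    axes = [np.arange(n) for n in grid_shape]
    idx = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    return (idx + 0.5) * voxel_size + np.asarray(origin, dtype=float)


def backproject(points_world, feat, K, T_cam_from_world):
    """Sample one view's 2D features at the projections of 3D points.

    points_world:     (N, 3) voxel centers in world coordinates.
    feat:             (H, W, C) image feature map.
    K:                (3, 3) pinhole intrinsics.
    T_cam_from_world: (4, 4) extrinsics (world to camera).
    Returns (N, C) sampled features and an (N,) boolean visibility mask.
    """
    H, W, C = feat.shape
    n = len(points_world)
    homo = np.concatenate([points_world, np.ones((n, 1))], axis=1)
    cam = (T_cam_from_world @ homo.T).T[:, :3]   # points in the camera frame
    z = cam[:, 2]
    in_front = z > 1e-6                          # discard points behind the camera
    uvw = (K @ cam.T).T
    safe_z = np.where(in_front, z, 1.0)          # avoid division by zero
    u = np.round(uvw[:, 0] / safe_z).astype(int)
    v = np.round(uvw[:, 1] / safe_z).astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros((n, C), dtype=feat.dtype)
    out[valid] = feat[v[valid], u[valid]]        # nearest-neighbour sampling
    return out, valid


def aggregate_views(points_world, views):
    """Average sampled features over the views in which each voxel is visible.

    views: non-empty list of (feat, K, T_cam_from_world) tuples.
    Returns (N, C) mean features and the (N,) per-voxel view count.
    """
    total = None
    count = np.zeros(len(points_world))
    for feat, K, T in views:
        sampled, valid = backproject(points_world, feat, K, T)
        sampled = sampled.astype(float)
        total = sampled if total is None else total + sampled
        count += valid
    return total / np.maximum(count, 1)[:, None], count
```

Each voxel here only reads the single pixel it projects to, so most pixels in a feature map never contribute to any voxel. That is one way to picture the under-use of pixel features that the abstract identifies as the limitation of back-projection.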

## Figures

43 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12088028/full.md

---
Source: https://tomesphere.com/paper/PMC12088028