GaussianFusionOcc: A Seamless Sensor Fusion Approach for 3D Occupancy Prediction Using 3D Gaussians
Tomislav Pavković, Mohammad-Ali Nikouei Mahani, Johannes Niedermayer, Johannes Betz

TL;DR
GaussianFusionOcc introduces a novel sensor fusion method using 3D Gaussians and deformable attention to improve 3D occupancy prediction in autonomous driving, offering better accuracy, scalability, and efficiency.
Contribution
It presents a new sensor fusion approach that employs semantic 3D Gaussians and deformable attention, enhancing prediction accuracy and computational efficiency over traditional dense grid methods.
Findings
Outperforms state-of-the-art models in accuracy
Demonstrates high scalability with multiple sensor types
Achieves faster inference with memory-efficient Gaussian representation
Abstract
3D semantic occupancy prediction is one of the crucial tasks of autonomous driving. It enables precise and safe interpretation and navigation in complex environments. Reliable predictions rely on effective sensor fusion, as different modalities can contain complementary information. Unlike conventional methods that depend on dense grid representations, our approach, GaussianFusionOcc, uses semantic 3D Gaussians alongside an innovative sensor fusion mechanism. Seamless integration of data from camera, LiDAR, and radar sensors enables more precise and scalable occupancy prediction, while 3D Gaussian representation significantly improves memory efficiency and inference speed. GaussianFusionOcc employs modality-agnostic deformable attention to extract essential features from each sensor type, which are then used to refine Gaussian properties, resulting in a more accurate representation of…
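To make the Gaussian representation concrete, here is a minimal sketch of how a set of semantic 3D Gaussians could be turned into per-voxel labels. It is not the authors' implementation: the abstract does not spell out the aggregation step, so this assumes axis-aligned Gaussians (no rotation), opacity-weighted summation of class scores, and a hypothetical `threshold` below which a voxel is treated as empty. All function and field names are illustrative.

```python
import math

def gaussian_weight(point, mean, scale):
    """Unnormalised density of an axis-aligned 3D Gaussian at `point`.

    Assumption: no rotation; `scale` holds one standard deviation per axis.
    """
    d2 = sum(((p - m) / s) ** 2 for p, m, s in zip(point, mean, scale))
    return math.exp(-0.5 * d2)

def voxel_semantics(center, gaussians, num_classes):
    """Sum the opacity-weighted class scores of all Gaussians at a voxel center."""
    scores = [0.0] * num_classes
    for g in gaussians:
        w = g["opacity"] * gaussian_weight(center, g["mean"], g["scale"])
        for c in range(num_classes):
            scores[c] += w * g["semantics"][c]
    return scores

def predict_label(center, gaussians, num_classes, empty_label=-1, threshold=1e-3):
    """Return the best-scoring class, or `empty_label` if nothing is nearby.

    `threshold` is a hypothetical cut-off chosen for this sketch.
    """
    scores = voxel_semantics(center, gaussians, num_classes)
    best = max(range(num_classes), key=lambda c: scores[c])
    return best if scores[best] > threshold else empty_label

# Two toy Gaussians: class 0 near the origin, class 1 four metres along x.
gaussians = [
    {"mean": (0.0, 0.0, 0.0), "scale": (0.5, 0.5, 0.5),
     "opacity": 0.9, "semantics": [1.0, 0.0]},
    {"mean": (4.0, 0.0, 0.0), "scale": (0.5, 0.5, 0.5),
     "opacity": 0.8, "semantics": [0.0, 1.0]},
]

label_near_first = predict_label((0.1, 0.0, 0.0), gaussians, 2)
label_near_second = predict_label((3.9, 0.0, 0.0), gaussians, 2)
label_far_away = predict_label((2.0, 20.0, 0.0), gaussians, 2)
```

Because only the Gaussians are stored, memory in this sketch grows with the number of Gaussians instead of the number of voxels, which is the kind of saving the abstract attributes to the representation.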
Peer Reviews
Decision · Submitted to ICLR 2026
1. While prior works like GaussianFormer demonstrated the efficiency of Gaussians, they were limited to single-modality inputs. This work creatively addresses that limitation. The core novel components (the "modality-agnostic Gaussian encoder" and the "seamless sensor fusion mechanism") represent a new and logical combination of existing ideas to solve a clear and present problem. 2. The authors conduct extensive testing on the nuScenes dataset, comparing GaussianFusionOcc not just against one ca
1. Introducing multi-modal input is not actually a novel paradigm for occupancy prediction, and the pipeline can be regarded as a multi-modal version of GaussianFormer, so the paper's contribution is only fair. 2. In the main results (Table 1), the C+L model achieves 30.21 mIoU. The C+L+R model achieves 30.37 mIoU. This improvement of 0.16 mIoU is negligible and well within the range of training noise, suggesting the radar adds no meaningful information in the general case. The
The proposed method effectively addresses the challenge of multimodal fusion for Gaussian-based 3D occupancy prediction. The overall pipeline is simple yet efficient, achieving a good trade-off between performance and computational cost. Experimental results further demonstrate the effectiveness of the multimodal fusion strategy.
However, I also have some concerns about this paper: (1) Although the proposed method effectively tackles Gaussian-based 3D occupancy prediction under a multimodal setting, the approach itself is rather trivial. The fusion strategy and the use of deformable attention have already been widely adopted in 3D object detection. As a top-conference submission, the work does not provide sufficient conceptual depth or novel insight, so I consider it below the ICLR acceptance bar. (2) The writing of this pa
Overall, the paper is clearly written and the direction is reasonable.
- i) The paper is well aligned with the ongoing shift and makes the natural next step: “what if we plug multi-sensor fusion into the Gaussian pipeline?” This is a reasonable research question.
- ii) The model design is easy to read and to implement.
- iii) The experiments cover several realistic sensor combinations
- i) The novelty over very close prior work is limited. In substance, the method looks like taking an existing Gaussian-based occupancy model, adding multi-sensor deformable attention in front of it, concatenating features, and keeping the existing splatting stage.
- ii) The method relies on fairly strong per-sensor encoders. Camera, LiDAR and radar branches all reuse good backbones. Because of this, it is hard to tell whether the gains over the baselines actually come from the proposed Gaussi
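The pipeline the reviewers describe (per-sensor deformable attention followed by feature concatenation) can be sketched roughly as follows. This is a simplified illustration, not the paper's code: it assumes 2D feature maps stored as nested lists, nearest-neighbour sampling in place of bilinear interpolation, and fixed `offsets` and `weights` where a real model would predict them per Gaussian and per modality. All names are hypothetical.

```python
def sample_nearest(feature_map, x, y):
    """Nearest-neighbour lookup in a 2D grid of feature vectors, clamped to bounds."""
    height, width = len(feature_map), len(feature_map[0])
    row = min(max(int(round(y)), 0), height - 1)
    col = min(max(int(round(x)), 0), width - 1)
    return feature_map[row][col]

def deformable_sample(feature_map, ref_xy, offsets, weights):
    """Weighted sum of features sampled at the reference point plus each offset."""
    dim = len(feature_map[0][0])
    out = [0.0] * dim
    for (dx, dy), w in zip(offsets, weights):
        feat = sample_nearest(feature_map, ref_xy[0] + dx, ref_xy[1] + dy)
        for i in range(dim):
            out[i] += w * feat[i]
    return out

def fuse_modalities(feature_maps, ref_xy, offsets, weights):
    """Apply the same sampling routine to every modality and concatenate the results.

    Modalities are visited in sorted name order so the output layout is stable.
    """
    fused = []
    for name in sorted(feature_maps):
        fused.extend(deformable_sample(feature_maps[name], ref_xy, offsets, weights))
    return fused

# Toy 2x2 feature maps with one channel per modality.
feature_maps = {
    "camera": [[[1.0], [2.0]], [[3.0], [4.0]]],
    "lidar": [[[10.0], [20.0]], [[30.0], [40.0]]],
}

# One Gaussian projected to grid position (0, 0), sampling itself and its right neighbour.
fused_feature = fuse_modalities(
    feature_maps, ref_xy=(0.0, 0.0), offsets=[(0, 0), (1, 0)], weights=[0.5, 0.5]
)
```

Adding a sensor in this sketch only means adding another entry to `feature_maps`; the sampling routine itself does not change, which is one reading of what "modality-agnostic" means here.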
Taxonomy
Topics: Robotics and Sensor-Based Localization · Video Surveillance and Tracking Methods · 3D Surveying and Cultural Heritage
