Geo-ConvGRU: Geographically Masked Convolutional Gated Recurrent Unit for Bird-Eye View Segmentation

Guanglei Yang; Yongqiang Zhang; Wanlong Li; Yu Tang; Weize Shang; Feng Wen; Hongbo Zhang; Mingli Ding

arXiv:2412.20171 · cs.CV · December 31, 2024

TL;DR

This paper introduces Geo-ConvGRU, a module that combines geographical masking with convolutional gated recurrent units to improve temporal dependency modeling in Bird's-Eye View (BEV) segmentation. It reports state-of-the-art results on the nuScenes dataset.

Contribution

The paper proposes Geo-ConvGRU. It replaces the 3D CNN layers in the temporal module with a ConvGRU and adds a geographical mask, with the goal of improving temporal modeling in Bird's-Eye View segmentation.
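
The summary does not say how the geographical mask is computed or where it enters the recurrence. The following is therefore only a minimal NumPy sketch under stated assumptions, not the authors' implementation. It uses a standard ConvGRU cell, in which each gate is a 2D convolution over the concatenated BEV input and hidden state. It adds a hypothetical binary BEV mask (1 = grid cell whose history is still valid, 0 = cell that should not carry history, for example one outside the previously observed area after ego-motion), and that mask zeroes the previous hidden state before the update. `conv2d_same`, `GeoConvGRUCell`, and `run_sequence` are illustrative names.

```python
import numpy as np


def conv2d_same(x, w, b):
    """'Same'-padded 2D cross-correlation.
    x: (C_in, H, W), w: (C_out, C_in, k, k) with odd k, b: (C_out,) -> (C_out, H, W)."""
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    H, W = x.shape[1:]
    out = np.zeros((c_out, H, W))
    for i in range(k):
        for j in range(k):
            patch = xp[:, i:i + H, j:j + W]  # shifted input, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out + b[:, None, None]


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GeoConvGRUCell:
    """ConvGRU cell with an optional (hypothetical) geographic mask on the hidden state."""

    def __init__(self, c_in, c_hid, k=3, seed=0):
        rng = np.random.default_rng(seed)
        shape = (c_hid, c_in + c_hid, k, k)
        self.w_z = rng.normal(0.0, 0.1, shape)  # update gate
        self.w_r = rng.normal(0.0, 0.1, shape)  # reset gate
        self.w_h = rng.normal(0.0, 0.1, shape)  # candidate state
        self.b_z = np.zeros(c_hid)
        self.b_r = np.zeros(c_hid)
        self.b_h = np.zeros(c_hid)

    def step(self, x, h_prev, mask=None):
        # Assumption: the mask (H, W) zeroes history in invalid BEV cells.
        if mask is not None:
            h_prev = h_prev * mask[None]
        xh = np.concatenate([x, h_prev], axis=0)
        z = sigmoid(conv2d_same(xh, self.w_z, self.b_z))
        r = sigmoid(conv2d_same(xh, self.w_r, self.b_r))
        xrh = np.concatenate([x, r * h_prev], axis=0)
        h_tilde = np.tanh(conv2d_same(xrh, self.w_h, self.b_h))
        # Gated blend of the old state and the candidate state.
        return (1.0 - z) * h_prev + z * h_tilde


def run_sequence(cell, frames, masks):
    """Fold a list of BEV feature maps (C_in, H, W) into one hidden state."""
    c_hid = cell.w_z.shape[0]
    H, W = frames[0].shape[1:]
    h = np.zeros((c_hid, H, W))
    for x, m in zip(frames, masks):
        h = cell.step(x, h, m)
    return h
```

With this mask placement, a masked cell starts from a blank history and depends only on the current frame. That is one plausible way a mask could keep stale temporal features from leaking into the prediction. The paper's own formulation may differ.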

Findings

1. Achieves state-of-the-art performance on the nuScenes dataset.
2. Models long-range temporal dependencies more effectively than 3D CNN layers.
3. Reduces noise in temporal features.

Abstract

Convolutional Neural Networks (CNNs) have significantly impacted various computer vision tasks; however, they inherently struggle to model long-range dependencies explicitly because convolution operates on local neighborhoods. Although Transformers have addressed long-range dependencies in the spatial dimension, the temporal dimension remains underexplored. In this paper, we first show that 3D CNNs are limited in capturing long-range temporal dependencies. While Transformers mitigate the corresponding issues in the spatial dimension, they considerably increase the number of parameters and reduce processing speed. To overcome these challenges, we introduce a simple yet effective module, the Geographically Masked Convolutional Gated Recurrent Unit (Geo-ConvGRU), tailored for Bird's-Eye View segmentation. Specifically, we substitute the 3D CNN layers with ConvGRU in the temporal…
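
The claim that 3D CNNs struggle with long-range temporal dependencies follows from receptive-field arithmetic. Each stride-1 temporal convolution with kernel size k_t widens the window of frames that can influence an output by dilation × (k_t − 1). A recurrent cell instead passes its hidden state through every step. The sketch below illustrates that arithmetic. The layer counts, kernel size, and sequence length are illustrative values, not the paper's configuration.

```python
def conv3d_temporal_reach(num_layers: int, kernel_t: int, dilation: int = 1) -> int:
    """Number of consecutive frames that can influence one output of a stack
    of stride-1 temporal convolutions (the temporal receptive field)."""
    return 1 + num_layers * dilation * (kernel_t - 1)


def recurrent_temporal_reach(num_frames: int) -> int:
    """A recurrent cell carries its hidden state through every step, so the
    final output can depend on every frame seen so far."""
    return num_frames


# Illustrative comparison: kernel_t = 3, 16-frame sequence.
for layers in (1, 2, 4):
    print(layers, conv3d_temporal_reach(layers, kernel_t=3), recurrent_temporal_reach(16))
```

Reach is only an upper bound on the dependency a model can exploit. Gating in a GRU helps information survive many steps but does not guarantee it, whereas a 3D CNN cannot see beyond its window at all without more depth, larger kernels, or dilation.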

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Image Retrieval and Classification Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Convolution · 3 Dimensional Convolutional Neural Network