MPTF-Net: Multi-view Pyramid Transformer Fusion Network for LiDAR-based Place Recognition
Shuyuan Li, Zihang Wang, Xieyuanli Chen, Wenkai Zhu, Xiaoteng Fang, Peizhou Ni, Junhao Yang, and Dong Kong

TL;DR
MPTF-Net is a multi-view pyramid Transformer fusion network that uses an NDT-based BEV encoding for LiDAR place recognition. It reports state-of-the-art results while running at real-time inference speed.
Contribution
The paper presents a multi-view, multi-scale pyramid Transformer fusion network whose NDT-based BEV encoding explicitly models local geometric structure to improve place recognition.
Findings
Achieves 96.31% Recall@1 on the nuScenes Boston split.
Maintains a real-time inference latency of 10.02 ms.
Outperforms existing methods on multiple datasets.
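For readers unfamiliar with the metric, Recall@1 in place recognition is commonly computed as the fraction of queries whose single nearest database descriptor comes from a location close enough to the query. The sketch below illustrates that common definition only. The function name, the cosine-similarity matching, and the 25 m default threshold are assumptions made for illustration, since this summary does not state the paper's evaluation protocol.

```python
import numpy as np

def recall_at_1(query_desc, db_desc, query_pos, db_pos, dist_thresh=25.0):
    """Fraction of queries whose top-1 retrieved database entry lies within
    dist_thresh metres of the query position.

    dist_thresh is a hypothetical default, not a value taken from the paper.
    """
    # L2-normalise descriptors so the dot product equals cosine similarity.
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    # Index of the most similar database descriptor for each query.
    top1 = np.argmax(q @ d.T, axis=1)
    # Ground-truth distance between each query and its retrieved place.
    geo_dist = np.linalg.norm(query_pos - db_pos[top1], axis=1)
    return float(np.mean(geo_dist <= dist_thresh))
```

This simplified version counts every query. Some benchmarks instead skip queries that have no true match anywhere in the database, which changes the denominator.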
Abstract
LiDAR-based place recognition (LPR) is essential for global localization and loop-closure detection in large-scale SLAM systems. Existing methods typically construct global descriptors from Range Images or BEV representations for matching. BEV is widely adopted due to its explicit 2D spatial layout encoding and efficient retrieval. However, conventional BEV representations rely on simple statistical aggregation, which fails to capture fine-grained geometric structures, leading to performance degradation in complex or repetitive environments. To address this, we propose MPTF-Net, a novel multi-view multi-scale pyramid Transformer fusion network. Our core contribution is a multi-channel NDT-based BEV encoding that explicitly models local geometric complexity and intensity distributions via Normal Distribution Transform, providing a noise-resilient structural prior. To effectively…
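As a rough illustration of what an NDT-style BEV encoding can look like, the sketch below fits a Gaussian (mean and covariance) to the points in each grid cell and stores a few derived statistics as channels. It is not the authors' implementation. The four channels (mean height, a smallest-eigenvalue ratio used as a stand-in for geometric complexity, intensity mean, intensity variance), the function name, and the grid parameters are all assumptions chosen for illustration.

```python
import numpy as np

def ndt_bev_encode(points, grid=(8, 8), extent=4.0, min_points=4):
    """points: (N, 4) array of x, y, z, intensity.
    Returns an (H, W, 4) BEV tensor. The channel layout is hypothetical:
    0 mean z, 1 eigenvalue-ratio complexity, 2 intensity mean, 3 intensity variance.
    """
    h, w = grid
    bev = np.zeros((h, w, 4), dtype=np.float32)
    xyz, intensity = points[:, :3], points[:, 3]
    # Map x and y in [-extent, extent) to integer cell indices.
    rows = np.floor((xyz[:, 0] + extent) / (2 * extent) * h).astype(int)
    cols = np.floor((xyz[:, 1] + extent) / (2 * extent) * w).astype(int)
    inside = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    cell_ids = rows[inside] * w + cols[inside]
    xyz, intensity = xyz[inside], intensity[inside]
    for cell in np.unique(cell_ids):
        mask = cell_ids == cell
        if mask.sum() < min_points:
            continue  # too few points for a stable covariance estimate
        pts = xyz[mask]
        # NDT step: model the cell's points as a 3D Gaussian.
        cov = np.cov(pts, rowvar=False)
        eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # ascending order
        # Near 0 for planar cells, up to 1/3 for isotropic scatter.
        complexity = eigvals[0] / max(eigvals.sum(), 1e-9)
        r, c = divmod(int(cell), w)
        bev[r, c] = (pts[:, 2].mean(), complexity,
                     intensity[mask].mean(), intensity[mask].var())
    return bev
```

Compared with plain per-cell statistics such as maximum height or point count, the covariance-based channel separates flat surfaces from cluttered structure within a cell, which is the kind of local geometric cue the abstract attributes to the NDT encoding.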
