TL;DR
This paper introduces a highly scalable parallel NMS algorithm optimized for embedded GPUs, significantly accelerating object detection post-processing by clustering thousands of detections in milliseconds.
Contribution
The paper presents a GPU-optimized parallel NMS kernel that handles thousands of detections per image and runs 14x-40x faster than learned NMS methods.
Findings
Clusters 1024 detections in ~1 ms on NVIDIA Tegra X1 and X2 GPUs.
Achieves 14x-40x speedup over state-of-the-art learned NMS methods.
Applies to other sequential NMS algorithms such as Soft-NMS and FeatureNMS.
Abstract
In the context of object detection, sliding-window classifiers and single-shot Convolutional Neural Network (CNN) meta-architectures typically yield multiple overlapping candidate windows with similar high scores around the true location of a particular object. Non-Maximum Suppression (NMS) is the process of selecting a single representative candidate within this cluster of detections, so as to obtain a unique detection per object appearing on a given picture. In this paper, we present a highly scalable NMS algorithm for embedded GPU architectures that is designed from scratch to handle workloads featuring thousands of simultaneous detections on a given picture. Our kernels are directly applicable to other sequential NMS algorithms such as FeatureNMS, Soft-NMS or AdaptiveNMS that share the inner workings of the classic greedy NMS method. The obtained performance results show that our…
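The abstract notes that Soft-NMS and similar methods share the inner workings of classic greedy NMS. As a rough illustration of that shared structure, the sketch below writes the sequential loop once and swaps only the rescoring rule. It is a plain CPU reference in Python, not the paper's GPU kernel; the names sequential_nms, hard_rescore and gaussian_rescore, the (x1, y1, x2, y2) box layout, and the default thresholds are all assumptions made for this example.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def hard_rescore(iou_threshold=0.5):
    """Classic greedy NMS: zero the score of any box overlapping too much."""
    return lambda score, overlap: 0.0 if overlap > iou_threshold else score


def gaussian_rescore(sigma=0.5):
    """Soft-NMS (Gaussian variant): decay the score instead of zeroing it."""
    return lambda score, overlap: score * math.exp(-(overlap ** 2) / sigma)


def sequential_nms(boxes, scores, rescore, score_threshold=0.001):
    """Shared sequential loop; `rescore` selects the NMS variant.

    Returns the indices of the kept boxes in the order they were selected.
    Boxes whose (possibly decayed) score falls below `score_threshold`
    are dropped.
    """
    scores = list(scores)  # work on a copy so the caller's list is untouched
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        # Select the highest-scoring box that is still a candidate.
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_threshold:
            break  # every remaining box scores even lower
        keep.append(best)
        # Rescore every other candidate against the box just selected.
        for i in remaining:
            scores[i] = rescore(scores[i], iou(boxes[best], boxes[i]))
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(sequential_nms(boxes, scores, hard_rescore(0.5)))      # [0, 2]
print(sequential_nms(boxes, scores, gaussian_rescore(0.5)))  # [0, 2, 1]
```

In the usage lines, the first two boxes overlap with an IoU of about 0.68, so the hard rule discards the second box while the Gaussian rule only lowers its score and keeps it last. The inner rescoring loop treats each candidate independently, which is the part of the loop that maps naturally onto parallel hardware.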
Taxonomy
Methods: FeatureNMS · Soft-NMS · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
