Work-Efficient Parallel Non-Maximum Suppression Kernels

David Oro; Carles Fernández; Xavier Martorell; Javier Hernando

arXiv:2502.00535 · cs.CV · February 4, 2025


TL;DR

This paper introduces a highly scalable parallel NMS algorithm optimized for embedded GPUs, significantly accelerating object detection post-processing by clustering thousands of detections in milliseconds.

Contribution

The paper presents a parallel NMS kernel for embedded GPUs, designed to handle thousands of simultaneous detections per image. It clusters 1024 detections in about 1 ms on NVIDIA Tegra X1 and X2 and reports a 14x-40x speedup over state-of-the-art learned NMS methods.

Findings

01

Clusters 1024 detections in about 1 ms on NVIDIA Tegra X1 and X2 GPUs.

02

Achieves 14x-40x speedup over state-of-the-art learned NMS methods.

03

Applies directly to other sequential NMS algorithms, such as FeatureNMS, Soft-NMS, and AdaptiveNMS, that share the inner workings of classic greedy NMS.
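
To illustrate why a variant such as Soft-NMS shares the structure of greedy NMS, here is a minimal sequential sketch in Python. It assumes the Gaussian score-decay form of Soft-NMS; the names `soft_nms`, `sigma`, and `score_threshold` are illustrative choices, and none of this is taken from the paper's GPU kernels.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes: Sequence[Box], scores: Sequence[float],
             sigma: float = 0.5,
             score_threshold: float = 0.001) -> List[Tuple[int, float]]:
    """Return (index, rescored value) pairs in selection order.

    Same outer loop as greedy NMS (pick the current maximum), but
    overlapping boxes have their scores decayed instead of being dropped.
    """
    scores = list(scores)              # work on a copy
    remaining = list(range(len(boxes)))
    kept: List[Tuple[int, float]] = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_threshold:
            break                      # everything left scores even lower
        kept.append((best, scores[best]))
        for j in remaining:
            overlap = iou(boxes[best], boxes[j])
            scores[j] *= math.exp(-(overlap ** 2) / sigma)  # Gaussian decay
    return kept
```

The only difference from the hard-threshold version is the update applied to overlapping boxes, which is consistent with the claim that kernels built for greedy NMS carry over to these variants.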

Abstract

In the context of object detection, sliding-window classifiers and single-shot Convolutional Neural Network (CNN) meta-architectures typically yield multiple overlapping candidate windows with similar high scores around the true location of a particular object. Non-Maximum Suppression (NMS) is the process of selecting a single representative candidate within this cluster of detections, so as to obtain a unique detection per object appearing on a given picture. In this paper, we present a highly scalable NMS algorithm for embedded GPU architectures that is designed from scratch to handle workloads featuring thousands of simultaneous detections on a given picture. Our kernels are directly applicable to other sequential NMS algorithms such as FeatureNMS, Soft-NMS or AdaptiveNMS that share the inner workings of the classic greedy NMS method. The obtained performance results show that our…
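
As a reference point for the classic greedy NMS method the abstract mentions, the following is a minimal sequential sketch in Python. It is not the paper's parallel kernel; the box layout `(x1, y1, x2, y2)`, the function names, and the default `iou_threshold` of 0.5 are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    # Visit candidates from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    suppressed = [False] * len(boxes)
    keep: List[int] = []
    for pos, i in enumerate(order):
        if suppressed[i]:
            continue
        keep.append(i)  # i is the representative of its cluster
        # Suppress every lower-scored box that overlaps i too much.
        for j in order[pos + 1:]:
            if not suppressed[j] and iou(boxes[i], boxes[j]) > iou_threshold:
                suppressed[j] = True
    return keep
```

For example, with boxes `(0, 0, 10, 10)`, `(1, 1, 11, 11)`, and `(50, 50, 60, 60)` scored 0.9, 0.8, and 0.7, the second box overlaps the first with IoU 81/119 (about 0.68) and is suppressed, so the call returns indices 0 and 2. The nested loop performs a quadratic number of pairwise overlap tests in the worst case, which is the work a GPU implementation would distribute across threads.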


Code & Models

Repositories

hertasecurity/gpu-nms
Official


Taxonomy

Methods: FeatureNMS · Soft-NMS · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings