UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

Xiaohan Ding; Yiyuan Zhang; Yixiao Ge; Sijie Zhao; Lin Song; Xiangyu Yue; Ying Shan

arXiv:2311.15599·cs.CV·March 19, 2024·34 cites

TL;DR

UniRepLKNet introduces a universal large-kernel ConvNet architecture guided by four design guidelines, achieving state-of-the-art results in domains such as vision, audio, and time series without modality-specific modifications.

Contribution

The paper presents four architectural guidelines for large-kernel ConvNets and demonstrates their effectiveness across multiple modalities, establishing their universal perception capabilities.

Findings

01 Achieves 88.0% ImageNet accuracy

02 Sets new state-of-the-art in audio and time-series tasks

03 Outperforms recent competitors in speed and accuracy

Abstract

Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention, but two unresolved and critical issues demand further investigation. 1) The architectures of existing large-kernel ConvNets largely follow the design principles of conventional ConvNets or transformers, while the architectural design for large-kernel ConvNets remains under-addressed. 2) As transformers have dominated multiple modalities, it remains to be investigated whether ConvNets also have a strong universal perception ability in domains beyond vision. In this paper, we contribute from two aspects. 1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep. Following such guidelines, our proposed…
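The phrase "see wide without going deep" can be made concrete with a little receptive-field arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes stride-1, dilation-1 convolutions, for which the receptive field along one axis is 1 plus the sum of (k - 1) over the stacked layers. The 13-wide kernel and the helper names `receptive_field` and `layers_needed` are assumptions chosen for the example.

```python
def receptive_field(kernel_sizes):
    """Receptive field along one axis of stacked stride-1, dilation-1 convs."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each layer widens the field by (k - 1)
    return rf


def layers_needed(target_rf, kernel_size=3):
    """Smallest number of stacked convs of `kernel_size` that reach `target_rf`."""
    layers = 0
    while receptive_field([kernel_size] * layers) < target_rf:
        layers += 1
    return layers


# One hypothetical 13-wide kernel versus a stack of 3-wide kernels.
single_large = receptive_field([13])
stacked_small = layers_needed(single_large, kernel_size=3)
print(f"one 13-wide layer covers {single_large} positions")
print(f"3-wide layers needed for the same coverage: {stacked_small}")
```

Under these assumptions, a single large-kernel layer covers the same span that takes several small-kernel layers to reach, which is the sense in which width is obtained without depth. This counts only the theoretical receptive field; it says nothing about the effective receptive field or about accuracy.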

Taxonomy

Topics: Advanced Neural Network Applications · Music and Audio Processing · Human Pose and Action Recognition

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings