Empowering In-Browser Deep Learning Inference on Edge Devices with Just-in-Time Kernel Optimizations

Fucheng Jia; Shiqi Jiang; Ting Cao; Wei Cui; Tianrui Xia; Xu Cao; Yuanchun Li; Deyu Zhang; Ju Ren; Yunxin Liu; Lili Qiu; Mao Yang

arXiv:2309.08978·cs.AI·July 9, 2024

Open Access

TL;DR

This paper introduces nnJIT, a system for in-browser deep learning inference on edge devices that uses just-in-time kernel optimization to significantly improve performance and reduce compilation overhead across diverse hardware.

Contribution

The paper presents nnJIT, a novel in-browser inference system with two key techniques that enable fast, optimized kernel generation tailored for Web and edge device constraints.

Findings

1. Achieves up to 8.2X faster inference.

2. Reduces kernel compilation costs by around 100X.

3. Supports diverse models and hardware platforms.

Abstract

The Web is increasingly becoming the primary platform for delivering AI services to edge devices, making in-browser deep learning (DL) inference more prominent. Nevertheless, the heterogeneity of edge devices, combined with the underdeveloped state of Web hardware acceleration practices, keeps current in-browser inference from reaching its full performance potential on target devices. To address this issue, this paper presents the pioneering in-browser inference system, nnJIT, which enables just-in-time (JIT) auto-generation of optimized computing kernels for edge devices. nnJIT is built on two novel techniques that significantly reduce kernel search and compilation overhead while improving performance: Tensor-Web Compiling Co-Design lowers compilation costs by around 100X by eliminating redundant and ineffective compiling passes; Web-Specific Lite Kernel Optimization Space…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Tensor decomposition and applications

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Residual Connection · Softmax · SentencePiece · Byte Pair Encoding · Layer Normalization · Gated Linear Unit