Empowering In-Browser Deep Learning Inference on Edge Devices with Just-in-Time Kernel Optimizations
Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang

TL;DR
This paper introduces nnJIT, a system for in-browser deep learning inference on edge devices that uses just-in-time kernel optimization to significantly improve performance and reduce compilation overhead across diverse hardware.
Contribution
The paper presents nnJIT, a novel in-browser inference system with two key techniques that enable fast, optimized kernel generation tailored for Web and edge device constraints.
Findings
Achieves up to 8.2X faster inference.
Reduces kernel compilation costs by around 100X.
Supports diverse models and hardware platforms.
Abstract
The Web is increasingly becoming the primary platform for delivering AI services to edge devices, making in-browser deep learning (DL) inference more prominent. Nevertheless, the heterogeneity of edge devices, combined with the underdeveloped state of Web hardware acceleration practices, prevents current in-browser inference from achieving its full performance potential on target devices. To address this issue, this paper presents the pioneering in-browser inference system, nnJIT, which enables just-in-time (JIT) auto-generation of optimized computing kernels for edge devices. nnJIT is built upon two novel techniques that significantly reduce kernel search and compilation overhead while markedly improving performance: Tensor-Web Compiling Co-Design lowers compiling costs by around 100X by eliminating redundant and ineffective compiling passes; Web-Specific Lite Kernel Optimization Space…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Tensor decomposition and applications
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Residual Connection · Softmax · SentencePiece · Byte Pair Encoding · Layer Normalization · Gated Linear Unit
