Why Should the Server Do It All?: A Scalable, Versatile, and Model-Agnostic Framework for Server-Light DNN Inference over Massively Distributed Clients via Training-Free Intermediate Feature Compression
Mingyu Sung, Suhwan Im, Daeho Bang, Il-Min Kim, Sangseok Yun, Jae-Mo Kang

TL;DR
SLICER is a retraining-free, model-agnostic framework that compresses intermediate features in distributed DNN inference, significantly reducing communication and server load while maintaining high task accuracy across vision and language models.
Contribution
It introduces a novel, training-free feature compression framework combining filtering, grouping, and adaptive quantization, enabling scalable, low-latency distributed inference without model retraining.
Findings
Up to 10x reduction in uplink communication volume.
Up to 4.4x decrease in server GPU time.
Maintains task accuracy within 0-3 percentage points of baseline.
Abstract
Modern DNNs often rely on edge-cloud model partitioning (MP), but widely used schemes fix shallow, static split points that underutilize edge compute and concentrate latency and energy on the server. The problem is exacerbated in autoregressive (AR) LLM inference, where per-token forward passes repeatedly generate bulky intermediate features (IFs). We introduce SLICER, a retraining-free, architecture-agnostic framework that compresses IFs to reduce both communication and server load in split computing. SLICER combines (i) asymmetric top-K filtering (ATKF) to sparsify low-magnitude activations, (ii) magnitude-splitting (MS) to group the remaining non-zeros into equal-cardinality blocks, and (iii) adaptive bit quantization (ABQ) that selects per-block bitwidths under a distortion budget. Across standard vision and LLM workloads (e.g., ImageNet/COCO; HellaSwag, PIQA, ARC-E/C, GSM8K,…
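The three stages named in the abstract can be illustrated with a small, pure-Python sketch. This is not the authors' implementation: the abstract does not spell out the details, so several choices here are assumptions made for illustration. In particular, "asymmetric" is read as separate top-K budgets for positive and negative activations, the quantizer is a plain uniform min/max quantizer, the distortion budget is a per-block mean-squared-error cap, and the names `atkf`, `magnitude_split`, `quantize_block`, `abq`, `k_pos`, `k_neg`, `max_mse`, and `choices` are all hypothetical.

```python
from typing import List, Sequence, Tuple


def atkf(x: List[float], k_pos: int, k_neg: int) -> List[float]:
    """Keep the k_pos largest positive and k_neg most negative entries; zero the rest.

    ASSUMPTION: "asymmetric" is interpreted as one budget per sign.
    """
    pos = sorted((i for i, v in enumerate(x) if v > 0), key=lambda i: -x[i])[:k_pos]
    neg = sorted((i for i, v in enumerate(x) if v < 0), key=lambda i: x[i])[:k_neg]
    keep = set(pos) | set(neg)
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def magnitude_split(x: List[float], n_blocks: int) -> List[List[int]]:
    """Sort non-zero indices by |value| and cut them into near-equal-size blocks."""
    nz = sorted((i for i, v in enumerate(x) if v != 0.0), key=lambda i: abs(x[i]))
    if not nz:
        return []
    size = -(-len(nz) // n_blocks)  # ceiling division
    return [nz[s:s + size] for s in range(0, len(nz), size)]


def quantize_block(vals: List[float], bits: int) -> List[float]:
    """Uniformly quantize to 2**bits levels over [min, max], then dequantize."""
    lo, hi = min(vals), max(vals)
    if hi == lo:
        return list(vals)
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / step) * step for v in vals]


def abq(
    x: List[float],
    blocks: List[List[int]],
    max_mse: float,
    choices: Sequence[int] = (2, 4, 8),
) -> List[Tuple[int, List[float]]]:
    """Per block, pick the smallest bitwidth whose MSE fits the budget.

    ASSUMPTION: if no candidate fits, the largest bitwidth is used.
    """
    out = []
    for idx in blocks:
        vals = [x[i] for i in idx]
        for bits in choices:
            deq = quantize_block(vals, bits)
            mse = sum((a - b) ** 2 for a, b in zip(vals, deq)) / len(vals)
            if mse <= max_mse:
                break
        out.append((bits, deq))
    return out


# Toy intermediate feature vector.
feature = [0.1, -0.2, 3.0, -4.0, 0.05, 2.5, -0.01, 1.0]
sparse = atkf(feature, k_pos=2, k_neg=2)
blocks = magnitude_split(sparse, n_blocks=2)
coded = abq(sparse, blocks, max_mse=1e-3)
```

Grouping values of similar magnitude before quantizing narrows each block's dynamic range, which is one plausible reason a per-block bitwidth can stay small under a fixed distortion budget.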
Peer Reviews
Decision · Submitted to ICLR 2026
+ The framework appears to be training-free and plug-and-play. + The formulation selects (w,ℓ) and compression knobs θ to meet latency/memory caps and stabilize bits-per-token and server time/token, addressing a real bottleneck for multi-client LLM services. + Results span CNNs/transformers, LLM AR decoding, and diffusion; multi-device scaling shows server time reductions up to 4.4× and sustained backend throughput as clients grow; vision achieves state-of-the-art BPP for IFs with negligible accu
- The specific edge device used is never stated in the paper. As a result, the latency measurement is unfair, since latency depends on the compute capability of the specific device. - The evaluation largely abstracts the backend queue and uses a parametric wireless model; end-to-end wall-clock time (with control-plane signaling for grid/codec metadata, CSR indices, and potential RANS entropy coding) is not dissected across diverse networks and straggler patterns. More real-network and
+ The method is validated on multiple domains, showing consistent performance improvements. + The authors have released the implementation, promoting transparency and facilitating future research.
- The paper does not compare SLICER against prior split-computing or communication-aware inference frameworks, such as BottleFit [1] and Frankensplit [2]. Including these baselines would better situate SLICER’s contributions. - The experiments use only a single client device (an NVIDIA Jetson AGX Orin) in the client–server pipeline. This setup is too limited to substantiate claims about scalability and generalization to multiple clients. - The hyperparameter selection process (e.g., sparsity level
1. It identifies a key problem in edge-cloud DNN inference and proposes a novel, training-free IF compression framework. 2. The proposed techniques (ATKF, MS, ABQ) are well-motivated and effectively reduce uplink volume and server load. 3. Scales in multi-device settings and is model-agnostic, making it versatile for various applications.
1. There is no analysis of the long-token-length setting for LLMs with SLICER, which may be affected by the proposed ATKF method. 2. There is no device-side profiling of computation and memory overhead. 3. Some variables seem arbitrary, such as the choice of equal block size in MS.
Taxonomy
Topics: Advanced Neural Network Applications · IoT and Edge/Fog Computing · Domain Adaptation and Few-Shot Learning
