# Model Slicing for Supporting Complex Analytics with Elastic Inference Cost and Resource Constraints

**Authors:** Shaofeng Cai, Gang Chen, Beng Chin Ooi, Jinyang Gao

arXiv: 1904.01831 · 2021-04-22

## TL;DR

This paper introduces model slicing, a training scheme enabling deep learning models to dynamically adjust their computational complexity during inference, supporting elastic resource usage under cost and resource constraints.

## Contribution

The paper proposes a novel model slicing technique that allows deep models to provide predictions within a specified resource budget without additional computational resources.

## Key findings

- Supports on-demand workload with elastic inference cost
- Enables dynamic adjustment of model complexity during inference
- Maintains high accuracy across different resource configurations

## Abstract

Deep learning models have been used to support analytics beyond simple aggregation, where deeper and wider models have been shown to yield great results. These models consume a huge amount of memory and computational operations. However, most large-scale industrial applications are constrained by their computational budget. In practice, the peak workload of an inference service can be 10x higher than the average case, with unpredictable extreme cases on top of that. Many computational resources may be wasted during off-peak hours, and the system may crash when the workload exceeds system capacity. How to support deep learning services under a dynamic workload cost-efficiently remains a challenging problem. In this paper, we address the challenge with a general and novel training scheme called model slicing, which enables deep learning models to provide predictions within a prescribed computational resource budget dynamically. Model slicing can be viewed as an elastic computation solution that requires no additional computational resources. Succinctly, each layer in the model is divided into groups, each a contiguous block of basic components (i.e., neurons in dense layers and channels in convolutional layers), and a partial order is imposed on these groups by enforcing that the groups participating in each forward pass always run from the first group to a dynamically determined rightmost group. Trained by dynamically indexing the rightmost group with a single parameter, the slice rate, the network is induced to build up group-wise and residual representations. Then, during inference, a sub-model with fewer groups can be readily deployed for efficiency, with computation roughly quadratic in the width controlled by the slice rate. Extensive experiments show that models trained with model slicing can effectively support on-demand workloads with elastic inference cost.
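
The slicing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It is a minimal pure-Python illustration under stated assumptions: the function names (`active_width`, `sliced_dense`, `dense_macs`), the rule for rounding the slice rate down to whole groups, and the equal group sizes are all hypothetical choices made here for clarity. It shows a dense layer that uses only its leading groups of neurons, and why the multiply-accumulate count scales roughly with the square of the slice rate when both input and output widths are sliced.

```python
import math


def active_width(full_width, slice_rate, num_groups):
    """Number of leading units kept for a given slice rate.

    Assumption (not from the paper): the layer is split into `num_groups`
    equal groups, and the slice rate is rounded down to whole groups,
    always keeping at least the first group.
    """
    if not 0.0 < slice_rate <= 1.0:
        raise ValueError("slice_rate must be in (0, 1]")
    if full_width % num_groups != 0:
        raise ValueError("full_width must be divisible by num_groups")
    group_size = full_width // num_groups
    # Small epsilon guards against floating-point error such as 0.3 * 10.
    groups_kept = max(1, math.floor(slice_rate * num_groups + 1e-9))
    return groups_kept * group_size


def sliced_dense(x, weights, bias, slice_rate, num_groups):
    """Dense-layer forward pass over the leading groups only.

    weights[j][i] connects input unit i to output unit j. Groups are
    always taken from the first one up to the rightmost active one.
    """
    n_out = active_width(len(weights), slice_rate, num_groups)
    n_in = active_width(len(weights[0]), slice_rate, num_groups)
    return [
        sum(weights[j][i] * x[i] for i in range(n_in)) + bias[j]
        for j in range(n_out)
    ]


def dense_macs(n_in, n_out, slice_rate, num_groups):
    """Multiply-accumulate count of a dense layer sliced on both sides."""
    return (active_width(n_in, slice_rate, num_groups)
            * active_width(n_out, slice_rate, num_groups))


if __name__ == "__main__":
    full = dense_macs(64, 64, 1.0, 8)
    for rate in (0.25, 0.5, 0.75, 1.0):
        macs = dense_macs(64, 64, rate, 8)
        print(f"slice rate {rate}: {macs} MACs ({macs / full:.4f} of full)")
```

With a slice rate of 0.5, the sliced 64-by-64 layer performs one quarter of the full layer's multiply-accumulates, which matches the roughly quadratic relationship the abstract describes. A training loop in this spirit would pick a slice rate per step so that every prefix of groups learns to work on its own; how that rate is scheduled is left out of this sketch.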

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.01831/full.md

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/1904.01831/full.md

## References

58 references — full list in the complete paper: https://tomesphere.com/paper/1904.01831/full.md

---
Source: https://tomesphere.com/paper/1904.01831