You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model
Shengkun Tang, Yaqing Wang, Zhenglun Kong, Tianchi Zhang, Yao Li, Caiwen Ding, Yanzhi Wang, Yi Liang, Dongkuan Xu

TL;DR
This paper introduces MuE, a novel dynamic early exiting strategy for unified vision-language models that adaptively skips encoder and decoder layers based on input complexity, significantly improving inference efficiency without sacrificing much accuracy.
Contribution
The paper proposes MuE, a flexible early exiting method for both encoder and decoder in unified vision-language models, enabling substantial inference speedup while maintaining high performance.
Findings
Reduces inference time by up to 50% on SNLI-VE and 40% on MS COCO.
Maintains 99% and 96% of original performance on SNLI-VE and MS COCO, respectively.
Demonstrates effectiveness of modality-specific layer skipping in unified models.
Abstract
Large-scale Transformer models bring significant improvements to various downstream vision-language tasks with a unified architecture. These performance improvements come with increasing model size, resulting in slow inference and increased serving cost. While certain predictions benefit from the full complexity of the large-scale model, not all inputs need the same amount of computation, so running every layer for every input can waste computational resources. To address this challenge, early exiting has been proposed to adaptively allocate computation according to input complexity and thereby improve inference efficiency. Existing early exiting strategies usually use the output confidence of intermediate layers as a proxy for input complexity when deciding whether to skip the following layers. However, such strategies cannot be applied to the encoder in the widely used unified architecture…
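The confidence-based strategy that the abstract attributes to existing work can be sketched roughly as follows. This is a minimal illustration of that baseline idea, not MuE itself and not the authors' code: the names `early_exit_forward`, `layers`, `classifier`, and the `threshold` value of 0.9 are all hypothetical, and a single shared classifier head is assumed for simplicity.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_forward(
    hidden: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    classifier: Callable[[Vector], Vector],
    threshold: float = 0.9,  # assumed value, chosen only for illustration
) -> Tuple[int, int]:
    """Run layers in order and stop once the prediction is confident enough.

    Returns (predicted_class, number_of_layers_actually_run).
    """
    if not layers:
        raise ValueError("at least one layer is required")

    probs: List[float] = []
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        # Hypothetical shared head that maps a hidden state to class logits.
        probs = softmax(classifier(hidden))
        # The top class probability acts as the proxy for input complexity.
        if max(probs) >= threshold:
            return probs.index(max(probs)), depth

    # No intermediate layer was confident enough: use the final layer's output.
    return probs.index(max(probs)), len(layers)
```

An input that the model finds easy crosses the threshold after a few layers, while a harder one runs deeper or uses the full stack. Note that this criterion needs a prediction at every layer, which is why, as the abstract points out, it does not carry over directly to the encoder.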
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Adam · Linear Layer · Dense Connections · Residual Connection · Byte Pair Encoding · Position-Wise Feed-Forward Layer
