Skipping Computations in Multimodal LLMs
Mustafa Shukor, Matthieu Cord

TL;DR
This paper investigates computational redundancy in Multimodal Large Language Models (MLLMs) during inference and proposes methods to skip or parallelize layers, reducing computation cost while largely preserving performance.
Contribution
The study introduces techniques to skip and parallelize computations in MLLMs, demonstrating substantial efficiency gains while retaining most of the original performance.
Findings
Significant computation can be avoided during inference, especially for VQA tasks.
Skipping computations during training can recover 97% of the original performance, even when a substantial number of layers are skipped.
Training smaller models can achieve performance comparable to larger models.
Abstract
Large Language Models (LLMs) have demonstrated remarkable success in both textual and multimodal domains. However, this success often comes with substantial computational costs, particularly when handling lengthy sequences of multimodal inputs. This has sparked many efforts focusing on enhancing efficiency during training and inference. In this study, we investigate the computation redundancy in Multimodal Large Language Models (MLLMs) during inference. We propose different methods to skip computations, such as skipping entire blocks, FFN layers, or self-attention (SA) layers. Additionally, we explore parallelizing certain layers, such as FFN and SA layers. Our findings validate that (1) a significant amount of computation can be avoided at inference time, especially for tasks such as Visual Question Answering (VQA); (2) skipping computations during training can recover 97% of the original…
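The skipping and parallelization options named in the abstract can be illustrated with a minimal sketch. It assumes a simplified residual block without normalization. The mode names, the `block_forward` and `run_model` helpers, and the toy `toy_sa` / `toy_ffn` stand-ins are hypothetical and are not taken from the paper. Which layers are skipped, and how often, is left to the caller because this summary does not specify it.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
SubLayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, used for the residual connections."""
    return [x + y for x, y in zip(a, b)]


def block_forward(x: Vector, sa: SubLayer, ffn: SubLayer, mode: str = "full") -> Vector:
    """Run one residual block under a given (hypothetical) compute mode.

    full       : h = x + SA(x), then h + FFN(h)   (standard sequential block)
    skip_block : identity, the whole block is skipped
    skip_sa    : only the FFN sub-layer runs
    skip_ffn   : only the SA sub-layer runs
    parallel   : SA and FFN read the same input: x + SA(x) + FFN(x)
    """
    if mode == "skip_block":
        return list(x)
    if mode == "skip_sa":
        return add(x, ffn(x))
    if mode == "skip_ffn":
        return add(x, sa(x))
    if mode == "parallel":
        return add(add(x, sa(x)), ffn(x))
    if mode == "full":
        h = add(x, sa(x))
        return add(h, ffn(h))
    raise ValueError(f"unknown mode: {mode}")


def run_model(
    x: Vector,
    blocks: Sequence[Tuple[SubLayer, SubLayer]],
    modes: Sequence[str],
) -> Vector:
    """Apply a stack of blocks, one compute mode per block."""
    for (sa, ffn), mode in zip(blocks, modes):
        x = block_forward(x, sa, ffn, mode)
    return x


# Toy stand-ins for real sub-layers (assumptions, for illustration only).
def toy_sa(x: Vector) -> Vector:
    """Mixes information across positions by returning the mean everywhere."""
    mean = sum(x) / len(x)
    return [mean for _ in x]


def toy_ffn(x: Vector) -> Vector:
    """Position-wise transform: scales each entry independently."""
    return [0.5 * v for v in x]


if __name__ == "__main__":
    blocks = [(toy_sa, toy_ffn)] * 4
    # Hypothetical schedule: run the first block fully, then alternate.
    modes = ["full", "skip_block", "parallel", "skip_ffn"]
    print(run_model([1.0, 3.0], blocks, modes))
```

In the parallel mode, the FFN no longer depends on the SA output, so the two sub-layers could in principle be computed concurrently. The skip modes remove the corresponding sub-layer computation entirely.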
Taxonomy
Topics: Natural Language Processing Techniques
