LinMU: Multimodal Understanding Made Linear
Hongjie Wang, Niraj K. Jha

TL;DR
LinMU introduces a linear-complexity multimodal understanding model that maintains high performance while significantly reducing computational costs, enabling efficient processing of high-resolution images and long videos.
Contribution
The paper proposes a novel M-MATE block and a three-stage distillation framework to transform pre-trained VLMs into linear-complexity models without sacrificing accuracy.
Findings
LinMU matches the performance of teacher models on multiple benchmarks.
It reduces Time-To-First-Token (TTFT) by up to 2.7×.
It improves token throughput by up to 9.0× on long videos.
Abstract
Modern Vision-Language Models (VLMs) achieve impressive performance but are limited by the quadratic complexity of self-attention, which prevents their deployment on edge devices and makes their understanding of high-resolution images and long-context videos prohibitively expensive. To address this challenge, we introduce LinMU (Linear-complexity Multimodal Understanding), a VLM design that achieves linear complexity for the language model decoder without using any quadratic-complexity modules while maintaining the performance of global-attention-based VLMs. LinMU replaces every self-attention layer in the language model decoder with an M-MATE block: a dual-branch module that combines a bidirectional state-space model for global context (Flex-MA branch) with localized Swin-style window attention (Local-Swin branch) for adjacent correlations. To transform a pre-trained VLM into the LinMU…
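As a rough illustration of the dual-branch idea described in the abstract, the NumPy sketch below pairs a bidirectional linear-time recurrence (a deliberately simplified stand-in for a state-space model) with non-overlapping local window attention. It is not the authors' M-MATE implementation. The function names, the fixed scalar `decay`, the absence of learned projections and window shifting, and the plain residual sum used to merge the branches are all assumptions made for illustration.

```python
import numpy as np


def window_attention(x, window):
    """Softmax attention restricted to non-overlapping windows.

    Each token attends only to tokens in its own window, so the cost is
    O(n * window) rather than O(n^2). No learned projections and no
    window shifting are used here (assumption for brevity).
    """
    n, d = x.shape
    out = np.empty_like(x)
    for start in range(0, n, window):
        blk = x[start:start + window]                 # (w, d)
        scores = blk @ blk.T / np.sqrt(d)             # (w, w)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:start + window] = weights @ blk
    return out


def linear_scan(x, decay):
    """One-pass recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t: O(n)."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1.0 - decay) * x[t]
        out[t] = h
    return out


def bidirectional_scan(x, decay=0.9):
    """Average a forward and a backward scan so every token sees both sides."""
    fwd = linear_scan(x, decay)
    bwd = linear_scan(x[::-1], decay)[::-1]
    return 0.5 * (fwd + bwd)


def dual_branch_block(x, window=4, decay=0.9):
    """Hypothetical merge: residual plus global (scan) and local (window) branches."""
    return x + bidirectional_scan(x, decay) + window_attention(x, window)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))   # 16 tokens, 8 features each
    print(dual_branch_block(tokens).shape)
```

Both branches touch each token a bounded number of times, so the total work grows linearly with sequence length for a fixed window size. The scan carries information across the whole sequence, while the window branch captures correlations between neighboring tokens.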
