LinMU: Multimodal Understanding Made Linear

Hongjie Wang; Niraj K. Jha

arXiv:2601.01322·cs.CV·May 5, 2026

TL;DR

LinMU introduces a linear-complexity multimodal understanding model that maintains high performance while significantly reducing computational costs, enabling efficient processing of high-resolution images and long videos.

Contribution

The paper proposes a novel M-MATE block and a three-stage distillation framework to transform pre-trained VLMs into linear-complexity models without sacrificing accuracy.

Findings

01. LinMU matches the performance of its teacher models on multiple benchmarks.

02. LinMU reduces Time-To-First-Token (TTFT) by up to 2.7×.

03. LinMU improves token throughput by up to 9.0× on long videos.

Abstract

Modern Vision-Language Models (VLMs) achieve impressive performance but are limited by the quadratic complexity of self-attention, which prevents their deployment on edge devices and makes their understanding of high-resolution images and long-context videos prohibitively expensive. To address this challenge, we introduce LinMU (Linear-complexity Multimodal Understanding), a VLM design that achieves linear complexity for the language model decoder without using any quadratic-complexity modules while maintaining the performance of global-attention-based VLMs. LinMU replaces every self-attention layer in the language model decoder with an M-MATE block: a dual-branch module that combines a bidirectional state-space model for global context (Flex-MA branch) with localized Swin-style window attention (Local-Swin branch) for adjacent correlations. To transform a pre-trained VLM into the LinMU…
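To make the quadratic-versus-linear contrast in the abstract concrete, here is a minimal counting sketch. It is not the authors' implementation. It only counts how much work each kind of module does as the sequence grows. The function names, the window size of 64, and the use of fixed non-overlapping windows (real Swin-style attention also shifts windows between layers) are assumptions made for illustration.

```python
def global_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full self-attention: grows as n^2."""
    return n_tokens * n_tokens


def window_attention_pairs(n_tokens: int, window: int) -> int:
    """Query-key pairs when each token attends only inside its own
    non-overlapping window (a simplification of Swin-style attention)."""
    full_windows, remainder = divmod(n_tokens, window)
    return full_windows * window * window + remainder * remainder


def state_space_steps(n_tokens: int, directions: int = 2) -> int:
    """State updates for a recurrent scan: one per token per direction
    (two directions for a bidirectional scan)."""
    return n_tokens * directions


if __name__ == "__main__":
    window = 64  # hypothetical window size, not taken from the paper
    for n in (1024, 2048, 4096):
        linear_work = window_attention_pairs(n, window) + state_space_steps(n)
        print(n, global_attention_pairs(n), linear_work)
```

Doubling the sequence length quadruples the global-attention count but only doubles the combined window-plus-scan count, which is the scaling behaviour the abstract attributes to replacing self-attention with a state-space branch and a local-window branch.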
