CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Shilin Yan, Jiaming Han, Joey Tsai, Hongwei Xue, Rongyao Fang, Lingyi Hong, Ziyu Guo, Ray Zhang

TL;DR
CrossLMM introduces a dual cross-attention mechanism to efficiently process long video sequences in large multimodal models, significantly reducing computational costs while maintaining high performance.
Contribution
It proposes a novel dual cross-attention framework that decouples long video sequences from LMMs, enabling efficient token reduction and improved multimodal understanding.
Findings
Achieves comparable or better performance with fewer computational resources.
Reduces visual tokens significantly through pooling and cross-attention.
Maintains fine-grained informational fidelity in long video processing.
Abstract
The advent of Large Multimodal Models (LMMs) has significantly enhanced Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratic computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, decoupling long video sequences from LMMs via a dual cross-attention mechanism, which substantially reduces visual token quantity with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention…
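To make the quadratic-cost claim concrete, here is a rough back-of-the-envelope sketch. It uses the per-frame token counts quoted in the reviews (729 before pooling, 9 after); the frame count of 64 is a hypothetical value chosen for illustration, and text tokens are ignored. The helper names are not from the paper.

```python
def video_tokens(frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens for a video, assuming a fixed count per frame."""
    return frames * tokens_per_frame


def attention_pairs(num_tokens: int) -> int:
    """Query-key score entries in one full self-attention layer (n * n)."""
    return num_tokens * num_tokens


FRAMES = 64  # hypothetical frame count, not taken from the paper

full = video_tokens(FRAMES, 729)   # tokens per frame before pooling
pooled = video_tokens(FRAMES, 9)   # tokens per frame after pooling

ratio_tokens = full // pooled
ratio_pairs = attention_pairs(full) // attention_pairs(pooled)

print(f"tokens: {full} -> {pooled} ({ratio_tokens}x fewer)")
print(f"self-attention pairs: {ratio_pairs}x fewer")
```

Because self-attention cost grows with the square of the sequence length, an 81x cut in tokens shrinks the pairwise score matrix by 81 squared under these assumptions.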
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The proposed dual attention mechanism makes sense, differs from earlier attempts that try to compress completely before the first LLM layer, and gives good results in practice. I've always wondered why the visual tokens were treated the exact same way as the textual tokens. Now that seems corrected. The method has been tested on diverse video benchmarks, including videos of various lengths.
1. The math doesn't add up in the section on 'initial token merge', where the use of bilinear pooling is said to reduce the dimensionality by a factor of 9, consistent with a 3x3 local patch aggregation, but then it's said this brings them from N=729 to N=9. I thought this was a typo, but later in the experiments the authors keep mentioning sizes of 1 / 9 / 16, so now I'm confused and don't know what the authors are actually doing in this phase. 2. One would expect that a more compact handlin
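The arithmetic behind the reviewer's first point can be checked with a small sketch. It assumes the 729 tokens form a square 27 x 27 grid, and it swaps in plain non-overlapping average pooling over scalar stand-ins for tokens; the paper's actual pooling operator may differ. The function name `avg_pool_grid` is hypothetical.

```python
def avg_pool_grid(grid, window):
    """Average-pool a square 2D grid with non-overlapping window x window blocks."""
    side = len(grid)
    assert side % window == 0, "window must divide the grid side evenly"
    out_side = side // window
    pooled = []
    for r in range(out_side):
        row = []
        for c in range(out_side):
            block = [
                grid[r * window + i][c * window + j]
                for i in range(window)
                for j in range(window)
            ]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


SIDE = 27  # assumption: 729 tokens arranged as a 27 x 27 grid
grid = [[float(r * SIDE + c) for c in range(SIDE)] for r in range(SIDE)]

pooled_3 = avg_pool_grid(grid, 3)  # 3x3 local aggregation
pooled_9 = avg_pool_grid(grid, 9)  # 9x9 local aggregation

tokens_3 = sum(len(row) for row in pooled_3)
tokens_9 = sum(len(row) for row in pooled_9)

print(f"3x3 window: 729 -> {tokens_3} tokens (factor {729 // tokens_3})")
print(f"9x9 window: 729 -> {tokens_9} tokens (factor {729 // tokens_9})")
```

Under these assumptions, a 3x3 window gives the factor-of-9 reduction (81 tokens), while reaching 9 tokens needs a 9x9 window, a factor of 81. A 16-token output (4 x 4) does not divide a 27-wide grid evenly, so it would require an interpolating or adaptive pooling scheme.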
* The paper is clearly written and easy to follow. * The idea is intuitive and well-presented. * Enhancing the alignment between visual and textual modalities after token reduction is a promising direction worthy of further exploration.
* While the use of both visual-to-visual and textual-to-visual cross-attention is intuitive and known to improve performance, it is not novel. Also, the primary efficiency gain of CrossLMM appears to stem from token pooling, which drastically reduces the number of visual tokens, not from the added attention modules. * Given that the core architectural idea is well established, a more rigorous experimental evaluation is expected. Yet, I believe the comparison between CrossLMM and the baselines is
- The proposed dual cross-attention mechanism is a well-motivated design, as it aims to compress visual information by considering both the global visual context (V2V) and the guiding semantic information from the text (T2V), which could lead to more informed and effective token selection. - The paper is well-written and easy to follow.
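As a rough illustration of the V2V and T2V idea described in this review, the sketch below runs generic single-head scaled dot-product cross-attention in plain Python. It is not the authors' implementation: there are no learned projections, the toy vectors are invented, and the choice of which tokens act as queries (pooled visual tokens for V2V, text tokens for T2V, both attending over the full visual token set) is an assumption.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention without projections."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out_dim = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(out_dim)]
        )
    return outputs


# Toy 2-dimensional tokens (illustrative values only).
full_visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
pooled_visual = [[0.625, 0.625]]  # mean of the four visual tokens
text = [[1.0, 0.0], [0.0, 1.0]]

# V2V: the pooled token queries the full visual tokens.
v2v = cross_attention(pooled_visual, full_visual, full_visual)
# T2V: each text token queries the full visual tokens.
t2v = cross_attention(text, full_visual, full_visual)

print("V2V output:", v2v)
print("T2V output:", t2v)
```

In this form the score matrix has size (number of queries) x (number of visual tokens), so with a handful of pooled or text queries the cost grows linearly, not quadratically, in the number of visual tokens.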
1. The core contribution of the paper is a dual cross-attention module that incorporates both global visual and textual information when compressing visual tokens. However, the idea of using cross-attention for visual token compression is not novel and has been explored in prior works, such as [1, 2, 3]. [1] RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words [2] Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models [3] LLAVA-M
1. The proposed approach is clearly explained. 2. It delivers performance on par with existing methods using far fewer visual tokens. 3. The method’s effectiveness is demonstrated across several benchmark datasets.
1. The proposed CrossLMM framework offers limited novelty, as the idea of a cross-attention-based language model has been widely explored in previous works. The paper does not clearly explain how CrossLMM provides a significant extension or meaningful differentiation from previous works. 2. The token compression is applied only within individual frames, which may overlook a key characteristic of the video modality: the redundancy of information across frames. It would be interesting to explore
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
