CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

Shilin Yan; Jiaming Han; Joey Tsai; Hongwei Xue; Rongyao Fang; Lingyi Hong; Ziyu Guo; Ray Zhang

arXiv:2505.17020·cs.CV·December 23, 2025

TL;DR

CrossLMM introduces a dual cross-attention mechanism to efficiently process long video sequences in large multimodal models, significantly reducing computational costs while maintaining high performance.

Contribution

It proposes a novel dual cross-attention framework that decouples long video sequences from LMMs, enabling efficient token reduction and improved multimodal understanding.

Findings

01. Achieves comparable or better performance with fewer computational resources.

02. Reduces visual tokens significantly through pooling and cross-attention.

03. Maintains fine-grained informational fidelity in long video processing.

Abstract

The advent of Large Multimodal Models (LMMs) has significantly enhanced the ability of Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratic computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, which decouples long video sequences from LMMs via a dual cross-attention mechanism and substantially reduces visual token quantity with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention…
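The two steps named in the abstract (pooling-based token reduction, then cross-attention over the visual tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes average pooling over a square token grid, a single attention head with no learned projections, and pooled tokens acting as queries over the full token set. The names `pool_tokens` and `cross_attention` are hypothetical.

```python
import numpy as np

def pool_tokens(tokens, grid, out):
    """Average-pool a (grid*grid, d) token map down to (out*out, d).

    Assumes a square grid in row-major order and non-overlapping windows.
    """
    if grid % out != 0:
        raise ValueError("grid must be divisible by out")
    d = tokens.shape[-1]
    k = grid // out  # side length of each pooling window
    blocks = tokens.reshape(out, k, out, k, d)
    return blocks.mean(axis=(1, 3)).reshape(out * out, d)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention without projections.

    Returns the attended output and the attention weights.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values, weights

# Hypothetical sizes: 729 tokens (27 x 27 grid) of width 64, pooled to 9 tokens.
rng = np.random.default_rng(0)
full = rng.standard_normal((27 * 27, 64))
pooled = pool_tokens(full, grid=27, out=3)
refined, weights = cross_attention(pooled, full)
```

In this sketch only the 9 pooled tokens would enter the language model's sequence, while the 729 original tokens are consulted through cross-attention, so the attention cost grows linearly rather than quadratically with the number of original visual tokens.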

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 4·Confidence 3

Strengths

The proposed dual attention mechanism makes sense, differs from earlier attempts that try to compress completely before the first LLM layer, and gives good results in practice. I've always wondered why the visual tokens were treated the exact same way as the textual tokens. Now that seems corrected. The method has been tested on diverse video benchmarks, including videos of various length.

Weaknesses

1. The math doesn't add up in the section on 'initial token merge', where the use of bilinear pooling is said to reduce the dimensionality by a factor of 9, consistent with a 3x3 local patch aggregation, but then it's said this brings them from N=729 to N=9. I thought this was a typo, but later in the experiments the authors keep mentioning sizes of 1 / 9 / 16, so now I'm confused and don't know what the authors are actually doing in this phase.
2. One would expect that a more compact handlin
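The token-count discrepancy raised in the first weakness can be checked with a few lines of arithmetic. The sketch below assumes the 729 tokens form a 27 x 27 grid (729 = 27²) and that pooling uses equal, non-overlapping square windows. Neither assumption is confirmed by the excerpt, and `pooled_token_count` is a hypothetical helper.

```python
import math

def pooled_token_count(num_tokens, window):
    """Tokens left after non-overlapping window x window pooling of a square grid."""
    side = math.isqrt(num_tokens)
    if side * side != num_tokens:
        raise ValueError("num_tokens must be a perfect square")
    if side % window != 0:
        raise ValueError("window must divide the grid side evenly")
    return (side // window) ** 2

after_3x3 = pooled_token_count(729, 3)  # 81 tokens: a 9x reduction
after_9x9 = pooled_token_count(729, 9)  # 9 tokens: an 81x reduction
```

Under these assumptions a 3x3 aggregation leaves 81 tokens, whereas reaching N=9 requires 9x9 windows, which matches the reviewer's observation that the two statements are inconsistent. A 16-token output (4 x 4) is not reachable with equal windows on a 27 x 27 grid, so that setting would need some form of adaptive pooling.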

Reviewer 02·Rating 2·Confidence 4

Strengths

* The paper is clearly written and easy to follow.
* The idea is intuitive and well-presented.
* Enhancing the alignment between visual and textual modalities after token reduction is a promising direction worthy of further exploration.

Weaknesses

* While the use of both visual-to-visual and textual-to-visual cross-attention is intuitive and known to improve performance, it is not novel. Also, the primary efficiency gain of CrossLMM appears to stem from token pooling, which drastically reduces the number of visual tokens, not from the added attention modules.
* Given that the core architectural idea is well established, a more rigorous experimental evaluation is expected. Yet, I believe the comparison between CrossLMM and the baselines is

Reviewer 03·Rating 4·Confidence 4

Strengths

- The proposed dual cross-attention mechanism is a well-motivated design, as it aims to compress visual information by considering both the global visual context (V2V) and the guiding semantic information from the text (T2V), which could lead to more informed and effective token selection.
- The paper is well-written and easy to follow.

Weaknesses

1. The core contribution of the paper is a dual cross-attention module that incorporates both global visual and textual information when compressing visual tokens. However, the idea of using cross-attention for visual token compression is not novel and has been explored in prior works, such as [1, 2, 3].
[1] RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words
[2] Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
[3] LLAVA-M

Reviewer 04·Rating 4·Confidence 4

Strengths

1. The proposed approach is clearly explained.
2. It delivers performance on par with existing methods using far fewer visual tokens.
3. The method’s effectiveness is demonstrated across several benchmark datasets.

Weaknesses

1. The proposed CrossLMM framework offers limited novelty, as the idea of a cross-attention-based language model has been widely explored in previous works. The paper does not clearly explain how CrossLMM provides a significant extension or meaningful differentiation from previous works.
2. The token compression is applied only within individual frames, which may overlook a key characteristic of the video modality: the redundancy of information across frames. It would be interesting to explore

Code & Models

Repositories

shilinyan99/crosslmm·Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning