PEVLM: Parallel Encoding for Vision-Language Models

Letian Kang; Shixian Luo; Yiqiang Li; Yuxin Yin; Shenxuan Zhou; Xiaoyang Yu; Jin Yang; Yong Wu

arXiv:2506.19651·cs.CV·July 30, 2025


Open Access 3 Reviews

TL;DR

PEVLM is a fine-tuning-free parallel encoding method that speeds up attention computation in vision-language models on long videos by up to 7.47x and cuts end-to-end latency by 40%, while maintaining or improving accuracy.

Contribution

The paper introduces PEVLM, a novel parallel encoding approach that reduces attention complexity in VLMs for long videos without sacrificing accuracy.

Findings

01. Achieves up to 7.47x speedup in attention computation.

02. Reduces end-to-end latency by 40%.

03. Maintains or surpasses full-attention accuracy in various benchmarks.

Abstract

Vision-Language Models (VLMs) have demonstrated strong capabilities in multimodal understanding and generation tasks. However, their application to long video understanding remains hindered by the quadratic complexity of standard attention mechanisms. In this work, we introduce PEVLM, a fine-tuning-free parallel encoding method designed to enhance the prefilling efficiency of VLMs in long video scenarios. PEVLM partitions the input video into context blocks with a shared sink block, while preserving sequential position embeddings to align the attention weight distribution with that of Full-Attention. This design reduces attention complexity from $O((T \times N)^{2})$ to $O(T \times N)$, where $T$ is the number of frames and $N$ the number of tokens per frame, without sacrificing accuracy. Extensive experiments across multiple state-of-the-art models and benchmarks demonstrate that…
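The complexity claim in the abstract can be illustrated by counting query-key pairs. The sketch below is not the authors' implementation. It assumes a simplified layout in which the first `sink_frames` frames form the shared sink block, the remaining frames are split into blocks of `frames_per_block` frames, and each block's tokens attend causally to the sink plus their own block only. The function names and parameters are hypothetical and chosen for illustration.

```python
def full_causal_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs under full causal attention over T*N tokens."""
    length = num_frames * tokens_per_frame
    return length * (length + 1) // 2


def blockwise_pairs(num_frames: int, tokens_per_frame: int,
                    frames_per_block: int, sink_frames: int = 1) -> int:
    """Query-key pairs under an ASSUMED sink-plus-block layout.

    Assumption (not taken from the paper): every context block attends
    causally to itself and to a shared sink block, never to other blocks.
    """
    sink_len = sink_frames * tokens_per_frame
    # The sink block attends causally to itself.
    pairs = sink_len * (sink_len + 1) // 2
    remaining = num_frames - sink_frames
    while remaining > 0:
        block_frames = min(frames_per_block, remaining)
        block_len = block_frames * tokens_per_frame
        # Each block token sees all sink tokens plus a causal window
        # over its own block.
        pairs += block_len * sink_len + block_len * (block_len + 1) // 2
        remaining -= block_frames
    return pairs


if __name__ == "__main__":
    tokens_per_frame = 16  # illustrative value, not from the paper
    for num_frames in (8, 16, 32, 64):
        full = full_causal_pairs(num_frames, tokens_per_frame)
        block = blockwise_pairs(num_frames, tokens_per_frame, frames_per_block=1)
        print(num_frames, full, block, round(full / block, 2))
```

With the block size and sink size held fixed, each additional block adds the same number of pairs, so the block-wise count grows linearly in $T \times N$ while the full-attention count grows quadratically. The ratio printed by the loop is a pair-count ratio only and should not be read as a wall-clock speedup.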

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. Solid implementation and reproducibility: The method is implemented within a production-grade serving framework (SGLang) and evaluated with well-specified setups, which enhances reproducibility.
2. Lightweight and deployment-friendly: PEVLM introduces no extra parameters or fine-tuning, relying solely on structural reorganization of attention computation, making it easily applicable to existing VLM pipelines.
3. Well-written and structured: The paper’s organization and figures make the meth

Weaknesses

1. Logic flow needs more clarification: The introduction states, “These application scenarios often demand processing longer video inputs.” While longer video inputs are indeed an important problem to address, the paper later mentions, “Although PEVLM achieves significant acceleration of inference, it is similarly limited to 256-frame inputs. Nonetheless, this trade-off is acceptable as our primary goal is to enhance inference efficiency rather than to expand context length.” This creates a conc

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The manuscript is clearly written and provides a well-articulated diagnosis of the failure modes in current parallel encoding strategies for long-video understanding. The proposed method is strongly motivated by this analysis, which enhances the interpretability and credibility of the design.
2. The experimental results are comprehensive and persuasive, showing that PEVLM achieves comparable or superior accuracy to full attention across multiple VLMs and long-video benchmarks, while also red

Weaknesses

1. The proposed method is limited to the prefill stage, which may restrict its applicability in broader long-video understanding scenarios, such as real-time or streaming video processing where continuous updates and causal inference are required.
2. The explanation for PEVLM’s superior performance over full attention in very long contexts—namely, that block-wise softmax mitigates degradation—is largely intuitive. The current analysis does not fully account for model-specific differences observ

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper is well-written, with the methodology and experimental results presented in a clear and systematic manner.
2. The research topic is highly practical, as introducing a training-free method that effectively reduces inference memory usage and latency provides significant value for real-world model deployment.
3. The experiments in the paper highlight the effectiveness of PEVLM, achieving a 40% reduction in latency while maintaining accuracy on several long-video benchmarks.

Weaknesses

1. The experimental evaluation is incomplete. While PEVLM is designed for long-video understanding, all the benchmarks focus solely on QA tasks. Token sparsification typically has limited impact on QA tasks; however, for tasks that rely on fine-grained visual details, such as video captioning or video OCR, it could introduce significant drawbacks. The authors should include results on such benchmarks to better demonstrate the method’s versatility across different task types.
2. The paper briefly

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques