PEVLM: Parallel Encoding for Vision-Language Models
Letian Kang, Shixian Luo, Yiqiang Li, Yuxin Yin, Shenxuan Zhou, Xiaoyang Yu, Jin Yang, Yong Wu

TL;DR
PEVLM is a parallel encoding method that significantly speeds up attention computation in vision-language models for long videos, reducing latency while maintaining or improving accuracy.
Contribution
The paper introduces PEVLM, a novel parallel encoding approach that reduces attention complexity in VLMs for long videos without sacrificing accuracy.
Findings
Achieves up to 7.47x speedup in attention computation.
Reduces end-to-end latency by 40%.
Maintains or surpasses full-attention accuracy in various benchmarks.
Abstract
Vision-Language Models (VLMs) have demonstrated strong capabilities in multimodal understanding and generation tasks. However, their application to long video understanding remains hindered by the quadratic complexity of standard attention mechanisms. In this work, we introduce PEVLM, a fine-tuning-free parallel encoding method designed to enhance the prefilling efficiency of VLMs in long video scenarios. PEVLM partitions the input video into context blocks with a shared sink block, while preserving sequential position embeddings to align the attention weight distribution with that of Full-Attention. This design reduces attention complexity from $O((T \times N)^2)$ to $O(T \times N)$, where $T$ is the number of frames and $N$ the number of tokens per frame, without sacrificing accuracy. Extensive experiments across multiple state-of-the-art models and benchmarks demonstrate that…
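The block structure described in the abstract can be illustrated with a small attention mask. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy sizes, the choice of the first frame as the sink block, and the two-frames-per-block split are all hypothetical. It builds a causal mask in which each token attends only to the shared sink block and to earlier tokens in its own context block, keeps position IDs sequential over the whole sequence, and compares the number of attended query-key pairs with full causal attention.

```python
def block_of(idx, sink_len, block_len):
    """Return -1 for sink tokens, else the 0-based context block index."""
    if idx < sink_len:
        return -1
    return (idx - sink_len) // block_len


def build_mask(seq_len, sink_len, block_len):
    """mask[q][k] is True when query token q may attend to key token k."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        q_block = block_of(q, sink_len, block_len)
        for k in range(q + 1):  # causal: only keys at or before the query
            k_block = block_of(k, sink_len, block_len)
            # Sink keys are shared by every block; other keys must be local.
            if k_block == -1 or k_block == q_block:
                mask[q][k] = True
    return mask


def count_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


# Hypothetical toy sizes, chosen only for illustration.
T, N = 8, 4            # frames, tokens per frame
seq_len = T * N        # 32 video tokens in total
sink_len = N           # assumption: the first frame acts as the sink block
block_len = 2 * N      # assumption: two frames per context block

# Position IDs stay sequential across blocks instead of restarting per block.
position_ids = list(range(seq_len))

block_mask = build_mask(seq_len, sink_len, block_len)
block_pairs = count_pairs(block_mask)
full_pairs = seq_len * (seq_len + 1) // 2  # full causal attention

print(block_pairs, full_pairs)  # prints: 240 528
```

With the sink and block lengths held fixed, each token attends to a bounded number of keys, so the pair count grows linearly with the sequence length rather than quadratically. This toy count says nothing about measured speedups, which depend on kernels and hardware.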
Peer Reviews
Decision · Submitted to ICLR 2026
1. Solid implementation and reproducibility: The method is implemented within a production-grade serving framework (SGLang) and evaluated with well-specified setups, which enhances reproducibility.
2. Lightweight and deployment-friendly: PEVLM introduces no extra parameters or fine-tuning, relying solely on structural reorganization of attention computation, making it easily applicable to existing VLM pipelines.
3. Well-written and structured: The paper’s organization and figures make the meth…

1. Logic flow needs more clarification: The introduction states, “These application scenarios often demand processing longer video inputs.” While longer video inputs are indeed an important problem to address, the paper later mentions, “Although PEVLM achieves significant acceleration of inference, it is similarly limited to 256-frame inputs. Nonetheless, this trade-off is acceptable as our primary goal is to enhance inference efficiency rather than to expand context length.” This creates a conc…

1. The manuscript is clearly written and provides a well-articulated diagnosis of the failure modes in current parallel encoding strategies for long-video understanding. The proposed method is strongly motivated by this analysis, which enhances the interpretability and credibility of the design.
2. The experimental results are comprehensive and persuasive, showing that PEVLM achieves comparable or superior accuracy to full attention across multiple VLMs and long-video benchmarks, while also red…

1. The proposed method is limited to the prefill stage, which may restrict its applicability in broader long-video understanding scenarios, such as real-time or streaming video processing where continuous updates and causal inference are required.
2. The explanation for PEVLM’s superior performance over full attention in very long contexts, namely that block-wise softmax mitigates degradation, is largely intuitive. The current analysis does not fully account for model-specific differences observ…

1. The paper is well-written, with the methodology and experimental results presented in a clear and systematic manner.
2. The research topic is highly practical, as introducing a training-free method that effectively reduces inference memory usage and latency provides significant value for real-world model deployment.
3. The experiments in the paper highlight the effectiveness of PEVLM, achieving a 40% reduction in latency while maintaining accuracy on several long-video benchmarks.

1. The experimental evaluation is incomplete. While PEVLM is designed for long-video understanding, all the benchmarks focus solely on QA tasks. Token sparsification typically has limited impact on QA tasks; however, for tasks that rely on fine-grained visual details, such as video captioning or video OCR, it could introduce significant drawbacks. The authors should include results on such benchmarks to better demonstrate the method’s versatility across different task types.
2. The paper briefly…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques
