Rethinking Causal Mask Attention for Vision-Language Inference
Xiaohuan Pei, Tao Huang, YanXiang Ma, Chang Xu

TL;DR
This paper investigates how different causal masking strategies impact vision-language inference and introduces a future-aware attention mechanism that improves context utilization while maintaining autoregressive structure.
Contribution
The authors propose a novel family of future-aware attentions that better leverage future visual context in vision-language models, addressing limitations of traditional rigid masking strategies.
Findings
Rigid masking hampers semantic context capture
Future-aware attention improves inference performance
Selective compression of future context benefits models
Abstract
Causal attention has become a foundational mechanism in autoregressive vision-language models (VLMs), unifying textual and visual inputs under a single generative framework. However, existing causal mask-based strategies are inherited from large language models (LLMs) where they are tailored for text-only decoding, and their adaptation to vision tokens is insufficiently addressed in the prefill stage. Strictly masking future positions for vision queries introduces overly rigid constraints, which hinder the model's ability to leverage future context that often contains essential semantic cues for accurate inference. In this work, we empirically investigate how different causal masking strategies affect vision-language inference and then propose a family of future-aware attentions tailored for this setting. We first empirically analyze the effect of previewing future tokens for vision…
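To make the masking idea concrete, here is a minimal sketch of how such prefill masks could be built. It is not the authors' implementation. The mode names follow the $M^f$, $M^{v2v}$, and $M^{v2t}$ labels mentioned in the reviews, but the exact definitions used here (vision queries additionally see all future tokens, only future vision tokens, or only future text tokens) are assumptions. The helper `build_mask` and the `is_vision` layout are hypothetical names introduced for illustration.

```python
from typing import List


def build_mask(is_vision: List[bool], mode: str = "causal") -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key k.

    Assumed (illustrative) modes:
      "causal": standard lower-triangular mask, as inherited from LLMs
      "f":      vision queries may additionally see every future token
      "v2v":    vision queries may additionally see future vision tokens
      "v2t":    vision queries may additionally see future text tokens
    Text queries always remain strictly causal.
    """
    if mode not in {"causal", "f", "v2v", "v2t"}:
        raise ValueError(f"unknown mode: {mode}")

    n = len(is_vision)
    # Start from the usual causal mask: position q sees positions 0..q.
    mask = [[k <= q for k in range(n)] for q in range(n)]
    if mode == "causal":
        return mask

    for q in range(n):
        if not is_vision[q]:
            continue  # only vision queries get a preview of the future
        for k in range(q + 1, n):
            allowed = (
                mode == "f"
                or (mode == "v2v" and is_vision[k])
                or (mode == "v2t" and not is_vision[k])
            )
            if allowed:
                mask[q][k] = True
    return mask


# Hypothetical prefill layout: one text token, three vision tokens, two text tokens.
layout = [False, True, True, True, False, False]
v2v_mask = build_mask(layout, "v2v")
```

In a real model the boolean mask would typically be converted to an additive mask (0 where attention is allowed, a large negative value elsewhere) before the softmax. Because text queries keep the lower-triangular pattern, autoregressive decoding after the prefill stage is unaffected in this sketch.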
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper identifies a fundamental problem: the misalignment between text-oriented causal attention and the non-sequential nature of visual processing in VLMs. 2. The authors conduct a large-scale, systematic analysis of multiple future-aware causal masking strategies ($M^f$, $M^{v2v}$, $M^{v2t}$) across diverse multimodal benchmarks. The clear, task-dependent findings (e.g., $M^{v2v}$ for visual reasoning, $M^{v2t}$ for text-rich QA) provide concrete and interpretable insights. 3. The paper is w
1. It would be helpful for the authors to clarify how the proposed future-aware masking strategy differs from existing approaches that also leverage bidirectional attention or cross-attention in vision-language or multimodal models. Specifically, how does this method compare conceptually and practically to (1) prior works that implement fully bidirectional attention over visual tokens[1], or (2) models that achieve modality alignment primarily through cross-attention mechanisms[2]? A discussion
1. The paper offers a fresh and well-motivated perspective on how causal masking—originally designed for textual decoding—may be suboptimal for vision tokens. This conceptual rethinking addresses a fundamental assumption in current VLMs and opens up a new line of research on modality-aware causality. 2. The proposed light future-aware attention introduces future context compression without retraining or architectural changes, adding negligible latency while delivering consistent gains. 3. Expe
Actually I really like the insight this paper focuses on — questioning how VLMs can break free from the traditional causal attention inherited from language models. The idea is intuitively sound and genuinely interesting. However, given the paper’s current state, I cannot yet recommend acceptance. **I strongly encourage the authors to carefully revise the work, as it has great potential**. My main concerns are as follows: 1. The paper currently provides an investigation and an inference-only so
- The paper compellingly identifies a fundamental misalignment between the sequential, autoregressive nature of LLM-native causal masks and the more holistic, non-sequential nature of visual information processing. I like the motivation. - The authors systematically evaluate different masking strategies ($M^f$, $M^{v2v}$, $M^{v2t}$) and connect their benefits to specific categories of vision-language tasks (e.g., temporal reasoning, visual relation, text-rich QA), providing a nuanced understandi
- The main text lacks an Ethics statement and a Reproducibility statement. - The paper demonstrates that different future-aware masks (e.g., $M^{v2v}$ vs. $M^{v2t}$) are optimal for different tasks. This raises a practical question: how would a single general model choose the correct mask without a priori knowledge of the downstream task? The paper does not propose a dynamic or learned mechanism for this selection. - The experiments are conducted by modifying LLaVA. It doesn't explore how t
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Softmax · Attention Is All You Need
