Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

Xialie Zhuang; Zhikai Jia; Jianjin Li; Zhenyu Zhang; Li Shen; Zheng Cao; Shiwei Liu

arXiv:2502.07490·cs.CL·March 16, 2026

TL;DR

MEAP is a training paradigm that integrates masked language modeling into autoregressive prediction, significantly improving large language models' ability to retrieve key information and perform long-context reasoning without extra computational costs.

Contribution

We introduce MEAP, a novel training method that combines MLM with NTP in decoder-only Transformers, enhancing retrieval and reasoning capabilities efficiently.

Findings

01. MEAP outperforms standard NTP on key information retrieval tasks.

02. MEAP improves long-context reasoning performance.

03. Fine-tuning with MEAP outperforms fine-tuning with NTP by 11.77% in lost-in-the-middle scenarios.

Abstract

Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard next-token prediction autoregressively using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Intensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on…
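The two steps named in the abstract (mask a small fraction of input tokens, then run ordinary next-token prediction) can be sketched as a data-preparation function. This is a minimal illustration rather than the authors' implementation: the function name `meap_example`, the mask id `MASK_ID`, the default `mask_ratio` of 0.15, and the choice to keep the original (unmasked) tokens as targets at every position are all assumptions made here for illustration.

```python
import random
from typing import List, Optional, Tuple

MASK_ID = 0  # hypothetical id reserved for the mask token (assumption)


def meap_example(
    token_ids: List[int],
    mask_ratio: float = 0.15,  # assumed value; the abstract only says "a small fraction"
    mask_id: int = MASK_ID,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int]]:
    """Build one (inputs, targets) pair for masked next-token prediction."""
    rng = rng or random.Random()
    n = len(token_ids)

    # Step 1: choose a random subset of positions and replace them with the mask id.
    num_masked = int(n * mask_ratio)
    masked_positions = set(rng.sample(range(n), num_masked))
    corrupted = [
        mask_id if i in masked_positions else tok
        for i, tok in enumerate(token_ids)
    ]

    # Step 2: standard next-token prediction on the corrupted sequence.
    # The input at position i is used to predict the original token at i + 1
    # (keeping original tokens as targets is an assumption of this sketch).
    inputs = corrupted[:-1]
    targets = token_ids[1:]
    return inputs, targets


if __name__ == "__main__":
    tokens = list(range(1, 21))  # toy sequence of 20 token ids
    x, y = meap_example(tokens, mask_ratio=0.2, rng=random.Random(0))
    print("inputs: ", x)
    print("targets:", y)
```

Because only the input sequence changes, a decoder-only model with its usual causal attention mask and cross-entropy loss can consume these pairs unmodified, which is consistent with the abstract's claim that no bidirectional attention or encoder-decoder architecture is needed.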

Code & Models

Repositories

scitix/MEAP (PyTorch, official)

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification

Methods: Attention Is All You Need · Label Smoothing · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam