Unveiling Markov Heads in Pretrained Language Models for Offline Reinforcement Learning

Wenhao Zhao; Qiushui Xu; Linjie Xu; Lei Song; Jinyu Wang; Chunlai Zhou; Jiang Bian

arXiv:2409.06985·cs.LG·June 10, 2025

TL;DR

This paper analyzes the role of Markov heads in pretrained language models (PLMs) used for offline reinforcement learning, shows that these heads perform well only in short-term environments, and proposes GPT2-DTMA to improve performance in long-term environments.

Contribution

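It identifies Markov heads as a crucial component among the attention heads of PLMs used for RL, proves that their extreme attention cannot be changed by re-training the embedding layer or fine-tuning, and introduces GPT2-DTMA, which equips a pretrained DT with Mixture of Attention (MoA), to improve long-term decision-making.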

Findings

01. Markov heads place extreme attention on the last input token

02. This extreme attention cannot be altered by re-training the embedding layer or by fine-tuning

03. GPT2-DTMA improves performance in long-term environments

Abstract

Recently, incorporating knowledge from pretrained language models (PLMs) into decision transformers (DTs) has attracted significant attention in offline reinforcement learning (RL). These PLMs perform well in RL tasks, raising an intriguing question: what kind of knowledge from PLMs is transferred to RL to achieve such good results? This work first dives into this problem by analyzing each head quantitatively and identifies the Markov head, a crucial component found among the attention heads of PLMs. A Markov head places extreme attention on the last input token and performs well only in short-term environments. Furthermore, we prove that this extreme attention cannot be changed by re-training the embedding layer or by fine-tuning. Inspired by our analysis, we propose a general method, GPT2-DTMA, which equips a pretrained DT with Mixture of Attention (MoA) to accommodate diverse attention…
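The notion of extreme attention on the last input token can be made concrete with a small sketch. The code below is not the paper's quantitative criterion for a Markov head; it is a minimal illustration that assumes "last input token" means the most recent position visible to each query under a causal mask. The helper names (`causal_attention`, `last_token_mass`) and the score matrices are hypothetical and chosen only for illustration.

```python
import math

def causal_attention(scores):
    """Row-wise softmax over a square score matrix under a causal mask."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = scores[i][: i + 1]              # query i sees positions 0..i only
        m = max(row)                          # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

def last_token_mass(weights):
    """Average attention weight each query places on its most recent position."""
    n = len(weights)
    return sum(weights[i][i] for i in range(n)) / n

# Hypothetical pre-softmax scores (illustrative values, not taken from the paper).
n = 5
diagonal_heavy = [[8.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
uniform = [[0.0] * n for _ in range(n)]

markov_like = last_token_mass(causal_attention(diagonal_heavy))
diffuse = last_token_mass(causal_attention(uniform))
print(f"diagonal-heavy head: {markov_like:.3f}")
print(f"uniform head:        {diffuse:.3f}")
```

Under these assumptions, a head whose scores are dominated by the diagonal puts almost all of its attention on the most recent token, while a head with uniform scores spreads attention over the whole visible history. A per-head statistic of this kind is one plausible way to compare heads, though the paper's own analysis may differ.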

Taxonomy

Topics: Neural Networks and Applications

Methods: Softmax · Attention Is All You Need