Order-Level Attention Similarity Across Language Models: A Latent Commonality
Jinglin Liang, Jin Zhong, Shuangping Huang, Yunqing Hu, Huiyuan Zhang, Huifang Li, Lixin Fan, and Hanlin Gu

TL;DR
This paper investigates whether attention patterns are shared across different language models, finds that they are similar at the order level, and introduces an adapter that exploits this similarity to transfer to unseen models without parameter updates.
Contribution
It introduces Order-Level Attention (OLA) to analyze attention similarities across LMs and proposes the Transferable OLA Adapter (TOA) for effective cross-LM transfer without additional training.
Findings
OLA shows significant similarity across different LMs at the same order.
The TOA method improves performance on unseen LMs without parameter updates.
Cross-LM transfer is effectively enhanced using OLA-based features.
Abstract
In this paper, we explore an important yet previously neglected question: Do context aggregation patterns across Language Models (LMs) share commonalities? While some works have investigated context aggregation or attention weights in LMs, they typically focus on individual models or attention heads, lacking a systematic analysis across multiple LMs to explore their commonalities. In contrast, we focus on the commonalities among LMs, which can deepen our understanding of LMs and even facilitate cross-model knowledge transfer. In this work, we introduce the Order-Level Attention (OLA) derived from the order-wise decomposition of Attention Rollout and reveal that the OLA at the same order across LMs exhibits significant similarities. Furthermore, we discover an implicit mapping between OLA and syntactic knowledge. Based on these two findings, we propose the Transferable OLA Adapter (TOA),…
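The abstract describes OLA as an order-wise decomposition of Attention Rollout but does not spell out the construction here. As a rough sketch only, assume the common rollout form, a product over layers of (0.5·A_l + 0.5·I), where A_l is a row-stochastic attention matrix for layer l (head-averaged, which is an assumption). Expanding that product and grouping terms by how many attention matrices they contain gives one matrix per order k. The function names and the normalization by the binomial coefficient C(N, k) below are illustrative choices, not taken from the paper.

```python
from math import comb

import numpy as np


def attention_rollout(attentions):
    """Rollout under the assumed form: product over layers of (0.5*A + 0.5*I)."""
    n = attentions[0].shape[0]
    rollout = np.eye(n)
    for a in attentions:
        # Later layers multiply on the left.
        rollout = (0.5 * a + 0.5 * np.eye(n)) @ rollout
    return rollout


def order_level_attention(attentions):
    """Hypothetical helper: split the rollout expansion by order k.

    Order k collects every product that uses exactly k of the N attention
    matrices (layer order preserved). Each sum is divided by C(N, k), an
    assumed normalization that keeps every returned matrix row-stochastic.
    """
    num_layers = len(attentions)
    n = attentions[0].shape[0]
    # terms[k] accumulates the sum of all k-factor products seen so far.
    terms = [np.eye(n)] + [np.zeros((n, n)) for _ in range(num_layers)]
    for a in attentions:
        # Go from high k to low k so terms[k - 1] still holds the value
        # from before this layer was processed.
        for k in range(num_layers, 0, -1):
            terms[k] = terms[k] + a @ terms[k - 1]
    return [terms[k] / comb(num_layers, k) for k in range(num_layers + 1)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = []
    for _ in range(4):
        raw = rng.random((6, 6))
        layers.append(raw / raw.sum(axis=1, keepdims=True))  # rows sum to 1
    ola = order_level_attention(layers)
    print(len(ola), ola[1].shape)
```

Under these assumptions, the weighted sum of the order-k matrices, with weights C(N, k) / 2^N, reproduces the rollout exactly, so the decomposition loses no information. It only separates contributions by how many attention hops they involve, which is what would allow the same order to be compared across models of different depth.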
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
