DAPE V2: Process Attention Score as Feature Map for Length Extrapolation

Chuanyang Zheng; Yihang Gao; Han Shi; Jing Xiong; Jiankai Sun; Jingyao Li; Minbin Huang; Xiaozhe Ren; Michael Ng; Xin Jiang; Zhenguo Li; Yu Li

arXiv:2410.04798·cs.CL·October 11, 2024

TL;DR

This paper reinterprets Transformer attention as a feature map and applies convolution to improve length extrapolation, leading to significant performance gains and revealing new avenues for model evolution.

Contribution

It introduces a novel perspective of viewing attention as a feature map and applies convolution to address length extrapolation issues in Transformers.
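
The feature-map view can be written out in symbols. This is a sketch under assumed notation, not the paper's exact formulation: $X$, $W_Q$ and $W_K$ follow the usual Transformer convention, while $T$ (sequence length), $d$ (per-head dimension), $H$ (number of heads), $W_V$ and the residual form of the last line are assumptions made here for illustration.

```latex
% Per-head attention scores from the key-query product
A^{(h)} = \frac{\bigl(X W_Q^{(h)}\bigr)\bigl(X W_K^{(h)}\bigr)^{\top}}{\sqrt{d}}
          \in \mathbb{R}^{T \times T}, \qquad h = 1, \dots, H

% Stack the heads: an H-channel "image" of size T x T
\mathcal{A} = \bigl[A^{(1)}, \dots, A^{(H)}\bigr] \in \mathbb{R}^{H \times T \times T}

% Refine the scores with a convolution before the softmax (assumed residual form)
\tilde{\mathcal{A}} = \mathcal{A} + \mathrm{Conv}(\mathcal{A}), \qquad
\mathrm{Attn}^{(h)} = \mathrm{softmax}\bigl(\tilde{A}^{(h)}\bigr)\, X W_V^{(h)}
```

Read this way, the heads play the role that channels play in a computer-vision feature map, which is what allows a convolution to be applied to the scores.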

Findings

01 · Enhanced Transformer performance with convolution-based attention processing

02 · Identified length extrapolation as a feature map problem

03 · Potential for further evolution of Transformer architectures

Abstract

The attention mechanism is a fundamental component of the Transformer model, contributing to interactions among distinct tokens, in contrast to earlier feed-forward neural networks. In general, the attention scores are determined simply by the key-query products. However, an incidental experiment in this work (combining DAPE and NoPE), which adds MLPs on attention scores without position encoding, indicates that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply the convolution operator (for neighboring attention scores across different heads) to mimic the processing methods in computer vision. Specifically, the main contribution of this paper is identifying and interpreting the Transformer length extrapolation problem as a result of the limited expressiveness of the naive query and key…
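
To make the idea in the abstract concrete, here is a minimal NumPy sketch of attention whose pre-softmax scores are treated as an H-channel feature map and refined by a 1 x k convolution along the key axis. It is an illustration under assumptions, not the authors' implementation: the function names, the residual connection, the left-only padding, the absence of any positional bias, and the kernel layout `(H_out, H_in, k)` are all choices made here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; rows containing -inf entries are handled.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def conv_over_keys(scores, kernel):
    """Mix heads and neighbouring key positions with a 1 x k convolution.

    scores: (H, T, T) pre-softmax scores, one T x T map per head.
    kernel: (H_out, H_in, k) weights; heads act like image channels.
    Padding is on the left only (an assumption), so the output at key j
    depends on keys j-k+1 .. j of the same query row.
    """
    h_out, h_in, k = kernel.shape
    T = scores.shape[-1]
    padded = np.pad(scores, ((0, 0), (0, 0), (k - 1, 0)))
    out = np.zeros((h_out,) + scores.shape[1:])
    for tap in range(k):
        window = padded[:, :, tap:tap + T]          # (H_in, T, T)
        out += np.einsum("oh,hqt->oqt", kernel[:, :, tap], window)
    return out

def attention_with_conv(q, k, v, kernel):
    """Causal attention over q, k, v of shape (H, T, d); returns (H, T, d)."""
    H, T, d = q.shape
    scores = np.einsum("hqd,hkd->hqk", q, k) / np.sqrt(d)
    scores = scores + conv_over_keys(scores, kernel)  # assumed residual form
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)        # mask future keys
    return softmax(scores, axis=-1) @ v
```

With an all-zero kernel the function reduces to plain causal attention, which makes it easy to compare the two variants on the same inputs. Because the padding is one-sided and the mask is applied after the convolution, outputs at earlier positions do not depend on later tokens.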

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

**Originality & Significance:**
- The authors slightly expand on the insights of the original DAPE paper and provide new results on the original and their improved variant

**Quality:**
- Experiments conducted across two datasets in comparison with multiple popular ‘positional embedding’ methods, including NoPE, RoPE, CoPE, ALiBi, Kerple and FiRE
- Insights into how the computational complexity is affected are provided, as well as results for three model sizes

**Clarity:**
- The paper

Weaknesses

_TL;DR: While I appreciate the work the authors have put into the manuscript and their experiments, the main ‘methodological’ novelty facilitating the approach has already been presented in the original DAPE paper. The authors’ addition of using a convolution instead of an MLP (i.e. replacing a 1x1 conv with a 1x3 conv, combined with inconsistent improvements) combined with the manuscript in its current state is in my opinion not enough to pass the bar for ICLR;_
- Minor ‘methodological’ additi

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- The paper is clearly written.
- Extending the MLP in DAPE’s attention model to 1x3 convolution achieves improvement over multiple experiments.
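
The step from an MLP to a 1x3 convolution that the reviews refer to can be illustrated with a small NumPy check. It is a sketch under assumptions: only a single linear layer of the MLP is modelled (no nonlinearity), padding is on the left only, and the names `headwise_linear` and `conv1xk` are hypothetical. It shows that a linear map across heads applied at every (query, key) position is the same operation as a 1x1 convolution, and that a 1x3 kernel contains it as a special case while adding weights for two neighbouring keys.

```python
import numpy as np

def headwise_linear(scores, weight):
    """Per-position linear map across heads: (H_in, T, T) -> (H_out, T, T)."""
    return np.einsum("oh,hqt->oqt", weight, scores)

def conv1xk(scores, kernel):
    """1 x k convolution along the key axis with left-only zero padding.

    kernel: (H_out, H_in, k); the last tap aligns with the current key.
    """
    h_out, h_in, k = kernel.shape
    T = scores.shape[-1]
    padded = np.pad(scores, ((0, 0), (0, 0), (k - 1, 0)))
    out = np.zeros((h_out,) + scores.shape[1:])
    for tap in range(k):
        out += np.einsum("oh,hqt->oqt", kernel[:, :, tap],
                         padded[:, :, tap:tap + T])
    return out

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 6, 6))     # 4 heads, sequence length 6
weight = rng.normal(size=(4, 4))        # one linear layer across heads

# A 1x1 kernel reproduces the per-position linear layer exactly.
kernel_1x1 = weight[:, :, None]
same_as_linear = np.allclose(conv1xk(scores, kernel_1x1),
                             headwise_linear(scores, weight))

# A 1x3 kernel with only its last tap set is the same map again;
# filling the other taps lets neighbouring key positions contribute.
kernel_1x3 = np.zeros((4, 4, 3))
kernel_1x3[:, :, -1] = weight
reduces_to_linear = np.allclose(conv1xk(scores, kernel_1x3),
                                headwise_linear(scores, weight))
```

In this sketch the 1x3 kernel holds three times as many weights as the 1x1 kernel (`kernel_1x3.size` versus `kernel_1x1.size`), which is the extra capacity the wider kernel adds.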

Weaknesses

- The major concern is this paper only introduces an incremental change over DAPE, i.e. extending the MLP in attention model to 1x3 convolution. In addition, compared to the gap between DAPE and other baselines, the gap between this paper and DAPE is relatively small.
- This paper could be written in a more straightforward way, by directly showing the difference between it and DAPE, and highlighting why it is crucial. Readers may have confusion about the contribution of this paper and DAPE.
-

Reviewer 03 · Rating 6 · Confidence 4

Strengths

+ The application of convolution on attention maps to improve relative position encoding in large language models is both novel and inspiring. The use of convolution, a fast and efficient operation, allows for seamless integration into existing frameworks.
+ The proposed method demonstrates strong performance on the length extrapolation task, outperforming established techniques such as RoPE and NoPE, which underscores its effectiveness.

Weaknesses

- The paper suffers from poor writing and organizational structure. Basic variables such as X, W_Q, and W_K are not adequately explained as the context, despite that Transformers are quite popular.
- Confusing Arguments:
  1) Line 181-182 states: "The result of DAPE-NoPE (the Zheng et al. (2024) only combine DAPE with ALiBi, Kerple and FIRE but not with NoPE or RoPE)." This sentence is confusing and seems disconnected from the preceding context.
  2) Line 191-192 mentions: "potentially hindering

Taxonomy

Topics: Manufacturing Process and Optimization · Fault Detection and Control Systems · Advanced Statistical Process Monitoring

Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings