Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers

Qian Chen; Wen Wang; Qinglin Zhang; Siqi Zheng; Shiliang Zhang; Chong Deng; Hai Yu; Jiaqing Liu; Yukun Ma; Chong Zhang

arXiv:2406.11274·cs.CL·June 18, 2024

Open Access

TL;DR

This paper proposes Skip-Layer Attention, a novel mechanism that allows direct attention between non-adjacent Transformer layers, improving the model's ability to capture complex dependencies and enhancing language modeling performance.

Contribution

The paper introduces Skip-Layer Attention, enabling direct inter-layer attention in Transformers, which improves dependency modeling without extra computational cost.

Findings

01 Enhanced language modeling performance

02 Better capture of abstract and detailed dependencies

03 Improved multi-head attention diversity

Abstract

The Transformer architecture has significantly advanced deep learning, particularly in natural language processing, by effectively managing long-range dependencies. However, as the demand for understanding complex relationships grows, refining the Transformer's architecture becomes critical. This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models by enabling direct attention between non-adjacent layers. This method improves the model's ability to capture dependencies between high-level abstract features and low-level details. By facilitating direct attention between these diverse feature levels, our approach overcomes the limitations of current Transformers, which often rely on suboptimal intra-layer attention. Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current…
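
The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is truncated here, so the second source of keys and values is assumed to be a single earlier layer, in line with the TL;DR. The head split (`num_skip_heads`), the function names, and the omission of learned projections, masking, and residual connections are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention for one head; each argument is a list of vectors."""
    scale = math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs

def split_heads(states, num_heads):
    """Slice each hidden vector into num_heads equal chunks, one list of vectors per head."""
    head_dim = len(states[0]) // num_heads
    return [[vec[h * head_dim:(h + 1) * head_dim] for vec in states] for h in range(num_heads)]

def skip_layer_attention(current, earlier, num_heads, num_skip_heads):
    """Hypothetical sketch: queries always come from `current`; the last
    `num_skip_heads` heads take keys/values from `earlier`, the rest from `current`.
    Learned Q/K/V projections are omitted (assumption: identity projections)."""
    cur_heads = split_heads(current, num_heads)
    old_heads = split_heads(earlier, num_heads)
    per_head = []
    for h in range(num_heads):
        use_earlier = h >= num_heads - num_skip_heads
        source = old_heads[h] if use_earlier else cur_heads[h]
        per_head.append(attend(cur_heads[h], source, source))
    # Concatenate the head outputs back into one vector per position.
    return [[x for head in per_head for x in head[pos]] for pos in range(len(current))]

# Toy hidden states: 2 positions, hidden size 4, from two different layers.
current = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, 0.5]]
earlier = [[0.2, 0.8, 1.0, 0.0], [0.9, 0.1, 0.0, 1.0]]
out = skip_layer_attention(current, earlier, num_heads=2, num_skip_heads=1)
```

In this sketch each head still performs exactly one attention pass, so the number of attention computations matches ordinary multi-head attention. Setting `num_skip_heads=0` reduces it to plain self-attention over `current`.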

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Parallel Computing and Optimization Techniques

Methods: Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer