Wavelet-based Positional Representation for Long Context

Yui Oka; Taku Hasegawa; Kyosuke Nishida; Kuniko Saito

arXiv:2502.02004 · cs.CL · February 5, 2025

TL;DR

This paper introduces a wavelet-based positional encoding method for large language models that enhances long sequence extrapolation by capturing multiple scales without restricting attention, outperforming existing methods.

Contribution

We propose a novel wavelet transform-based position representation that captures multi-scale information, improving long context handling in language models.

Findings

1. Improved performance in both short and long contexts.
2. Enhanced extrapolation of position information.
3. Outperforms existing position encoding methods.

Abstract

In the realm of large-scale language models, a significant challenge arises when extrapolating sequences beyond the maximum allowable length. This is because the model's position embedding mechanisms are limited to positions encountered during training, thus preventing effective representation of positions in longer sequences. We analyzed conventional position encoding methods for long contexts and found the following characteristics. (1) When the representation dimension is regarded as the time axis, Rotary Position Embedding (RoPE) can be interpreted as a restricted wavelet transform using Haar-like wavelets. However, because it uses only a fixed scale parameter, it does not fully exploit the advantages of wavelet transforms, which capture the fine movements of non-stationary signals using multiple scales (window sizes). This limitation could explain why RoPE performs poorly in…
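The abstract's point about scale can be illustrated with a minimal sketch. This is not the paper's method or code: it is a generic discrete Haar wavelet response in plain Python, and the function names (`haar`, `wavelet_coefficient`, `multi_scale_response`) and the toy step signal are assumptions made purely for illustration. It shows how a window of one fixed scale can miss a change that a larger window picks up.

```python
import math


def haar(t):
    """Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0


def wavelet_coefficient(signal, scale, shift):
    """Discrete approximation of W(a, b) = a^(-1/2) * sum_t x[t] * psi((t - b) / a).

    `scale` (a) sets the window size and `shift` (b) sets where the window starts.
    """
    total = sum(x * haar((t - shift) / scale) for t, x in enumerate(signal))
    return total / math.sqrt(scale)


def multi_scale_response(signal, scales, shift=0):
    """Evaluate the Haar response at several window sizes for one shift."""
    return {a: wavelet_coefficient(signal, a, shift) for a in scales}


# Toy non-stationary signal (an assumption for illustration): a step at position 4.
signal = [0.0] * 4 + [1.0] * 4

# Compare three window sizes, all anchored at position 0.
responses = multi_scale_response(signal, scales=[2, 4, 8])

for scale, value in responses.items():
    print(f"scale={scale}: coefficient={value:.4f}")
```

With the window anchored at position 0, scales 2 and 4 cover only the flat part of the signal and return zero, while scale 8 spans the step and returns a nonzero coefficient. A representation restricted to one scale would see only one of these views.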

Peer Reviews

Decision: ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. The paper is easy to follow.
2. The length extrapolation problem is important for language models.

Weaknesses

1. **There exist gaps from the motivation to the proposed approach**. In Sections 3 and 4, the authors analyze the relationship between RoPE and the wavelet transform, and the properties of ALiBi-like positional encodings. The ability of ALiBi to accommodate multiple window sizes is identified as the key to better length extrapolation performance, while RoPE is claimed to be worse. Based on these statements, a natural question is: what is the advantage of using the wavelet transform? Th

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- The authors provide a solid theoretical foundation by drawing parallels between RoPE and wavelet transforms, and by extending this analogy to propose their method.
- The proposed method has the potential to be widely applicable to various transformer-based models.
- The paper is easy to read and positions itself clearly with respect to related work.

Weaknesses

- The paper primarily evaluates the method on language modeling tasks. It would be valuable to see how the approach generalizes to other NLP tasks such as question answering or text summarization.
- The paper does not provide sufficient experimental evidence to support the authors' claim that Wavelet Transform can capture the dynamic changes in a sequence over positions (L84-85).
- See my questions/suggestions below

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The paper is well-written barring some typographical errors. It motivates the problem well, starts with a well-defined goal, describes existing methods clearly, and presents the proposed method in a manner which is easy to appreciate.
- The paper tackles a critical problem of context length extrapolation which often arises in practical settings. The method holds significance, not just for the language modeling community, but also other domains such as time series forecasting where context leng

Weaknesses

- While the discussion of related methods is generally well done, discussion of RoPE scaling techniques (linear, NTK-aware) is missing. A discussion and comparison with these techniques would significantly improve the positioning of this work.
- The results reported in sections 6 and 7 are excellent proofs of concept but they lack comprehensiveness. Particularly, in section 7, evaluations beyond the CodeParrot dataset would be needed to thoroughly appreciate the proposed method. Furthermore, the

Videos

Wavelet-based Positional Representation for Long Context · slideslive

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques

Methods: Softmax · Attention Is All You Need