Enhanced Momentum with Momentum Transformers
Max Mason, Waasi A Jagirdar, David Huang, Rahul Murugan

TL;DR
This paper introduces an enhanced Momentum Transformer model that combines attention mechanisms with LSTMs to better capture long-term dependencies in time-series trading. It reports returns comparable to prior models, but with higher volatility.
Contribution
The paper extends previous Momentum Transformer architectures to equities, integrating attention with LSTMs for better long-term dependency modeling in stock trading.
Findings
Average return of 4.14%
Sharpe ratio of 1.12, reflecting higher volatility than prior models
Model adapts well to market changes such as the COVID-19 pandemic
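The Sharpe ratio quoted in the findings relates return to volatility, so two strategies with the same average return can score very differently. The following is a minimal sketch of an annualized Sharpe ratio, assuming daily returns, 252 trading days per year, and a zero risk-free rate. These conventions, the function name `annualized_sharpe`, and the two return series are illustrative assumptions, not details taken from the paper.

```python
import math
import statistics


def annualized_sharpe(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a series of periodic returns.

    Assumptions (not from the paper): returns are daily, there are
    252 periods per year, and the risk-free rate defaults to zero.
    """
    excess = [r - risk_free_daily for r in daily_returns]
    mean_excess = statistics.fmean(excess)
    volatility = statistics.stdev(excess)  # sample standard deviation
    if volatility == 0:
        raise ValueError("Sharpe ratio is undefined for zero volatility")
    return mean_excess / volatility * math.sqrt(periods_per_year)


# Two hypothetical return series with the same mean but different volatility.
calm = [0.001, 0.002, 0.001, 0.002]
choppy = [0.011, -0.008, 0.011, -0.008]

sharpe_calm = annualized_sharpe(calm)
sharpe_choppy = annualized_sharpe(choppy)
```

Because both series have the same mean return, the lower score for `choppy` comes entirely from its larger standard deviation.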
Abstract
The primary objective of this research is to build a Momentum Transformer that is expected to outperform benchmark time-series momentum and mean-reversion trading strategies. We extend the ideas introduced in the paper "Trading with the Momentum Transformer: An Intelligent and Interpretable Architecture" to equities, as the original paper focuses primarily on futures and equity indices. Unlike conventional Long Short-Term Memory (LSTM) models, which operate sequentially and are optimized for processing local patterns, an attention mechanism equips our architecture with direct access to all prior time steps in the training window. This hybrid design, combining attention with an LSTM, enables the model to capture long-term dependencies, enhance performance in scenarios accounting for transaction costs, and seamlessly adapt to evolving market conditions, such as those witnessed during…
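The idea of attention giving each time step direct access to all prior steps can be illustrated with causal scaled dot-product attention over a sequence of hidden states. This is a simplified sketch, not the paper's architecture: it assumes the hidden states (for example, LSTM outputs) serve as queries, keys, and values at once, with no learned projections, no multiple heads, and no gating. The function name `causal_attention` and the toy inputs are hypothetical.

```python
import math


def causal_attention(hidden_states):
    """Causal scaled dot-product attention over a list of T vectors.

    Simplification (an assumption, not the paper's design): each hidden
    state is used directly as query, key, and value.
    Returns (contexts, weights), where weights[t][s] is the attention
    that step t pays to step s, and is zero for every s > t.
    """
    num_steps = len(hidden_states)
    dim = len(hidden_states[0])
    contexts, all_weights = [], []
    for t, query in enumerate(hidden_states):
        # Score the query against steps 0..t only, so no future leakage.
        scores = [
            sum(q * k for q, k in zip(query, hidden_states[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible steps.
        top = max(scores)
        exps = [math.exp(score - top) for score in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Context vector: attention-weighted sum of visible hidden states.
        context = [
            sum(w * hidden_states[s][j] for s, w in enumerate(weights))
            for j in range(dim)
        ]
        contexts.append(context)
        all_weights.append(weights + [0.0] * (num_steps - t - 1))
    return contexts, all_weights


# Toy stand-in for LSTM outputs: three time steps, two features each.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contexts, weights = causal_attention(states)
```

Each row of `weights` sums to one and has zeros for future steps, so the last step can draw on the first step directly rather than relying only on information carried forward through the recurrent state.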
Taxonomy
Topics: Experimental and Theoretical Physics Studies
Methods: Attention Is All You Need · Linear Layer · Dropout · Dense Connections · Byte Pair Encoding · Multi-Head Attention · Adam · Layer Normalization · Sigmoid Activation · Position-Wise Feed-Forward Layer
