FATE: Focal-modulated Attention Encoder for Multivariate Time-series Forecasting
Tajamul Ashraf, Janibul Bashir

TL;DR
FATE is a novel transformer-based model with focal modulation for multivariate time-series forecasting, explicitly capturing spatiotemporal correlations and providing interpretability, outperforming state-of-the-art methods across diverse datasets.
Contribution
Introduction of FATE, a transformer architecture with tensorized focal modulation for improved multivariate time-series forecasting and interpretability.
Findings
FATE outperforms existing models on seven real-world datasets.
FATE demonstrates strong generalization to various multivariate forecasting tasks.
The proposed modulation scores enhance interpretability of environmental feature influences.
Abstract
Climate change stands as one of the most pressing global challenges of the twenty-first century, with far-reaching consequences such as rising sea levels, melting glaciers, and increasingly extreme weather patterns. Accurate forecasting is critical for monitoring these phenomena and supporting mitigation strategies. While recent data-driven models for time-series forecasting, including CNNs, RNNs, and attention-based transformers, have shown promise, they often struggle with sequential dependencies and limited parallelization, especially in long-horizon, multivariate meteorological datasets. In this work, we present Focal Modulated Attention Encoder (FATE), a novel transformer architecture designed for reliable multivariate time-series forecasting. Unlike conventional models, FATE introduces a tensorized focal modulation mechanism that explicitly captures spatiotemporal correlations in…
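To make the idea of focal modulation along the temporal axis more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation, and it makes several assumptions. The input is taken to be a nested list of shape (T, S, P), read here as time steps, stations, and features. A clamped moving average stands in for the learned depth-wise convolutions. A sigmoid of the scaled input stands in for learned gate projections. The names `moving_average`, `focal_modulate_time`, `kernels`, and `gate_weights` are hypothetical.

```python
import math


def moving_average(series, kernel):
    """Centered moving average with edge clamping (stand-in for a learned depth-wise conv)."""
    half = kernel // 2
    out = []
    for t in range(len(series)):
        lo, hi = max(0, t - half), min(len(series), t + half + 1)
        window = series[lo:hi]
        out.append(sum(window) / len(window))
    return out


def focal_modulate_time(x, kernels=(3, 5), gate_weights=None):
    """Focal-modulation-style mixing along the time axis of a (T, S, P) nested list.

    Each (station, feature) series is processed independently, so the
    3D layout of the input is preserved in the output.
    """
    T, S, P = len(x), len(x[0]), len(x[0][0])
    levels = len(kernels) + 1  # local focal levels plus one global level
    if gate_weights is None:
        gate_weights = [1.0] * levels  # assumption: fixed, not learned
    y = [[[0.0] * P for _ in range(S)] for _ in range(T)]
    for s in range(S):
        for p in range(P):
            series = [x[t][s][p] for t in range(T)]
            # Hierarchical context: each level smooths the previous level
            # with a wider window, so the receptive field grows per level.
            contexts, ctx = [], series
            for k in kernels:
                ctx = moving_average(ctx, k)
                contexts.append(ctx)
            # Global level: average of the coarsest context over all time steps.
            contexts.append([sum(ctx) / T] * T)
            for t in range(T):
                # One gate per level, computed from the value at time t.
                gates = [1.0 / (1.0 + math.exp(-w * series[t])) for w in gate_weights]
                modulator = sum(g * c[t] for g, c in zip(gates, contexts))
                # The query (here the raw input) is multiplied by the modulator.
                y[t][s][p] = series[t] * modulator
    return y


# Usage: 4 time steps, 2 stations, 3 features.
x = [[[0.1 * (t + s + p) for p in range(3)] for s in range(2)] for t in range(4)]
y = focal_modulate_time(x)
print(len(y), len(y[0]), len(y[0][0]))  # 4 2 3
```

In this sketch the context aggregation replaces pairwise query-key attention scores. A real model would learn the convolution kernels, the gates, and the query projection.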
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Clear architectural novelty: a principled tensorized design that preserves temporal and feature axes; focal modulation tailored to time-series (temporal focal groups) rather than spatial grids; cross-axis modulation for multivariate dependencies.
2. Interpretability: dual modulation scores linking heads to stations, with compelling visualizations that show evolving spatial focus as horizon increases.
3. Strong empirical performance: consistent improvements on diverse datasets, including large-
1. Hyperparameters differ substantially across models (batch sizes, layers/heads), and several strong modern baselines are missing or lightly tuned.
2. Claims of “moderate overhead” vs. Transformers are not backed by FLOPs/latency/memory scaling curves across sequence length, stations, and features; no wall-clock comparisons at different horizons or ablations on focal levels.
3. For Traffic, FATE’s MAE is slightly worse than PatchTST (though MSE improves). A broader analysis of error distribution
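A ranking that flips between MAE and MSE, as noted for Traffic, usually reflects different error distributions: MSE penalizes a few large misses more heavily than MAE does. The toy example below uses made-up error values, not results from the paper, and the names `model_a` and `model_b` are hypothetical.

```python
def mae(errors):
    """Mean absolute error over a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def mse(errors):
    """Mean squared error over a list of residuals."""
    return sum(e * e for e in errors) / len(errors)


# Hypothetical residuals, chosen only to illustrate the effect.
model_a = [1.0, 1.0, 1.0, 1.0]  # uniformly moderate errors
model_b = [0.0, 0.0, 0.0, 3.0]  # mostly exact, one large miss

print(mae(model_a), mse(model_a))  # 1.0 1.0
print(mae(model_b), mse(model_b))  # 0.75 2.25
```

Here `model_a` has the worse MAE but the better MSE, which is the same pattern the review describes. This is why an error-distribution analysis is more informative than either aggregate metric alone.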
1. This paper extends focal modulation (from vision) to a tensorized, dual-axis scheme for multivariate time series, preserving temporal and variable axes and introducing dual modulation scores for interpretability.
2. Broad evaluation across 7 datasets with long-horizon regimes is conducted; consistent accuracy gains are reported, including on large-scale traffic where FATE improves MAE/MSE over the best GNN baselines. Qualitative modulation maps align with the narrative about dynamic spatial d
1. The distinctions from recent tensor/patch or efficiency-oriented methods (e.g., tensorized attention variants, multi-scale mixers) are not clear; ablations isolate focal levels and gating qualitatively, but do not study which tensorization choices (e.g., per-axis PE, grouped projections) are essential vs. incidental. A one-for-one replacement study (e.g., FATE vs. Time-tensorized attention with identical training) is missing.
2. The paper states “moderate overhead” and “comparable to baseline
- **Innovative Architectural Design** The use of tensorized focal modulation to preserve the 3D input structure is a significant innovation. This approach effectively models spatiotemporal dependencies and cross-feature interactions, which are critical for multivariate time-series forecasting.
- **Thorough Evaluation** The authors conduct comprehensive experiments, including ablation studies, to validate the impact of each component. Visualizations of focal modulation and attention dyna
- **Limited Discussion on Computational Trade-offs** While the paper highlights that FATE introduces moderate computational overhead, it does not provide a detailed comparison of training and inference times against lightweight baselines like linear models.
- **Inconsistent Performance on Certain Datasets** Although FATE achieves SOTA results on most benchmarks, its performance on the Europe dataset is relatively weaker, with LSTM models outperforming it in several scenarios.
- **Spars
The exploration of spatial-temporal data with the concept of locality is plausible. The experiment expands on multiple areas to show the model's performance gains.
**W1:** Notation lacks definitions: The notation system can be improved. Many terms are used before they are defined. For example, in lines 51-53, the authors use T, S, and P for the 3-dimensional tensor without explaining what each dimension stands for. In Eq. (1), PE(\cdot) is also not explained, and there are many more such cases. Additionally, many notations are not typeset in math format. While these issues do not directly obscure the main ideas, they reduce the clarity and readability of the paper.
**W2:**
Taxonomy
Topics: Neural Networks and Applications · Image and Signal Denoising Methods
Methods: Linear Layer · Residual Connection · Multi-Head Attention · Tanh Activation · Adam · Layer Normalization · Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding
