Toto: Time Series Optimized Transformer for Observability
Ben Cohen, Emaad Khwaja, Kan Wang, Charles Masson, Elise Ramé, Youssef Doubli, Othmane Abou-Amal

TL;DR
Toto is a transformer-based foundation model for time series forecasting, trained on one trillion time series data points, that reports state-of-the-art results on observability metrics and on general-purpose benchmarks in domains such as electricity and weather.
Contribution
The report introduces Toto, described as the first general-purpose time series forecasting foundation model specifically tuned for observability metrics, trained on the largest dataset among currently published time series foundation models.
Findings
Outperforms existing models on observability data
Achieves state-of-the-art zero-shot performance on benchmarks
Trained on one trillion data points
Abstract
This technical report describes the Time Series Optimized Transformer for Observability (Toto), a new state-of-the-art foundation model for time series forecasting developed by Datadog. In addition to advancing the state of the art on generalized time series benchmarks in domains such as electricity and weather, this model is the first general-purpose time series forecasting foundation model to be specifically tuned for observability metrics. Toto was trained on a dataset of one trillion time series data points, the largest among all currently published time series foundation models. Alongside publicly available time series datasets, 75% of the data used to train Toto consists of fully anonymous numerical metric data points from the Datadog platform. In our experiments, Toto outperforms existing time series foundation models on observability data. It does this while also excelling…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
* The application to time series forecasting in observability is intriguing and valuable. A step toward unified benchmarking and a useful dataset is commendable.
* The design of proportional factorized space-time attention potentially offers an expressive and efficient backbone for multivariate time series modeling.
* The work lacks clear motivation; the authors’ objective is difficult to discern.
* If the aim is a superior observability model, why opt for a foundation model? Any deep forecasting backbone could be trained on similar data for comparison. Instead, only zero-shot models are evaluated on the observability benchmark, where presumably the proposed model was pre-trained.
* If the goal is to demonstrate generalization of the model trained on observability data to other domains, more comprehensi
**S1:** The writing style is concise, and the methodology is well-organized and clearly described.
**S2:** The paper introduces a new dataset, a large dataset of proprietary observability metrics, which contains statistical characteristics absent from existing datasets.
**S3:** The study employs a dual perspective in the attention mechanism (space-wise and time-wise), and uses this model for pre-training a foundation model to validate its performance.
**W1:** The paper’s two primary contributions are the novel observability data and the newly designed foundation model. However, neither the data nor the model are publicly available, raising concerns about reproducibility and broader applicability.
**W2:** The technical novelty is limited; for instance, the probabilistic forecasting using the Student-T mixture model (SMM) is an extension, as the Student-T distribution as a prediction head has already been proposed [1].
**W3:** In Table 1, the
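W2 above refers to a Student-T mixture model (SMM) prediction head. As a rough illustration of what such a head optimizes, the sketch below computes the negative log-likelihood of observations under a mixture of location-scale Student-t components. It is not the authors' implementation: the function names and the parameter layout (per-component weights, degrees of freedom, locations, scales) are assumptions made for this example, and in a real model these parameters would presumably be emitted by the network for each forecast step.

```python
import math


def student_t_logpdf(x, df, loc, scale):
    """Log density of a location-scale Student-t distribution at x."""
    z = (x - loc) / scale
    return (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
        - math.log(scale)
        - (df + 1) / 2 * math.log1p(z * z / df)
    )


def mixture_logpdf(x, weights, dfs, locs, scales):
    """Log density of a K-component Student-t mixture at x.

    Assumes strictly positive weights that sum to 1. Uses log-sum-exp
    so that very small component densities do not underflow.
    """
    terms = [
        math.log(w) + student_t_logpdf(x, df, loc, s)
        for w, df, loc, s in zip(weights, dfs, locs, scales)
    ]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def mixture_nll(xs, weights, dfs, locs, scales):
    """Mean negative log-likelihood of observations xs (hypothetical loss)."""
    total = sum(mixture_logpdf(x, weights, dfs, locs, scales) for x in xs)
    return -total / len(xs)


if __name__ == "__main__":
    # Two illustrative components: a narrow one near 0 and a wide one near 5.
    print(mixture_nll([0.1, 4.8, 0.3], [0.7, 0.3], [3.0, 5.0], [0.0, 5.0], [0.5, 2.0]))
```

A single-component mixture reduces to the plain Student-t head that the reviewer says was proposed earlier; adding components lets the predicted distribution be multimodal or skewed, which is the sense in which the mixture is an extension.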
The main strength of the paper is that it introduces a foundation model specifically for the observability domain, which is an important practical problem. Using a foundation model that can be rolled out broadly on an entire system without local model fitting or retraining is a sound approach to address this problem [**Originality and Significance**]. The paper is clearly written and easy to follow [**Clarity**].
The two main weaknesses of this manuscript are that the architectural contributions are incremental and the validation on the long sequence forecasting benchmark requires either revision or clarification. **Incremental contribution**: I would argue that the main contribution of this paper is the observability dataset. A high quality dataset for this domain would be helpful to further develop foundation models for observability specifically and for time series foundation models more generally.
The paper is well written and provides an insightful discussion on the specific case of observability time series data. Some design choices, such as the factorized attention mechanism and the mixture of Student's-t head, have been reasonably justified.
- The paper proposes another pretrained model for time series, alongside existing ones such as Chronos, TimesFM and Moirai. When compared with these existing models, the technical contribution of Toto is marginal. The mixture of distributions idea is not new and has been studied in forecasting literature before, most recently even in the context of pretrained models (Moirai). Furthermore, the alternating spatial and temporal attention blocks idea is incremental.
- In the absence of strong methodological contributions,
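Several of the reviews above mention attention that is factorized into space-wise and time-wise blocks. The sketch below is one way to picture that factorization, not the model's actual architecture: time-wise attention mixes information across time steps within each variate, while space-wise attention mixes across variates at each time step. Everything here is a simplifying assumption for illustration, including the function names, the use of unprojected single-head attention, and the `time_blocks_per_space_block` ratio, whose value is hypothetical and not taken from the report.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention over a list of vectors.

    Simplification: queries, keys and values are the tokens themselves
    (no learned projections, no masking).
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def time_wise(x):
    """x[variate][time] is a vector; attend across time within each variate."""
    return [self_attention(series) for series in x]


def space_wise(x):
    """Attend across variates independently at each time step."""
    n_var, n_time = len(x), len(x[0])
    out = [[None] * n_time for _ in range(n_var)]
    for t in range(n_time):
        mixed = self_attention([x[v][t] for v in range(n_var)])
        for v in range(n_var):
            out[v][t] = mixed[v]
    return out


def factorized_stack(x, time_blocks_per_space_block=2, groups=1):
    """Alternate several time-wise blocks with one space-wise block."""
    for _ in range(groups):
        for _ in range(time_blocks_per_space_block):
            x = time_wise(x)
        x = space_wise(x)
    return x
```

Compared with attending over all variate-time pairs at once, each block here only compares tokens along one axis, which is where the efficiency argument for factorization comes from. A real transformer block would also include learned projections, multiple heads, residual connections, normalization and, for forecasting, causal masking along the time axis; all of these are omitted.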
Taxonomy
Topics: Time Series Analysis and Forecasting · Neural Networks and Applications
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Adam · Dropout
