Performance Prediction for Large Systems via Text-to-Text Regression
Yash Akhauri, Bryan Lewandowski, Cheng-Hsi Lin, Adrian N. Reyes, Grant C. Forbes, Arissa Wongpanich, Bangding Yang, Mohamed S. Abdelfattah, Sagi Perel, Xingyou Song

TL;DR
This paper introduces text-to-text regression, in which an encoder-decoder model predicts system metrics directly from unstructured text such as configuration files and system logs. On Borg, Google's compute cluster scheduling system, the approach outperforms traditional tabular methods.
Contribution
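The paper presents a scalable, general method for predicting system outcomes directly from text, in settings where feature engineering for tabular regression is often infeasible. On Borg, the method is substantially more accurate than tabular baselines and adapts to new tasks with only 500 few-shot examples.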
Findings
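A 60M parameter encoder-decoder, trained from random initialization, achieves up to 0.99 rank correlation (0.9 on average) across the Borg fleet and 100x lower MSE than tabular methods.
The model adapts to new tasks with only 500 few-shot examples and captures the densities of complex outcome distributions.
Ablation studies highlight the importance of using encoders, increasing sequence length, and uncertainty quantification.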
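The two headline metrics measure different things: rank correlation only checks whether predictions are ordered correctly, while MSE penalizes the size of each error. The following is a minimal sketch of both, assuming Spearman's rank correlation (the summary does not state which rank correlation the paper reports) and using only the Python standard library. The function names are illustrative and are not taken from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(y_true, y_pred):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(y_true), ranks(y_pred))


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred))


# Made-up numbers: the ordering is perfect, but the magnitudes are far off.
targets = [1, 2, 3, 4]
predictions = [1, 4, 9, 100]
print(spearman(targets, predictions), mse(targets, predictions))
```

In this toy case the ordering is perfect while the squared error is large, which is why a summary that reports both numbers says more than either one alone.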
Abstract
In many industries, predicting metric outcomes of large systems is a fundamental problem, driven largely by traditional tabular regression. However, such methods struggle on complex systems data in the wild such as configuration files or system logs, where feature engineering is often infeasible. We propose text-to-text regression as a general, scalable alternative. For predicting resource efficiency on Borg, Google's massive compute cluster scheduling system, a 60M parameter encoder-decoder, trained from random initialization, achieves up to a near perfect 0.99 (0.9 average) rank correlation across the entire fleet, and 100x lower MSE than tabular approaches. The model also easily adapts to new tasks in only 500 few-shot examples and captures the densities of complex outcome distributions. Ablation studies highlight the importance of using encoders, increasing sequence length, and the…
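To make the idea of text-to-text regression concrete, the sketch below shows one way the two ends of such a pipeline could look: the input is kept as raw text, and the numeric target is written as a short token sequence that a decoder could emit and that can be parsed back into a float. The sign, digit, and exponent token scheme, the `digits` parameter, the config keys, and all function names are assumptions for illustration; the abstract does not specify the paper's actual serialization or tokenization.

```python
import json


def serialize_input(config):
    """Render a (hypothetical) system config as plain text for the encoder."""
    return json.dumps(config, sort_keys=True)


def encode_number(value, digits=4):
    """Write a float as sign, mantissa-digit, and exponent tokens.

    Assumed scheme: scientific notation with `digits` significant digits.
    """
    sci = f"{abs(value):.{digits - 1}e}"  # e.g. '1.235e+02'
    mantissa, exponent = sci.split("e")
    sign = "<->" if value < 0 else "<+>"
    digit_tokens = [f"<{d}>" for d in mantissa.replace(".", "")]
    return [sign] + digit_tokens + [f"<E{int(exponent)}>"]


def decode_number(tokens):
    """Invert encode_number: parse the tokens back into a float."""
    sign = -1.0 if tokens[0] == "<->" else 1.0
    digits = "".join(t[1:-1] for t in tokens[1:-1])  # strip '<' and '>'
    exponent = int(tokens[-1][2:-1])                 # strip '<E' and '>'
    mantissa = int(digits) / 10 ** (len(digits) - 1)
    return sign * mantissa * 10.0 ** exponent


# Hypothetical example: keys and values are invented for illustration.
example_config = {"job": "batch-indexer", "cpus": 16, "flags": ["--fast"]}
text_in = serialize_input(example_config)
tokens_out = encode_number(123.456)
print(text_in)
print(tokens_out, decode_number(tokens_out))
```

Under this assumed scheme, regression becomes ordinary sequence prediction: the model reads `text_in` and is trained to emit `tokens_out`. Sampling several decoded sequences for the same input would then give a rough picture of the predicted outcome distribution.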
Taxonomy
Topics: Software System Performance and Reliability · Machine Learning in Materials Science · Distributed and Parallel Computing Systems
