Making Pre-trained Language Models Great on Tabular Prediction
Jiahuan Yan, Bo Zheng, Hongxia Xu, Yiheng Zhu, Danny Z. Chen, Jimeng Sun, Jian Wu, Jintai Chen

TL;DR
This paper introduces TP-BERTa, a pre-trained language model tailored for tabular data prediction, using a novel tokenization method to convert numerical features into discrete tokens, improving transferability and performance.
Contribution
The paper proposes TP-BERTa, a novel pre-trained language model for tabular data that employs relative magnitude tokenization and intra-feature attention to enhance prediction accuracy.
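The idea behind relative magnitude tokenization can be illustrated with a minimal sketch. This is not the authors' implementation: the quantile binning rule, the number of bins, the embedding width, and the helper names `fit_bin_edges` and `to_magnitude_tokens` are all assumptions chosen for illustration. The sketch only shows the general pattern of mapping a scalar to a discrete token that encodes where the value sits within its own feature's distribution, then looking up an embedding from a token table shared across features.

```python
import numpy as np

def fit_bin_edges(values, n_bins=4):
    """Interior quantile edges for one numerical feature (assumed binning rule)."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]  # drop the 0.0 and 1.0 quantiles
    return np.quantile(values, qs)

def to_magnitude_tokens(values, edges):
    """Map each scalar to the index of the bin it falls into."""
    return np.searchsorted(edges, values, side="right")

# Hypothetical feature column and a hypothetical shared token table.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
edges = fit_bin_edges(values, n_bins=4)
tokens = to_magnitude_tokens(values, edges)

rng = np.random.default_rng(0)
token_table = rng.normal(size=(4, 16))  # one 16-dim embedding per magnitude token
embeddings = token_table[tokens]        # shape: (n_values, 16)

print(tokens)
```

Because the token index depends only on a value's rank within its feature, two features on very different scales (for example age in years and income in dollars) would reuse the same small vocabulary of magnitude tokens in this sketch.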
Findings
TP-BERTa outperforms existing tabular DNNs in experiments.
TP-BERTa is competitive with Gradient Boosted Decision Trees.
The model effectively integrates feature names and values for better predictions.
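How feature names and values might be fused can be sketched with a single-head self-attention over one feature's name-token embeddings plus its value embedding. This is a hedged illustration, not the paper's module: the single head, the random projection matrices, the mean pooling, and the function name `intra_feature_attention` are assumptions made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def intra_feature_attention(name_emb, value_emb, w_q, w_k, w_v):
    """Fuse one feature's name tokens (n, d) with its value embedding (d,).

    Returns a single (d,) vector for the feature and the attention matrix.
    """
    tokens = np.vstack([name_emb, value_emb[None, :]])  # (n + 1, d)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])             # scaled dot-product scores
    attn = softmax(scores, axis=-1)                     # each row sums to 1
    fused = attn @ v                                    # (n + 1, d)
    return fused.mean(axis=0), attn                     # mean pooling is an assumption

# Hypothetical inputs: a 3-token feature name and one value embedding, d = 8.
rng = np.random.default_rng(0)
d = 8
name_emb = rng.normal(size=(3, d))
value_emb = rng.normal(size=d)
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

feature_vec, attn = intra_feature_attention(name_emb, value_emb, w_q, w_k, w_v)
print(feature_vec.shape, attn.shape)
```

Restricting attention to the tokens of a single feature keeps each name bound to its own value before features interact with one another, which is the intuition this sketch is meant to convey.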
Abstract
The transferability of deep neural networks (DNNs) has made significant progress in image and language processing. However, due to the heterogeneity among tables, such DNN bonus is still far from being well exploited on tabular data prediction (e.g., regression or classification tasks). Condensing knowledge from diverse domains, language models (LMs) possess the capability to comprehend feature names from various tables, potentially serving as versatile learners in transferring knowledge across distinct tables and diverse prediction tasks, but their discrete text representation space is inherently incompatible with numerical feature values in tables. In this paper, we present TP-BERTa, a specifically pre-trained LM for tabular data prediction. Concretely, a novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an…
Peer Reviews
Decision · ICLR 2024 spotlight
1. **Originality:**
   - The approach of "relative magnitude tokenization" (RMT) is an inventive technique for adapting pre-trained language models to tabular data. This method of converting scalar values to a tokenized format to be perceived as meaningful words within the language model's vocabulary stands out as a significant contribution.
   - The intra-feature attention (IFA) module to fuse feature name and value embeddings is another commendable addition to the field. This ensures a mor
1. It would have been beneficial if the paper delved deeper into the limitations and potential pitfalls of the relative magnitude tokenization technique. Understanding how the granularity of this tokenization might impact model performance, especially in cases with intricate numerical nuances, is crucial.
2. Comparisons with Gradient Boosted Decision Trees are noted, but an in-depth discussion regarding scenarios where GBDTs might outshine or underperform against the proposed TP-BERTa would prov
* Paper is mostly clearly written.
* It is very unclear whether distributional patterns from common tabular data generalize to other data with completely different distributions. In text, pre-training can learn generic linguistic features such as the meaning of English words or grammar. In tabular data, this core assumption does not hold because the data differ drastically between sources.
* The empirical comparison does not essentially provide evidence that pre-training is the main factor to improve the performance of the
1. According to Table 1, the performance of TP-BERTa is strong. It consistently outperforms other tabular DL models and is comparable to XGBoost and CatBoost.
2. The idea of applying relative magnitude tokens (RMT) in tabular DL models is novel to my knowledge. As pointed out by the "On Embeddings for Numerical Features in Tabular Deep Learning" paper, appropriately embedding numerical features is important for tabular DL models. RMT can enhance language models in handling the numerical va
1. Compared with RMT, the intra-feature attention method is only marginally novel and does not show a significant performance boost.
2. The authors have not studied the impact of pre-training data diversity on the performance of TP-BERTa. For example, how good would TP-BERTa be if it were pre-trained on only 10 classification datasets and 10 regression datasets?
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
