Multi-branch of Attention Yields Accurate Results for Tabular Data
Xuechen Li, Yupeng Li, Jian Liu, Xiaolin Jin, Xin Hu

TL;DR
MAYA introduces a multi-branch attention mechanism within a transformer framework to better handle feature heterogeneity in tabular data, achieving superior classification and regression results.
Contribution
The paper proposes MAYA, a novel transformer-based framework with multi-branch attention and collaborative learning for improved tabular data modeling.
Findings
Outperforms existing transformer-based methods on various datasets.
Effectively fuses heterogeneous features with limited parameter increase.
Achieves state-of-the-art results in tabular classification and regression.
Abstract
Tabular data inherently exhibits significant feature heterogeneity, but existing transformer-based methods lack specialized mechanisms to handle this property. To bridge the gap, we propose MAYA, an encoder-decoder transformer-based framework. In the encoder, we design a Multi-Branch of Attention (MBA) that constructs multiple parallel attention branches and averages the features at each branch, effectively fusing heterogeneous features while limiting parameter growth. Additionally, we employ collaborative learning with a dynamic consistency weight constraint to produce more robust representations. In the decoder stage, cross-attention is utilized to seamlessly integrate tabular data with corresponding label features. This dual-attention mechanism effectively captures both intra-instance and inter-instance interactions. We evaluate the proposed method on a wide range of datasets and…
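The multi-branch idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each branch is a single-head scaled dot-product self-attention over feature tokens, that branch outputs are fused by a plain unweighted mean, and that the weights are random placeholders. The helper names (`attention`, `multi_branch_attention`, `random_matrix`) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over feature tokens."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def multi_branch_attention(x, branches):
    """Run every branch on the same input and average the outputs element-wise.

    Assumption: an unweighted mean stands in for the paper's fusion rule.
    """
    outs = [attention(x, *params) for params in branches]
    n = len(outs)
    rows, cols = len(outs[0]), len(outs[0][0])
    return [[sum(o[i][j] for o in outs) / n for j in range(cols)] for i in range(rows)]

def random_matrix(rng, rows, cols):
    """Small random matrix used as a placeholder for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_features, d_model, n_branches = 5, 4, 3
x = random_matrix(rng, n_features, d_model)  # one row per feature token
branches = [
    tuple(random_matrix(rng, d_model, d_model) for _ in range(3))  # (wq, wk, wv)
    for _ in range(n_branches)
]
fused = multi_branch_attention(x, branches)
```

Because the branch outputs are averaged rather than concatenated, the fused output keeps the width `d_model` no matter how many branches are added, so no wider projection is needed after fusion.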
Peer Reviews
Decision: Submitted to ICLR 2026
- **Novel and Efficient Encoder (MBA)**: The MBA block is the paper's strongest point. It is a well-motivated and clever solution to the parameter-growth problem in standard MHA-based tabular models. The design (parallel branches with weighted averaging) is simple, effective, and well-supported by ablations and visualizations.
- **Strong Results vs. Other Transformers**: The paper is empirically rigorous within its chosen subgroup. It demonstrates SOTA performance against a comprehensive suite
- **Missing Critical Baselines (GBDTs)**: This is a fatal flaw. Any paper claiming SOTA on tabular data must compare against tuned Gradient-Boosted Decision Trees (e.g., XGBoost, CatBoost). The paper only compares against other Transformers. This is not a fair or complete comparison, and the "superior performance" claim is unsubstantiated.
- **Missing Critical Baselines (TabR)**: The IAIL decoder's inference mechanism, using the entire training set as a Key/Value store, is a kNN-like retrieval m
- **Architectural intuition and originality** I think the Multi-Branch of Attention (MBA) module is a well-conceived and intuitive idea. Tabular data often exhibit highly irregular and non-smooth decision boundaries—an aspect that tree-based models have long exploited through ensemble partitioning. MBA can be viewed as a soft, attention-based analogue of this principle: multiple attention branches learn complementary sub-spaces, akin to an ensemble, but within a unified Transformer framework. T
- **Lack of comparison with TabPFN.** The experiments are extensive but omit a comparison with TabPFN, which has recently set a strong benchmark for tabular learning. Such a comparison would be helpful to establish the true empirical strength of MAYA.
- **Insufficient analysis of data-dependent behavior.** The paper does not examine how the multi-branch attention design performs under different data conditions. Since tabular datasets vary widely in feature type composition, noise level, and s
- The paper uses a well-established benchmark suite, the Grinsztajn benchmark.
- The paper addresses heterogeneous feature distributions, a common source of low performance for neural architectures.
- The paper does extensive hyper-parameter tuning for all methods, and provides the hyper-parameter spaces that were used.
- The paper provides ablation studies for some of the architectural choices.
- The paper does not compare against any state-of-the-art algorithms for regression and classification for tabular data. Limiting the comparison to "transformer-based architectures" is not a meaningful constraint to me. Why would one be interested in the best transformer-based architecture, and not in the best architecture overall? The authors also exclude state-of-the-art pre-trained transformer methods like TabPFNv2, TabDPT and TabICL. A good overview of state-of-the-art methods for tabular cl
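One review above characterizes the decoder's inference as kNN-like retrieval over the training set. A minimal sketch of that reading follows, under loud assumptions: the query and keys are already-encoded instance embeddings, the values are one-hot label vectors, and learned projections are omitted. The function name `cross_attention_predict` and the toy data are hypothetical and not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_predict(query, train_keys, train_labels):
    """Attend from one query embedding to every training embedding.

    Returns the attention-weighted average of the training label vectors,
    which behaves like a soft nearest-neighbour vote.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in train_keys]
    weights = softmax(scores)
    n_out = len(train_labels[0])
    return [sum(w * lab[j] for w, lab in zip(weights, train_labels)) for j in range(n_out)]

# Toy training set: two instances of class 0 and one instance of class 1.
train_keys = [[4.0, 0.0], [3.5, 0.5], [0.0, 4.0]]
train_labels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# A query that lies close to the class-0 instances.
probs = cross_attention_predict([4.0, 0.0], train_keys, train_labels)
```

With one-hot labels the output is a probability vector, since the attention weights sum to one. Training instances whose embeddings align with the query dominate the vote, which is the sense in which the mechanism resembles retrieval.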
Taxonomy
Topics: Advanced Clustering Algorithms Research · Face and Expression Recognition · Bayesian Methods and Mixture Models
