Sampling Foundational Transformer: A Theoretical Perspective
Viet Anh Nguyen, Minh Lenhat, Khoa Nguyen, Duong Duc Hieu, Dao Huu Hung, Truong Son Hy

TL;DR
Sampling Foundational Transformer (SFT) introduces a multi-modality transformer model with a novel sampling mechanism and pseudoconvex formulation, achieving efficient, fast inference and competitive results across various data types.
Contribution
The paper presents SFT, a transformer model capable of handling multiple data modalities with improved efficiency and convergence, using novel sampling and optimization techniques.
Findings
Achieves competitive benchmark results across multiple data modalities.
Offers faster inference compared to specialized models.
Utilizes a novel pseudoconvex formulation for improved convergence.
Abstract
The versatility of the self-attention mechanism has earned transformers great success in almost all data modalities, though they remain limited by quadratic complexity and difficulty of training. To apply transformers across different data modalities, practitioners have to devise clever, modality-specific constructions. In this paper, we propose the Sampling Foundational Transformer (SFT), which can work on multiple data modalities (e.g., point cloud, graph, and sequence) and constraints (e.g., rotational invariance). The existence of such a model is important, as contemporary foundational modeling requires operability on multiple data sources. For efficiency on a large number of tokens, our model relies on our context-aware sampling-without-replacement mechanism for both linear asymptotic computational complexity and real inference time gain. For efficiency, we rely on our newly discovered…
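The abstract excerpt does not spell out how the sampling-without-replacement mechanism works, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes Gumbel-top-k as a stand-in sampler, and the names `gumbel_top_k`, `sampled_attention`, and the per-token `scores` are hypothetical. It shows why letting every query attend to only k sampled tokens costs roughly O(n · k) rather than O(n²).

```python
import heapq
import math
import random


def gumbel_top_k(scores, k, rng):
    """Draw k distinct indices without replacement (Gumbel-top-k trick).

    Assumption: this is a stand-in sampler, not necessarily the one used in SFT.
    Each score is perturbed with independent Gumbel(0, 1) noise and the k
    largest perturbed scores are kept.
    """
    perturbed = []
    for i, s in enumerate(scores):
        # Clamp away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        perturbed.append((s - math.log(-math.log(u)), i))
    return sorted(i for _, i in heapq.nlargest(k, perturbed))


def sampled_attention(queries, keys, values, scores, k, rng):
    """Softmax attention where every query sees only k sampled tokens."""
    idx = gumbel_top_k(scores, k, rng)
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        # Scaled dot-product logits against the k sampled keys only.
        logits = [sum(a * b for a, b in zip(q, keys[j])) / scale for j in idx]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]  # numerically stable softmax
        total = sum(weights)
        out.append([
            sum(w / total * values[j][c] for w, j in zip(weights, idx))
            for c in range(len(values[0]))
        ])
    return idx, out


rng = random.Random(0)
n, d, k = 8, 4, 3
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
scores = [0.0] * n  # hypothetical importance scores; uniform here
idx, out = sampled_attention(tokens, tokens, tokens, scores, k, rng)
```

In a context-aware variant, `scores` would be computed from the tokens themselves (for example by a small learned layer) instead of being fixed; that part is left out here because the excerpt gives no details.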
Taxonomy
Topics: Image and Signal Denoising Methods · Neural Networks and Applications
Methods: Linear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Attentive Walk-Aggregating Graph Neural Network · Softmax · Absolute Position Encodings
