Sampling Foundational Transformer: A Theoretical Perspective

Viet Anh Nguyen, Minh Lenhat, Khoa Nguyen, Duong Duc Hieu, Dao Huu Hung, Truong Son Hy

arXiv:2408.05822 · cs.LG · August 20, 2024

TL;DR

Sampling Foundational Transformer (SFT) is a transformer model for multiple data modalities (point clouds, graphs, and sequences). It combines a novel sampling-without-replacement mechanism with a pseudoconvex formulation, giving fast inference and competitive results across these data types.

Contribution

The paper presents SFT, a single transformer model that handles multiple data modalities, using a context-aware sampling-without-replacement mechanism for efficiency and a pseudoconvex formulation for improved convergence.

Findings

01 Achieves competitive benchmark results across multiple data modalities.

02 Offers faster inference compared to specialized models.

03 Utilizes a novel pseudoconvex formulation for improved convergence.

Abstract

The versatility of the self-attention mechanism has earned transformers great success in almost all data modalities, despite limitations arising from quadratic complexity and difficulty of training. To apply transformers across different data modalities, practitioners have to devise specific, modality-dependent constructions. In this paper, we propose the Sampling Foundational Transformer (SFT), which can work on multiple data modalities (e.g., point cloud, graph, and sequence) and constraints (e.g., rotational invariance). The existence of such a model is important, as contemporary foundational modeling requires operability on multiple data sources. For efficiency on large numbers of tokens, our model relies on our context-aware sampling-without-replacement mechanism for both linear asymptotic computational complexity and real inference-time gain. For efficiency, we rely on our newly discovered…
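The complexity claim in the abstract can be illustrated with a minimal sketch. This is not the paper's mechanism, which the abstract does not specify. It assumes the Gumbel-top-k trick as a stand-in for sampling without replacement, and a learned linear scorer as a stand-in for "context-aware" importance. The names `gumbel_top_k`, `sampled_attention`, and `w_score` are hypothetical and chosen only for illustration.

```python
import numpy as np


def gumbel_top_k(scores, k, rng):
    """Draw k distinct indices without replacement.

    Adding i.i.d. Gumbel noise to the scores and keeping the k largest
    is the Gumbel-top-k trick; it samples without replacement from
    softmax(scores). Requires 1 <= k <= len(scores).
    """
    perturbed = scores + rng.gumbel(size=scores.shape)
    # argpartition places the k largest perturbed scores first (unordered).
    return np.argpartition(-perturbed, k - 1)[:k]


def sampled_attention(x, w_score, k, rng):
    """Every token attends only to k sampled tokens.

    x:       (n, d) token features
    w_score: (d,) hypothetical scoring vector (an assumption, not from the paper)
    The logits matrix is (n, k) instead of (n, n), so for fixed k the
    cost grows linearly in the number of tokens n.
    """
    n, d = x.shape
    scores = x @ w_score                      # (n,) per-token importance
    idx = gumbel_top_k(scores, k, rng)        # k distinct token indices
    keys = x[idx]                             # (k, d) sampled tokens
    logits = x @ keys.T / np.sqrt(d)          # (n, k)
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ keys, idx                # (n, d) outputs, sampled indices


rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
w_score = rng.normal(size=4)
out, idx = sampled_attention(x, w_score, k=5, rng=rng)
```

In this sketch the sampled tokens serve as both keys and values with no learned projections, which keeps the example short; a real layer would add them.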

Taxonomy

Topics: Image and Signal Denoising Methods · Neural Networks and Applications

Methods: Linear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Attentive Walk-Aggregating Graph Neural Network · Softmax · Absolute Position Encodings