Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture
Jingze Shi, Bingheng Wu

TL;DR
This paper introduces Wonderful Matrices, a foundation model architecture that combines sequence transformation and state transformation, and reports lower perplexity, 100% accuracy on multi-query associative recall, and faster expert retrieval.
Contribution
The paper proposes a foundation model architecture that unifies position encoding through rotary position embedding, improves multi-query associative recall through dynamic mask attention, and accelerates expert retrieval through cross domain mixture of experts.
Findings
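Rotary position embedding in the state space duality algorithm reduces the perplexity of the hybrid of quadratic causal self-attention and state space duality by more than 4%.
Dynamic mask attention maintains 100% accuracy on the multi-query associative recall task, an improvement of more than 150% over quadratic causal self-attention and state space duality.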
Expert retrieval speed increases 8-10 times with over 1024 experts.
Abstract
To make the foundation model more efficient and effective, we combine sequence transformation and state transformation. First, we prove that rotary position embedding is applicable in the state space duality algorithm, which reduces the perplexity of the hybrid of quadratic causal self-attention and state space duality by more than 4% and ensures that the combined sequence transformation unifies position encoding. Second, we propose dynamic mask attention, which maintains 100% accuracy on the more challenging multi-query associative recall task, an improvement of more than 150% over quadratic causal self-attention and state space duality, and ensures that the combined sequence transformation selectively filters relevant information. Third, we design cross domain mixture of experts, which makes the computational speed of expert retrieval with more than 1024 experts 8…
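For readers unfamiliar with rotary position embedding, the following is a minimal sketch of the standard technique in plain Python. It is not the authors' implementation and does not cover the state space duality setting discussed in the abstract. The names rope and dot, the base of 10000, and the example vectors are assumptions chosen for illustration. The sketch shows the property that makes rotary embedding a relative position encoding: the score between a rotated query and a rotated key depends only on the distance between their positions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    Hypothetical helper for illustration: pair (i, i + 1) is rotated by
    the angle position * base ** (-i / d), where d is the vector length.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        # Standard 2D rotation applied to the pair (x, y).
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


# Made-up query and key vectors, used only to exercise the functions.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Both pairs of positions are 3 apart, so the two scores should agree
# up to floating-point error.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
```

Because each pair is only rotated, vector norms are preserved, and shifting both positions by the same offset leaves the score unchanged.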
Code & Models
- 🤗 SmallDoge/Doge-60M · model · 722 dl · ♡ 4
- 🤗 wubingheng/Doge-197M-Medical-SFT · model · 6 dl · ♡ 2
- 🤗 SmallDoge/Doge-20M-Instruct · model · 701 dl · ♡ 5
- 🤗 SmallDoge/Doge-60M-Instruct · model · 653 dl · ♡ 6
- 🤗 SmallDoge/Doge-20M · model · 1.5k dl · ♡ 9
- 🤗 SmallDoge/Doge-20M-Instruct-SFT · model · 9 dl
- 🤗 SmallDoge/Doge-60M-Instruct-SFT · model · 6 dl
- 🤗 dewdev/Doge-60M-Instruct-ONNX · model · 5 dl
- 🤗 SmallDoge/Doge-160M · model · 1.6k dl · ♡ 6
- 🤗 SmallDoge/Doge-160M-Instruct-SFT · model · 1 dl
Taxonomy
Topics: Matrix Theory and Algorithms · Parallel Computing and Optimization Techniques · Computational Geometry and Mesh Generation
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
