OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser

Jingze Shi; Ting Xie; Bingheng Wu; Chunjun Zheng; Kai Wang

arXiv:2406.16495·cs.CL·July 23, 2024

TL;DR

This paper introduces OTCE, a hybrid architecture that combines a selective state space model with quadratic self-attention through cross-domain experts. The authors report that, at a small scale, it can compete with well-known medium-scale open-source language models in language modeling tasks.

Contribution

It proposes a novel biomimetic architecture that integrates state space models and quadratic attention through hybrid experts with cross-sharing domains.

Findings

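01. At a small scale, OTCE is reported to compete with well-known medium-scale open-source language models.

02. The hybrid architecture combines the advantages of the selective state space and quadratic self-attention mechanisms.

03. Position information injection connects the state space model with attention and improves the handling of long-term dependencies.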

Abstract

Recent research has shown that combining Mamba, which uses a selective state space, with the Transformer, which uses a quadratic self-attention mechanism, outperforms either architecture alone in language modeling tasks. Quadratic self-attention effectively alleviates the weakness of the selective state space in handling long-term dependencies between arbitrary elements of the sequence. We propose a position information injection method that connects the selective state space model with quadratic attention, and we integrate the two architectures through hybrid experts with cross-sharing domains, so that the model benefits from the advantages of both. We design a new architecture based on a more biomimetic idea, Observer-Thinker-Conceiver-Expresser (OTCE), which at a small scale can compete with well-known medium-scale open-source language models in language modeling tasks.
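The abstract names three ingredients: a state space path, quadratic self-attention, and experts that share a common domain. The NumPy sketch below is only a toy illustration of how such pieces could be stacked, not the authors' implementation. All names (`linear_recurrence`, `causal_attention`, `shared_plus_routed_experts`, `toy_hybrid_block`, `init_params`) are hypothetical. A fixed scalar decay stands in for Mamba's input-dependent selective scan, "cross-sharing domains" is assumed to mean one expert shared by all tokens plus a top-1 routed private expert, and the position information injection method is omitted because the abstract does not describe it.

```python
import numpy as np

def linear_recurrence(x, decay):
    """Toy stand-in for a state space layer: h_t = decay * h_{t-1} + x_t."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

def causal_attention(x, wq, wk, wv):
    """Single-head causal softmax attention; the score matrix is (seq, seq)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)          # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def shared_plus_routed_experts(x, w_shared, w_experts, w_router):
    """Each token uses the shared expert plus its top-1 routed private expert."""
    choice = np.argmax(x @ w_router, axis=-1)           # (seq,) expert index
    routed = np.einsum("sd,sde->se", x, w_experts[choice])
    return np.tanh(x @ w_shared) + np.tanh(routed)

def toy_hybrid_block(x, p):
    """Recurrence, then attention, then experts, each with a residual connection."""
    x = x + linear_recurrence(x, p["decay"])
    x = x + causal_attention(x, p["wq"], p["wk"], p["wv"])
    return x + shared_plus_routed_experts(
        x, p["w_shared"], p["w_experts"], p["w_router"]
    )

def init_params(d_model, n_experts, seed=0):
    """Small random weights; sizes and scale are arbitrary illustration values."""
    rng = np.random.default_rng(seed)
    mat = lambda *shape: rng.normal(scale=0.1, size=shape)
    return {
        "decay": 0.9,
        "wq": mat(d_model, d_model),
        "wk": mat(d_model, d_model),
        "wv": mat(d_model, d_model),
        "w_shared": mat(d_model, d_model),
        "w_experts": mat(n_experts, d_model, d_model),
        "w_router": mat(d_model, n_experts),
    }

# Usage: a sequence of 6 tokens with 8 features each.
params = init_params(d_model=8, n_experts=4)
tokens = np.random.default_rng(1).normal(size=(6, 8))
output = toy_hybrid_block(tokens, params)
```

The recurrence carries a fixed-size state forward in time, while attention compares every position with every earlier position, which is where the quadratic cost comes from. Placing both in one block is one plausible reading of "hybrid"; the actual layer ordering in OTCE may differ.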

Code & Models

Repositories

LoserCheems/OTCE
PyTorch · Official

Taxonomy

Topics: Expert finding and Q&A systems · Topic Modeling

Methods: Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer · Absolute Position Encodings