MaBERT: A Padding Safe Interleaved Transformer Mamba Hybrid Encoder for Efficient Extended Context Masked Language Modeling
Jinwoong Kim, Sangjin Park

TL;DR
MaBERT is a hybrid transformer model that combines global dependency modeling with efficient linear-time state updates, enabling effective long-context processing with reduced training and inference costs.
Contribution
It introduces an interleaved hybrid encoder architecture with padding-safe masking and mask-aware attention pooling for efficient long-context language modeling.
Findings
Achieves top scores on five GLUE tasks, especially CoLA and sentence inference.
Reduces training time and inference latency by over 2.3 times for extended contexts.
Demonstrates effective long-context modeling with 4096 tokens.
Abstract
Self-attention encoders such as Bidirectional Encoder Representations from Transformers (BERT) scale quadratically with sequence length, making long-context modeling expensive. Linear-time state space models, such as Mamba, are efficient; however, they show limitations in modeling global interactions and can suffer from padding-induced state contamination. We propose MaBERT, a hybrid encoder that interleaves Transformer layers for global dependency modeling with Mamba layers for linear-time state updates. This design alternates global contextual integration with fast state accumulation, enabling efficient training and inference on long inputs. To stabilize variable-length batching, we introduce padding-safe masking, which blocks state propagation through padded positions, and mask-aware attention pooling, which aggregates information only from valid tokens. On GLUE, MaBERT achieves the…
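The two batching mechanisms named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the recurrence below is a toy linear state update standing in for a Mamba layer, and the function names (`padding_safe_scan`, `mask_aware_attention_pool`), the `decay` parameter, and the scoring vector `w` are all hypothetical. It shows one plausible reading of "blocking state propagation through padded positions" (the state is carried unchanged across padding) and of pooling that "aggregates information only from valid tokens" (padded scores are set to negative infinity before the softmax).

```python
import numpy as np

def padding_safe_scan(x, mask, decay=0.9):
    """Toy recurrence h_t = decay * h_{t-1} + x_t that ignores padded steps.

    x:    (batch, seq, dim) inputs
    mask: (batch, seq), 1 for valid tokens and 0 for padding
    Assumption: this simple recurrence stands in for a state space layer.
    """
    batch, seq_len, dim = x.shape
    h = np.zeros((batch, dim), dtype=x.dtype)
    outputs = np.zeros_like(x)
    for t in range(seq_len):
        m = mask[:, t, None].astype(x.dtype)
        candidate = decay * h + x[:, t]
        # At padded positions the previous state is kept, so padding
        # values never enter the state.
        h = m * candidate + (1.0 - m) * h
        # Outputs at padded positions are zeroed.
        outputs[:, t] = m * h
    return outputs, h

def mask_aware_attention_pool(hidden, mask, w):
    """Softmax-weighted pooling over valid tokens only.

    hidden: (batch, seq, dim); mask: (batch, seq); w: (dim,) scoring vector.
    Assumes every row has at least one valid token.
    """
    scores = hidden @ w                                  # (batch, seq)
    scores = np.where(mask.astype(bool), scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                             # exp(-inf) == 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    pooled = np.einsum("bs,bsd->bd", weights, hidden)
    return pooled, weights

# Usage: a 3-token sequence, and the same sequence with 2 padded positions.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3, 4))
x_padded = np.concatenate([x, 100.0 * np.ones((1, 2, 4))], axis=1)
mask = np.array([[1, 1, 1, 0, 0]])
w = rng.normal(size=4)

_, h_ref = padding_safe_scan(x, np.ones((1, 3)))
out_padded, h_padded = padding_safe_scan(x_padded, mask)
pooled_ref, _ = mask_aware_attention_pool(x, np.ones((1, 3)), w)
pooled_padded, weights_padded = mask_aware_attention_pool(x_padded, mask, w)
```

In this sketch the padded and unpadded runs yield the same final state and the same pooled vector, which is the property the abstract attributes to the two mechanisms when batching variable-length inputs.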
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Topic Modeling · Machine Learning in Healthcare
