MaBERT: A Padding Safe Interleaved Transformer Mamba Hybrid Encoder for Efficient Extended Context Masked Language Modeling

Jinwoong Kim; Sangjin Park

arXiv:2603.03001 · cs.CL · March 4, 2026

TL;DR

MaBERT is a hybrid encoder that interleaves Transformer layers for global dependency modeling with Mamba layers for linear-time state updates, enabling long-context processing at reduced training and inference cost.

Contribution

It introduces an interleaved hybrid encoder architecture with padding-safe masking and mask-aware attention pooling for efficient long-context language modeling.
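
The abstract describes mask-aware attention pooling only as aggregating information from valid tokens. A minimal sketch of one way to do that follows, assuming a scalar score per token and a 0/1 validity mask. The function name `mask_aware_attention_pool`, the softmax form, and the list-based inputs are illustrative assumptions, not the paper's implementation.

```python
import math


def mask_aware_attention_pool(hidden, scores, mask):
    """Pool token vectors with softmax weights computed over valid tokens only.

    hidden: list of T vectors (each a list of floats)
    scores: list of T unnormalized attention scores (assumed given)
    mask:   list of T flags, 1 for a real token and 0 for padding
    """
    valid_scores = [s for s, m in zip(scores, mask) if m]
    if not valid_scores:
        raise ValueError("sequence has no valid tokens")

    # Subtract the largest valid score for numerical stability.
    peak = max(valid_scores)

    # Padded positions get weight exactly 0, whatever their score is.
    weights = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(weights)
    weights = [w / total for w in weights]

    dim = len(hidden[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return pooled, weights


# Two real tokens followed by one padded position with a large score.
hidden = [[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]]
scores = [0.0, 0.0, 50.0]
mask = [1, 1, 0]
pooled, weights = mask_aware_attention_pool(hidden, scores, mask)
```

In this toy input the padded position has both the largest score and the largest vector, yet it contributes nothing to the pooled result, because its weight is forced to zero before normalization.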

Findings

01

Achieves top scores on five GLUE tasks, especially CoLA and sentence inference.

02

Cuts training time and inference latency by a factor of more than 2.3 for extended contexts.

03

Demonstrates effective long-context modeling on inputs of 4096 tokens.

Abstract

Self-attention encoders such as Bidirectional Encoder Representations from Transformers (BERT) scale quadratically with sequence length, making long-context modeling expensive. Linear-time state space models such as Mamba are efficient, but they are limited in modeling global interactions and can suffer from padding-induced state contamination. We propose MaBERT, a hybrid encoder that interleaves Transformer layers for global dependency modeling with Mamba layers for linear-time state updates. This design alternates global contextual integration with fast state accumulation, enabling efficient training and inference on long inputs. To stabilize variable-length batching, we introduce padding-safe masking, which blocks state propagation through padded positions, and mask-aware attention pooling, which aggregates information only from valid tokens. On GLUE, MaBERT achieves the…
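
To illustrate what blocking state propagation through padded positions can mean, here is a minimal sketch using a toy scalar linear recurrence in place of a real Mamba layer. The recurrence `state = decay * state + gain * x`, the names `masked_scan` and `naive_scan`, and the choice to emit zeros at padded positions are all assumptions made for illustration. The paper's actual masking rule is not given in the text above.

```python
def naive_scan(xs, decay=0.9, gain=0.1):
    """Toy linear recurrence that updates its state on every position."""
    state = 0.0
    outputs = []
    for x in xs:
        state = decay * state + gain * x
        outputs.append(state)
    return outputs, state


def masked_scan(xs, mask, decay=0.9, gain=0.1):
    """Same recurrence, but padded positions neither read nor write the state."""
    state = 0.0
    outputs = []
    for x, m in zip(xs, mask):
        if m:
            state = decay * state + gain * x
            outputs.append(state)
        else:
            # Padding: keep the state untouched and emit a zero output.
            outputs.append(0.0)
    return outputs, state


# The same three-token sequence, once unpadded and once left-padded
# with two junk values, as happens in variable-length batches.
plain_xs = [1.0, 2.0, 3.0]
padded_xs = [9.0, 9.0, 1.0, 2.0, 3.0]
padded_mask = [0, 0, 1, 1, 1]

plain_out, plain_state = masked_scan(plain_xs, [1, 1, 1])
padded_out, padded_state = masked_scan(padded_xs, padded_mask)
_, contaminated_state = naive_scan(padded_xs)
```

With the mask applied, the padded sequence ends in the same state as the unpadded one, while the naive scan carries the junk padding values forward into its final state. That difference is the contamination the abstract refers to.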

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Topic Modeling · Machine Learning in Healthcare