TL;DR
This paper proposes recurrent language models as text embedders and introduces a vertically chunked inference strategy that makes embedding generation fast, with memory usage that stays constant once the input exceeds the chunk size. The result is a practical alternative to transformer models for processing long sequences.
Contribution
The authors propose a vertically chunked inference method for recurrent models, achieving efficient, low-memory text embeddings with competitive performance.
Findings
With the new inference strategy, the memory usage of recurrent models becomes constant in the input length once the input exceeds the vertical chunk size.
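Fine-tuned Mamba2 models perform competitively as general-purpose text embedders across a range of benchmarks, with a substantially smaller memory footprint than transformer-based counterparts.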
The inference approach is validated empirically on Mamba2, RWKV, and xLSTM models, showing consistent runtime-memory trade-offs across architectures.
Abstract
Transformer-based embedding models suffer from quadratic computational and linear memory complexity, limiting their utility for long sequences. We propose recurrent architectures as an efficient alternative, introducing a vertically chunked inference strategy that enables fast embedding generation with memory usage that becomes constant in the input length once it exceeds the vertical chunk size. By fine-tuning Mamba2 models, we demonstrate their viability as general-purpose text embedders, achieving competitive performance across a range of benchmarks while maintaining a substantially smaller memory footprint compared to transformer-based counterparts. We empirically validate the applicability of our inference strategy to Mamba2, RWKV, and xLSTM models, confirming consistent runtime-memory trade-offs across architectures and establishing recurrent models as a compelling alternative to…
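The constant-memory claim can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that "vertically chunked" means pushing one chunk of tokens through every layer before reading the next chunk, while carrying each layer's recurrent state forward. The scalar linear recurrence, the mean pooling, and all names (`run_layer`, `embed_full`, `embed_vertically_chunked`) are hypothetical stand-ins chosen for illustration.

```python
import math

def run_layer(xs, state, a, b):
    """Toy recurrent layer (assumed form): h_t = a*h_{t-1} + b*x_t, y_t = h_t + x_t."""
    ys = []
    for x in xs:
        state = a * state + b * x
        ys.append(state + x)
    return ys, state

def embed_full(tokens, layers):
    """Baseline: each layer processes the whole sequence, so activations grow with len(tokens)."""
    acts = list(tokens)
    for a, b in layers:
        acts, _ = run_layer(acts, 0.0, a, b)
    return sum(acts) / len(acts)  # mean pooling (an assumption, not from the paper)

def embed_vertically_chunked(tokens, layers, chunk_size):
    """Sketch: send one chunk through all layers, keeping only per-layer states between chunks."""
    states = [0.0] * len(layers)   # one recurrent state per layer
    total = 0.0                    # running sum for mean pooling
    peak = 0                       # largest activation buffer held at once
    for start in range(0, len(tokens), chunk_size):
        acts = tokens[start:start + chunk_size]
        for i, (a, b) in enumerate(layers):
            acts, states[i] = run_layer(acts, states[i], a, b)
        peak = max(peak, len(acts))
        total += sum(acts)
    return total / len(tokens), peak

layers = [(0.9, 0.5), (0.8, 0.3), (0.7, 0.2)]      # hypothetical (a, b) per layer
tokens = [math.sin(i) for i in range(1000)]         # stand-in for token features

full = embed_full(tokens, layers)
chunked, peak = embed_vertically_chunked(tokens, layers, chunk_size=64)
print(math.isclose(full, chunked, abs_tol=1e-9), peak)
```

In this toy setting both functions produce the same pooled value up to floating-point rounding, but the chunked version never holds more than `chunk_size` activations at a time, regardless of sequence length. Real recurrent layers carry much larger states and use parallel scans within a chunk, which is where the runtime-memory trade-off in the choice of chunk size comes from.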
