Seq vs Seq: An Open Suite of Paired Encoders and Decoders
Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, Benjamin Van Durme

TL;DR
This paper introduces an open suite of paired encoder-only and decoder-only models, ranging from 17 million to 1 billion parameters and trained on up to 2 trillion tokens. It reports state-of-the-art performance for their sizes, analyzes which architecture suits which tasks, and releases open resources for future research.
Contribution
It presents the Ettin suite of paired encoder-only and decoder-only models trained under a unified recipe, reporting state-of-the-art results for their sizes and analyzing cross-architecture task performance.
Findings
Encoder models excel at classification and retrieval.
Decoder models excel at generative tasks.
Cross-architecture adaptation via continued training is suboptimal.
Abstract
The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models…
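The architectural contrast the abstract draws can be illustrated with a toy sketch. The function names, the 30% masking rate, and the `[MASK]` placeholder below are illustrative assumptions, not details taken from the paper, and practical masked-language-model recipes usually add refinements (such as occasionally keeping or randomizing a selected token) that are omitted here.

```python
import random

def causal_mask(n):
    """Decoder-style mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]

def clm_example(tokens):
    """Causal LM objective: the label at step t is the token at step t+1."""
    return tokens[:-1], tokens[1:]

def mlm_example(tokens, mask_rate=0.3, mask_token="[MASK]", seed=0):
    """Masked LM objective: hide a random subset and score only those positions.

    mask_rate and mask_token are assumed values chosen for illustration.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            labels.append(tok)      # scored position
        else:
            inputs.append(tok)
            labels.append(None)     # ignored by the loss
    return inputs, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
clm_inputs, clm_labels = clm_example(tokens)
mlm_inputs, mlm_labels = mlm_example(tokens)
```

Under this simplification, adapting a model across architectures by continued training amounts to swapping both the attention mask and the objective at once.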
Peer Reviews
Decision: ICLR 2026 Poster
1. The shift from decoder to encoder on LLMs is still not scientifically clear. This paper presents paired encoder and decoder models for people to study and has done some initial exploration. 2. The paper is fully open source in terms of weights and training pipeline.
1. There is research on turning decoder-only models into encoders, such as [1], which the authors may need to take into account when claiming “encoder models are better at classification”. The possibility that “with proper finetuning, a decoder-only model outperforms an encoder-only model on non-generative tasks” cannot be ruled out by the current set of experiments. Therefore some statements could be premature or misleading. 2. Authors’ efforts in open-sourcing large-scale encoder models (compared to other encoder-
- A large set of open models trained on open data makes it possible to make comparisons that were previously only approximate (e.g. comparing encoders and decoders of different sizes, or of similar sizes but with otherwise very different training conditions). - The findings are interesting. It is especially interesting to see that decoders can't be easily "converted" into equally strong encoders via continued pre-training, and vice versa. - The paper is generally presented well. It provides g
- The paper should provide more detail about how the hyperparameters were chosen. For example, it is not obvious to me that using identical hyperparameters for encoders and decoders is the best choice. Ideally there would be some tuning done and a study of the sensitivity to hyperparameter choices. Could the results be improved with different choices of hyperparameters? Could they change enough to modify the findings? - The results are given without any analysis into why the findings are
1. An apples-to-apples suite of encoder and decoder models trained with identical recipes and data. This eliminates confounding factors present in previous studies. 2. All model weights, checkpoints, and training data orders are released. This enables reproducibility and further analysis. 3. Both encoder and decoder models achieve state-of-the-art performance for their size, outperforming ModernBERT (encoder) and SmolLM2/LLaMA 3.2 1B (decoder) baselines. 4. Demonstrates that continued training of
1. Cross-architecture training that adapts a decoder into an encoder often applies masked language modeling. Would it also help to add masked next token prediction or unsupervised contrastive learning, as in [1]? 2. The evaluation is centered on GLUE, MTEB, and knowledge-based benchmarks. Can you also add math or coding benchmarks to assess the reasoning abilities of the models? References: [1] BehnamGhader et al. LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. COLM 2024.
Code & Models
- 🤗 jhu-clsp/ettin-encoder-150m · model · 19k dl · ♡ 10
- 🤗 jhu-clsp/ettin-encoder-400m · model · 257 dl · ♡ 11
- 🤗 jhu-clsp/ettin-encoder-68m · model · 9.7k dl · ♡ 4
- 🤗 jhu-clsp/ettin-encoder-32m · model · 833 dl · ♡ 11
- 🤗 jhu-clsp/ettin-encoder-17m · model · 16k dl · ♡ 15
- 🤗 jhu-clsp/ettin-checkpoints · model
- 🤗 jhu-clsp/ettin-enc-from-dec-32m · model · 33 dl
- 🤗 jhu-clsp/ettin-enc-from-dec-68m · model · 1 dl
- 🤗 jhu-clsp/ettin-enc-from-dec-150m · model
- 🤗 jhu-clsp/ettin-enc-from-dec-400m · model · 3 dl · ♡ 1
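As a minimal sketch of how one of the listed encoders might be loaded, the snippet below maps size labels to the repository IDs from the list above and wraps the Hugging Face `transformers` auto classes. The `ETTIN_ENCODERS` table and the `load_encoder` helper are hypothetical names introduced here. It is also an assumption that these checkpoints load through `AutoModelForMaskedLM` with a recent `transformers` release; the model cards are the authority on the supported loading path.

```python
# Repository IDs copied from the list above; the dictionary itself is illustrative.
ETTIN_ENCODERS = {
    "17m": "jhu-clsp/ettin-encoder-17m",
    "32m": "jhu-clsp/ettin-encoder-32m",
    "68m": "jhu-clsp/ettin-encoder-68m",
    "150m": "jhu-clsp/ettin-encoder-150m",
    "400m": "jhu-clsp/ettin-encoder-400m",
}

def load_encoder(size="17m"):
    """Fetch the tokenizer and masked-LM weights for one listed encoder.

    Hypothetical helper: requires network access and the transformers package.
    """
    # Imported lazily so the lookup table works without transformers installed.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    repo_id = ETTIN_ENCODERS[size]
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # Assumption: the checkpoint exposes a masked-LM head via the auto class.
    model = AutoModelForMaskedLM.from_pretrained(repo_id)
    return tokenizer, model
```

Calling `load_encoder("17m")` would download the smallest listed encoder; the function is left uncalled here so the snippet runs offline.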
Taxonomy
Topics: Machine Learning and Data Classification · Scientific Computing and Data Management
Methods: LLaMA
