Seq vs Seq: An Open Suite of Paired Encoders and Decoders

Orion Weller; Kathryn Ricci; Marc Marone; Antoine Chaffin; Dawn Lawrie; Benjamin Van Durme

arXiv:2507.11412·cs.CL·March 13, 2026

Open Access · 1 Repo · 10 Models · 4 Datasets · 3 Reviews

TL;DR

This paper introduces Ettin, an open suite of paired encoder-only and decoder-only models (17 million to 1 billion parameters, trained on up to 2 trillion tokens) built with a shared recipe. The models reach state-of-the-art results for their sizes, and the paper analyzes which architecture suits which tasks while providing open resources for future research.

Contribution

It presents the Ettin suite of paired encoder-only and decoder-only models trained under a unified recipe, reports state-of-the-art results for their sizes, and analyzes task performance across the two architectures.

Findings

01

Encoder models excel at classification and retrieval.

02

Decoder models excel at generative tasks.

03

Cross-architecture adaptation via continued training is suboptimal.
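
Encoder-only models are typically trained with masked language modeling, and decoder-only models with next-token prediction. As a minimal sketch (not the Ettin training code), the following contrasts how training targets are usually built for the two objectives. The `MASK` token, the `IGNORE` marker, the function names, and the 30% mask rate are illustrative assumptions, not values taken from the paper.

```python
import random

MASK = "[MASK]"   # hypothetical mask token, for illustration only
IGNORE = None     # hypothetical marker: position does not contribute to the loss


def causal_lm_example(tokens):
    """Next-token prediction: the input at position i is paired with token i + 1."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return inputs, targets


def masked_lm_example(tokens, mask_rate=0.3, seed=0):
    """Masked language modeling: hide a random subset of tokens, predict only those."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(MASK)     # the model sees a mask at this position
            targets.append(tok)     # and is trained to recover the original token
        else:
            inputs.append(tok)      # the token stays visible
            targets.append(IGNORE)  # and is excluded from the loss
    return inputs, targets


if __name__ == "__main__":
    sentence = ["the", "cat", "sat", "on", "the", "mat"]
    print(causal_lm_example(sentence))
    print(masked_lm_example(sentence))
```

Under this framing, adapting a model across architectures by continued training means switching from one target construction to the other partway through training.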

Abstract

The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models…
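
The architectural contrast in the abstract can be illustrated with attention masks: decoder-only models typically use causal attention, where each position sees only itself and earlier positions, while encoder-only models typically use bidirectional attention over the whole sequence. The sketch below is a generic illustration under that assumption; the function names are hypothetical and nothing here is taken from the Ettin code.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as rows of 1/0 for quick inspection."""
    return "\n".join(" ".join("1" if cell else "0" for cell in row) for row in mask)


if __name__ == "__main__":
    print("causal (decoder-style):")
    print(show(causal_mask(4)))
    print("bidirectional (encoder-style):")
    print(show(bidirectional_mask(4)))
```

The causal mask is lower-triangular, which is what allows left-to-right text generation; the full mask lets every token representation draw on context from both sides.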

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The shift from decoder to encoder on LLMs is still not scientifically clear. This paper presents paired encoder and decoder models for people to study and has done some initial exploration.
2. The paper is fully open source in terms of weights and training pipeline.

Weaknesses

1. There is research on turning decoder-only models into encoders, such as [1], which the authors may need to take into account when claiming “encoder models are better at classification”. The possibility that “with proper finetuning, decoder-only model outperforms encoder-only model on non-generative tasks” cannot be ruled out by the current set of experiments. Therefore some statements could be premature or misleading.
2. Authors’ efforts in opensourcing large-scale encoder models (compared to other encoder-

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- A large set of open models trained on open data makes it possible to make comparisons that were previously only approximate (e.g. comparing encoders and decoders of different sizes, or of similar sizes but with otherwise very different training conditions).
- The findings are interesting. It is especially interesting to see that decoders can't be easily "converted" into equally strong encoders via continued pre-training, and vice versa.
- The paper is generally presented well. It provides g

Weaknesses

- The paper should provide more detail about how the hyperparameters were chosen. For example, it is not obvious to me that using identical hyperparameters for encoders and decoders is the best choice. Ideally there would be some tuning done and a study of the sensitivity to hyperparameter choices. Could the results be improved with different choices of hyperparameters? Could they change enough to modify the findings?
- The results are given without any analysis into why the findings are

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. An apples-to-apples suite of encoder and decoder models trained with identical recipes and data. This eliminates confounding factors present in previous studies.
2. All model weights, checkpoints, and training data orders are released. This enables reproducibility and further analysis.
3. Both encoder and decoder models achieve state-of-the-art performance for their size, outperforming ModernBERT (encoder) and SmolLM2/LLaMA 3.2 1B (decoder) baselines.
4. Demonstrates that continued training of

Weaknesses

1. Cross-training that adapts a decoder into an encoder often applies masked language modeling. Would it also help to add masked next token prediction or unsupervised contrastive learning, as in [1]?
2. The evaluation is centered on GLUE, MTEB, and knowledge-based benchmarks. Can you also add math or coding benchmarks to assess the reasoning abilities of the models?

References:
[1] BehnamGhader et al. LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. COLM 2024.

Code & Models

Repositories

jhu-clsp/ettin-encoder-vs-decoder
PyTorch · Official

Taxonomy

Topics: Machine Learning and Data Classification · Scientific Computing and Data Management

Methods: LLaMA