Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model
Biao Zhang, Yong Cheng, Siamak Shakeri, Xinyi Wang, Min Ma, Orhan Firat

TL;DR
This paper compares encoder-decoder and decoder-only large language models across different scales, revealing that encoder-decoder models can be competitive in performance and efficiency, challenging the current dominance of decoder-only architectures.
Contribution
It provides a comprehensive, scale-aware comparison of encoder-decoder and decoder-only LLMs, demonstrating the potential of encoder-decoder models with recent training recipes.
Findings
RedLLM shows strong scaling and extrapolation capabilities.
RedLLM achieves comparable or better downstream task performance after instruction tuning.
RedLLM offers substantially better inference efficiency than DecLLM.
Abstract
Recent large language model (LLM) research has undergone an architectural shift from encoder-decoder modeling to the now-dominant decoder-only modeling. This rapid transition, however, comes without a rigorous comparative analysis, especially from the scaling perspective, raising concerns that the potential of encoder-decoder models may have been overlooked. To fill this gap, we revisit encoder-decoder LLM (RedLLM), enhancing it with recent recipes from decoder-only LLM (DecLLM). We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, at different model scales, ranging from 150M to 8B. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM produces compelling scaling properties and surprisingly strong performance. While…
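The abstract contrasts two pretraining objectives: prefix LM for RedLLM and causal LM for DecLLM. The sketch below illustrates the general difference in how a single token sequence could be turned into a training example under each objective. It is a minimal illustration, not the paper's data pipeline: the function names, the `BOS_ID` value, and the choice of split point are all hypothetical assumptions made for this example.

```python
# Hypothetical start-of-sequence id for the decoder (an assumption of this sketch).
BOS_ID = 0


def causal_lm_example(tokens):
    """Causal LM: every token is predicted from the tokens to its left."""
    return {
        "inputs": tokens[:-1],                  # model sees tokens 0..n-2
        "targets": tokens[1:],                  # and predicts tokens 1..n-1
        "loss_mask": [1] * (len(tokens) - 1),   # loss on every position
    }


def prefix_lm_example(tokens, split):
    """Prefix LM for an encoder-decoder model.

    The prefix goes to the encoder, which can attend to it bidirectionally;
    the decoder predicts only the suffix, conditioned on the encoded prefix.
    """
    prefix, suffix = tokens[:split], tokens[split:]
    return {
        "encoder_inputs": prefix,
        "decoder_inputs": [BOS_ID] + suffix[:-1],  # suffix shifted right by one
        "decoder_targets": suffix,                 # loss only on the suffix
    }


if __name__ == "__main__":
    seq = [10, 11, 12, 13, 14, 15]
    print(causal_lm_example(seq))
    print(prefix_lm_example(seq, split=3))
```

Under these assumptions, the causal example computes a loss on every position of the sequence, while the prefix LM example computes a loss only on the suffix tokens, with the prefix serving purely as conditioning context.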
