# Leveraging Pre-trained Checkpoints for Sequence Generation Tasks

**Authors:** Sascha Rothe, Shashi Narayan, Aliaksei Severyn

arXiv: 1907.12461 · 2022-08-10

## TL;DR

This paper shows that initializing the encoder and decoder of a sequence-to-sequence model from publicly available pre-trained checkpoints such as BERT, GPT-2, and RoBERTa yields new state-of-the-art results on Machine Translation, Text Summarization, Sentence Splitting, and Sentence Fusion.

## Contribution

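The paper introduces a Transformer-based sequence-to-sequence model whose encoder and decoder can both be initialized from publicly available pre-trained BERT, GPT-2, and RoBERTa checkpoints, and it reports an extensive empirical study of how useful that initialization is for sequence generation.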
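To make the idea of warm-starting concrete, the following is a minimal sketch in plain Python and is not the authors' implementation. It assumes a model whose parameters are stored in a dictionary keyed by name, and a checkpoint that holds only self-attention and feed-forward weights (as an encoder-only checkpoint would). The function `warm_start`, the helper `random_init`, and all parameter names are hypothetical and chosen for illustration. Parameters with no counterpart in the checkpoint, here the decoder's cross-attention, simply keep their fresh random values.

```python
import random


def random_init(names, size=4, seed=0):
    """Create toy parameters: each name maps to a small list of random floats."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 0.02) for _ in range(size)] for name in names}


def warm_start(model_params, checkpoint, prefix):
    """Copy checkpoint weights into the parameters under `prefix`.

    Returns the sorted names under `prefix` that were NOT found in the
    checkpoint and therefore keep their random initialization.
    """
    copied = set()
    for name, values in checkpoint.items():
        target = f"{prefix}.{name}"
        # Only copy when the target exists and the shapes (lengths) agree.
        if target in model_params and len(model_params[target]) == len(values):
            model_params[target] = list(values)  # copy, so weights are not shared
            copied.add(target)
    return sorted(
        name
        for name in model_params
        if name.startswith(prefix + ".") and name not in copied
    )


# Hypothetical encoder-only checkpoint: no cross-attention weights.
checkpoint = {
    "layer0.self_attention": [0.1, 0.2, 0.3, 0.4],
    "layer0.feed_forward": [0.5, 0.6, 0.7, 0.8],
}

# Hypothetical one-layer encoder-decoder model, randomly initialized.
model = random_init([
    "encoder.layer0.self_attention",
    "encoder.layer0.feed_forward",
    "decoder.layer0.self_attention",
    "decoder.layer0.cross_attention",
    "decoder.layer0.feed_forward",
])

encoder_fresh = warm_start(model, checkpoint, "encoder")
decoder_fresh = warm_start(model, checkpoint, "decoder")

print(encoder_fresh)  # []
print(decoder_fresh)  # ['decoder.layer0.cross_attention']
```

In this sketch the same checkpoint initializes both sides of the model, and the returned lists make it explicit which parameters still have to be learned from scratch during fine-tuning.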

## Key findings

- Reports new state-of-the-art results on Machine Translation.
- Reports new state-of-the-art results on Text Summarization.
- Reports new state-of-the-art results on Sentence Splitting and Sentence Fusion.

## Abstract

Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. By warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT, GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the utility of initializing our model, both encoder and decoder, with these checkpoints. Our models result in new state-of-the-art results on Machine Translation, Text Summarization, Sentence Splitting, and Sentence Fusion.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.12461/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/1907.12461/full.md

## References

57 references — full list in the complete paper: https://tomesphere.com/paper/1907.12461/full.md

---
Source: https://tomesphere.com/paper/1907.12461