# Scheduled Sampling for Transformers

**Authors:** Tsvetomila Mihaylova, André F. T. Martins

arXiv: 1906.07651 · 2019-06-28

## TL;DR

This paper adapts scheduled sampling, a technique for reducing exposure bias, to Transformer models through a two-pass decoding strategy. Experiments on two language pairs reach performance close to a teacher-forcing baseline.

## Contribution

It proposes structural changes that make scheduled sampling applicable to Transformers. This was previously not straightforward because, unlike an RNN, the Transformer decoder attends to the full sentence generated so far rather than only to the last word.

## Key findings

- Performance close to a teacher-forcing baseline on two language pairs
- Two-pass decoding allows scheduled sampling to be applied to the Transformer architecture
- The authors consider the technique promising for further exploration

## Abstract

Scheduled sampling is a technique for avoiding one of the known problems in sequence-to-sequence generation: exposure bias. It consists of feeding the model a mix of the teacher forced embeddings and the model predictions from the previous step in training time. The technique has been used for improving the model performance with recurrent neural networks (RNN). In the Transformer model, unlike the RNN, the generation of a new word attends to the full sentence generated so far, not only to the last word, and it is not straightforward to apply the scheduled sampling technique. We propose some structural changes to allow scheduled sampling to be applied to Transformer architecture, via a two-pass decoding strategy. Experiments on two language pairs achieve performance close to a teacher-forcing baseline and show that this technique is promising for further exploration.
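The two-pass idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how predictions are chosen or mixed, so the sketch assumes argmax predictions and an independent per-position coin flip with probability `sample_prob`. The names `toy_decoder` and `two_pass_inputs`, the token ids, and the prefix-averaging "decoder" (a stand-in for a causally masked Transformer decoder) are all hypothetical.

```python
import numpy as np

def toy_decoder(input_emb, w_out):
    """Hypothetical stand-in for a causal decoder: position t sees inputs 0..t."""
    steps = np.arange(1, input_emb.shape[0] + 1)[:, None]
    prefix_mean = np.cumsum(input_emb, axis=0) / steps  # causal prefix average
    return prefix_mean @ w_out  # logits with shape (seq_len, vocab)

def two_pass_inputs(gold_in, embed, w_out, sample_prob, rng):
    """Build the mixed decoder inputs that the second pass will consume."""
    # Pass 1: teacher-forced decoding on the gold inputs.
    logits = toy_decoder(embed[gold_in], w_out)
    pred = logits.argmax(axis=-1)  # pred[t] is the guess for the token fed at t + 1
    pred_in = np.concatenate([gold_in[:1], pred[:-1]])  # shift right, keep start token
    # Assumption: each position independently uses the prediction with sample_prob.
    use_pred = rng.random(gold_in.shape[0]) < sample_prob
    use_pred[0] = False  # the start token always stays gold
    mixed_in = np.where(use_pred, pred_in, gold_in)
    return mixed_in, use_pred

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the targets under softmax(logits)."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(targets.shape[0]), targets].mean()

rng = np.random.default_rng(0)
vocab, dim = 11, 8
embed = rng.normal(size=(vocab, dim))   # toy embedding table
w_out = rng.normal(size=(dim, vocab))   # toy output projection
gold_in = np.array([0, 4, 7, 2, 9])     # decoder inputs (start token id 0, assumed)
targets = np.array([4, 7, 2, 9, 1])     # next-token targets (end token id 1, assumed)

mixed_in, use_pred = two_pass_inputs(gold_in, embed, w_out, 0.5, rng)
# Pass 2: decode again on the mixed inputs; the training loss comes from this pass.
loss = cross_entropy(toy_decoder(embed[mixed_in], w_out), targets)
```

With `sample_prob = 0.0` the second pass sees only gold inputs, which in this sketch is plain teacher forcing; raising it exposes the decoder to more of its own predictions during training.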

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.07651/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/1906.07651/full.md

## References

17 references — full list in the complete paper: https://tomesphere.com/paper/1906.07651/full.md

---
Source: https://tomesphere.com/paper/1906.07651