# Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP tasks

**Authors:** Steffen Eger, Paul Youssef, Iryna Gurevych

arXiv: 1901.02671 · 2019-01-10

## TL;DR

This paper compares 21 activation functions across eight NLP tasks. It finds that the little-known penalized tanh performs most stably across all tasks, and shows that it can replace the sigmoid and tanh gates in LSTM cells, yielding a 2 percentage point improvement on a challenging NLP task.

## Contribution

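The paper presents the first large-scale comparison of activation functions in NLP, covering 21 functions and eight tasks, where prior work typically compared a new function against few competitors (usually ReLU) on few tasks (usually image classification). It identifies the penalized tanh as the most stable choice and shows that using it inside LSTM cells improves performance.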

## Key findings

- Penalized tanh performs most stably across all eight NLP tasks among the 21 activation functions compared.
- Replacing the sigmoid and tanh gates in LSTM cells with penalized tanh improves performance by 2 percentage points over the standard choices on a challenging NLP task.
- Most existing activation functions are less stable than penalized tanh in NLP applications.
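
This summary does not define the functions it names, so the following is a minimal sketch of three of them as they are commonly defined in the literature: ReLU, swish, and penalized tanh. The `beta` parameter of `swish` and the slope `a = 0.25` of `penalized_tanh` are assumptions taken from the usual definitions, not values stated on this page.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relu(x: float) -> float:
    """ReLU: max(0, x)."""
    return max(0.0, x)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish: x * sigmoid(beta * x). beta = 1.0 is an assumed default."""
    return x * sigmoid(beta * x)


def penalized_tanh(x: float, a: float = 0.25) -> float:
    """Penalized tanh: tanh(x) for x > 0, a * tanh(x) otherwise.

    a = 0.25 is the commonly cited slope (assumption, not from this page).
    """
    t = math.tanh(x)
    return t if x > 0 else a * t


if __name__ == "__main__":
    # Print the three functions side by side on a few sample inputs.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  relu={relu(x):+.4f}  "
              f"swish={swish(x):+.4f}  ptanh={penalized_tanh(x):+.4f}")
```

Under this definition penalized tanh is bounded, with outputs in (-0.25, 1), and it passes a scaled-down signal for negative inputs where ReLU outputs exactly zero.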

## Abstract

Activation functions play a crucial role in neural networks because they are the nonlinearities which have been attributed to the success story of deep learning. One of the currently most popular activation functions is ReLU, but several competitors have recently been proposed or 'discovered', including LReLU functions and swish. While most works compare newly proposed activation functions on few tasks (usually from image classification) and against few competitors (usually ReLU), we perform the first large-scale comparison of 21 activation functions across eight different NLP tasks. We find that a largely unknown activation function performs most stably across all tasks, the so-called penalized tanh function. We also show that it can successfully replace the sigmoid and tanh gates in LSTM cells, leading to a 2 percentage point (pp) improvement over the standard choices on a challenging NLP task.
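
To make the LSTM claim concrete, here is a hedged sketch of a single-unit LSTM step in which the gate and candidate activations are pluggable. The abstract does not say exactly which slots the authors swapped or how, so the function `lstm_step`, its parameter names (`gate_act`, `cand_act`, `out_act`), and the weight layout are all hypothetical and only illustrate one way such a replacement could be wired.

```python
import math
from typing import Callable, Dict, Tuple

Activation = Callable[[float], float]
# Hypothetical weight layout: gate name -> (input weight, recurrent weight, bias)
Weights = Dict[str, Tuple[float, float, float]]


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def penalized_tanh(x: float, a: float = 0.25) -> float:
    """tanh(x) for x > 0, a * tanh(x) otherwise (a = 0.25 is assumed)."""
    t = math.tanh(x)
    return t if x > 0 else a * t


def lstm_step(
    x: float,
    h_prev: float,
    c_prev: float,
    w: Weights,
    gate_act: Activation = sigmoid,    # standard choice for i, f, o gates
    cand_act: Activation = math.tanh,  # standard choice for the candidate
    out_act: Activation = math.tanh,   # standard choice on the cell state
) -> Tuple[float, float]:
    """One step of a scalar (single hidden unit) LSTM cell."""

    def pre(name: str) -> float:
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    i = gate_act(pre("i"))  # input gate
    f = gate_act(pre("f"))  # forget gate
    o = gate_act(pre("o"))  # output gate
    g = cand_act(pre("g"))  # candidate cell update
    c = f * c_prev + i * g
    h = o * out_act(c)
    return h, c
```

A call such as `lstm_step(x, h, c, w, gate_act=penalized_tanh, cand_act=penalized_tanh)` swaps the activations while leaving the cell equations unchanged. Note that gate values then lie in (-0.25, 1) rather than (0, 1), so a gate can slightly invert the signal it controls.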

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1901.02671/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/1901.02671/full.md

## References

27 references — full list in the complete paper: https://tomesphere.com/paper/1901.02671/full.md

---
Source: https://tomesphere.com/paper/1901.02671