SSA: Improving Performance With a Better Scoring Function
Omar Naim, Swarnadeep Bhar, Jérôme Bolte, Nicholas Asher

TL;DR
This paper introduces SSA, a new attention scoring function that enhances transformer models' generalization and performance on NLP tasks by addressing limitations of the traditional Softmax scoring method.
Contribution
The paper proposes Scaled Signed Averaging (SSA), a novel attention scoring function that improves transformer performance and generalization on various NLP benchmarks and tasks.
Findings
SSA outperforms Softmax in transformer models on NLP benchmarks.
SSA improves in-context learning capabilities of transformers.
Transformer models with SSA show better linguistic probing results.
Abstract
While transformer models exhibit strong in-context learning (ICL) abilities, they often fail to generalize under simple distribution shifts. We analyze these failures and identify Softmax, the scoring function in the attention mechanism, as a contributing factor. We propose Scaled Signed Averaging (SSA), a novel attention scoring function that mitigates these failures. SSA significantly improves performance on our ICL tasks and outperforms transformer models with Softmax on several NLP benchmarks and linguistic probing tasks, in both decoder-only and encoder-only architectures.
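To make concrete where the scoring function sits in attention, here is a minimal sketch of single-query scaled dot-product attention with the standard Softmax baseline. It is an illustration only: the names `softmax`, `attend`, and the `scoring` parameter are hypothetical, and SSA itself is not implemented because its formula is not given in this summary. The `scoring` argument simply marks the slot where an alternative such as SSA would be substituted.

```python
import math
from typing import Callable, List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Standard Softmax: exponentiate, then normalise so the weights sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: Vector,
    keys: List[Vector],
    values: List[Vector],
    scoring: Callable[[Vector], Vector] = softmax,  # hypothetical plug-in point
) -> Vector:
    """Single-query attention with a replaceable scoring function."""
    d = len(query)
    # Scaled dot-product logits, one per key.
    logits = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    # The scoring function turns logits into attention weights.
    weights = scoring(logits)
    # Output is the weighted combination of the value vectors.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


weights = softmax([2.0, 0.0, -2.0])
output = attend(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
```

With Softmax, every weight is positive and the weights sum to one, so the output is always a convex combination of the value vectors. Any replacement scoring function would be passed through the same `scoring` slot.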
