SSA: Improving Performance With a Better Scoring Function

Omar Naim; Swarnadeep Bhar; Jérôme Bolte; Nicholas Asher

arXiv:2508.14685·cs.CL·May 12, 2026

TL;DR

This paper introduces SSA, a new attention scoring function that enhances transformer models' generalization and performance on NLP tasks by addressing limitations of the traditional Softmax scoring method.

Contribution

The paper proposes Scaled Signed Averaging (SSA), a novel attention scoring function that improves transformer performance and generalization on various NLP benchmarks and tasks.

Findings

01. SSA outperforms Softmax in transformer models on NLP benchmarks.

02. SSA improves in-context learning capabilities of transformers.

03. Transformer models with SSA show better linguistic probing results.

Abstract

While transformer models exhibit strong in-context learning (ICL) abilities, they often fail to generalize under simple distribution shifts. We analyze these failures and identify Softmax, the scoring function in the attention mechanism, as a contributing factor. We propose Scaled Signed Averaging (SSA), a novel attention scoring function that mitigates these failures. SSA significantly improves performance on our ICL tasks and outperforms transformer models with Softmax on several NLP benchmarks and linguistic probing tasks, in both decoder-only and encoder-only architectures.
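
To make concrete where the scoring function sits inside attention, here is a minimal single-query sketch in plain Python. It shows only the standard Softmax baseline that the abstract refers to; the SSA formula is not stated on this page, so it is not reproduced here. The names `attend`, `score_fn`, `Vector`, and `ScoreFn` are illustrative assumptions, not identifiers from the paper.

```python
import math
from typing import Callable, List

Vector = List[float]
ScoreFn = Callable[[Vector], Vector]


def softmax(logits: Vector) -> Vector:
    """Standard Softmax: exponentiate each logit, then normalize to sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector],
           score_fn: ScoreFn = softmax) -> Vector:
    """Single-query attention with a swappable scoring function.

    Hypothetical helper for illustration: score_fn maps raw query-key
    logits to attention weights. Softmax is the usual choice; an
    alternative scoring function would be passed in the same slot.
    """
    scale = math.sqrt(len(query))
    # Scaled dot product between the query and every key.
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = score_fn(logits)
    # Weighted combination of the value vectors.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
```

Calling `attend(query, keys, values)` uses Softmax by default; passing a different `score_fn` changes only how logits become weights, which is the component the paper proposes to replace.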
