Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models

Vijay Sadashivaiah; Georgios Dasoulas; Judith Mueller; Soumya Ghosh

arXiv:2604.27124·cs.LG·May 1, 2026

TL;DR

This paper introduces sigmoid attention as a stable, faster alternative to softmax attention for biological foundation models, demonstrating improved performance and training stability on single-cell datasets.

Contribution

It presents sigmoid attention as a theoretically grounded, empirically superior replacement for softmax in biological models, with an efficient GPU implementation.

Findings

01 Across six single-cell datasets, sigmoid attention achieves 25% higher cell-type separation than softmax attention.

02 Models with sigmoid attention train up to 10% faster than their softmax counterparts.

03 Sigmoid attention remains stable where softmax attention diverges.

Abstract

Training stable biological foundation models requires rethinking attention mechanisms. We find that using sigmoid attention as a drop-in replacement for softmax attention yields a) better learned representations: on six diverse single-cell datasets, sigmoid achieves 25% higher cell-type separation, better cell-type cohesion metrics, and lower validation loss; b) faster training: models with sigmoid attention train up to 10% faster than their softmax counterparts; and c) more stable training, by eliminating inherent sources of instability in softmax attention. We establish that sigmoid attention has globally bounded derivatives ($\leq 0.25$), as opposed to softmax, and a diagonal Jacobian structure, in contrast with softmax's dense coupling; together these properties help alleviate training instabilities. In stress tests on 160M-parameter bidirectional attention models trained without gradient…
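The derivative bound and the Jacobian structure mentioned in the abstract can be checked numerically on a single row of attention logits. The sketch below is illustrative only and is not the authors' implementation: it assumes a plain elementwise sigmoid with no bias or scaling term, and the helper names (sigmoid_jacobian, softmax_jacobian) and the random logits are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Elementwise logistic function.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Numerically stabilised softmax over a 1-D vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid_jacobian(z):
    # d sigmoid(z_i) / d z_j is zero for i != j, so the Jacobian is diagonal
    # with entries s_i * (1 - s_i), each of which is at most 0.25.
    s = sigmoid(z)
    return np.diag(s * (1.0 - s))

def softmax_jacobian(z):
    # d softmax(z)_i / d z_j = p_i * (delta_ij - p_j): every output depends
    # on every input, so the Jacobian is dense.
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

# Hypothetical row of attention logits (assumption: 8 keys, wide spread).
rng = np.random.default_rng(0)
logits = rng.normal(scale=5.0, size=8)

J_sig = sigmoid_jacobian(logits)
J_soft = softmax_jacobian(logits)

off_diag = ~np.eye(logits.size, dtype=bool)
sig_max_derivative = J_sig.max()                    # bounded by 0.25
sig_off_diag_max = np.abs(J_sig[off_diag]).max()    # exactly zero
soft_off_diag_max = np.abs(J_soft[off_diag]).max()  # nonzero coupling

print("max sigmoid derivative:", sig_max_derivative)
print("max |off-diagonal| sigmoid:", sig_off_diag_max)
print("max |off-diagonal| softmax:", soft_off_diag_max)
```

Under these assumptions, each sigmoid attention weight responds only to its own logit, whereas changing one logit under softmax shifts every weight in the row through the shared normalisation.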

Code & Models

Repositories

MSDLLCpapers/triton-sigmoid (GitHub)
