Arcee Trinity Large Technical Report

Varun Singh; Lucas Krauss; Sami Jaghouar; Matej Sirovatka; Charles Goddard; Fares Obied; Jack Min Ong; Jannik Straube; Fern; Aria Harley; Conner Stewart; Colin Kealty; Maziyar Panahi; Simon Kirsten; Anushka Deshpande; Anneketh Vij; Arthur Bresnu; Pranav Veldurthi; Raghav Ravishankar; Hardik Bishnoi; DatologyAI Team; Arcee AI Team; Prime Intellect Team; Mark McQuade; Johannes Hagemann; Lucas Atkins

arXiv:2602.17004 · cs.LG · February 20, 2026

TL;DR

This paper introduces the Arcee Trinity family of sparse Mixture-of-Experts models (Trinity Nano, Trinity Mini, and Trinity Large). The models were trained with the Muon optimizer on 10 to 17 trillion tokens without loss spikes, and their checkpoints are openly available.

Contribution

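It presents three sparse MoE models whose architecture combines interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing. For Trinity Large it introduces a new load balancing method, Soft-clamped Momentum Expert Bias Updates (SMEBU). The models were pre-trained on 10 trillion tokens (Nano and Mini) and 17 trillion tokens (Large), and their checkpoints are shared.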

Findings

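01. All three models (Trinity Nano, Trinity Mini, and Trinity Large) completed training with zero loss spikes.

02. Each model activates only a small share of its parameters per token: 1B of 6B for Trinity Nano, 3B of 26B for Trinity Mini, and 13B of 400B for Trinity Large.

03. The trained model checkpoints are openly available.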
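
The parameter-efficiency finding can be made concrete with the counts reported in the abstract. The short sketch below only does that arithmetic; the dictionary name `MODELS` and the helper `active_fraction` are illustrative names, not anything from the paper.

```python
# Parameter counts in billions (total, activated per token), as stated in the abstract.
MODELS = {
    "Trinity Nano": (6, 1),
    "Trinity Mini": (26, 3),
    "Trinity Large": (400, 13),
}


def active_fraction(total_b, active_b):
    """Return the fraction of parameters activated for each token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {active_b}B of {total_b}B parameters active ({share:.1%})")
```

The largest model is also the sparsest of the three: Trinity Large activates roughly one parameter in thirty per token, while Trinity Nano activates one in six.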

Abstract

We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. Additionally, we report on Trinity Nano and Trinity Mini: Trinity Nano has 6B total parameters with 1B activated per token, and Trinity Mini has 26B total parameters with 3B activated per token. The models' modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for Mixture-of-Experts. For Trinity Large, we also introduce a new MoE load balancing strategy titled Soft-clamped Momentum Expert Bias Updates (SMEBU). We train the models using the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model…
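
To illustrate what sigmoid routing with a per-expert bias can look like, here is a minimal generic sketch for a single token. It is not the Trinity implementation. The function name `route_token`, the choice to apply the bias only when selecting experts, and the renormalization of the gate weights are all assumptions based on common bias-based load balancing designs. The SMEBU rule for updating the bias is not described in the abstract and is not reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route_token(logits, expert_bias, top_k):
    """Select top_k experts for one token and return (indices, gate weights).

    Assumptions (not taken from the paper): the bias shifts only the
    selection scores, and gate weights are the unbiased sigmoid scores
    renormalized over the selected experts.
    """
    # Each expert gets an independent score in (0, 1), unlike softmax routing.
    scores = [sigmoid(z) for z in logits]
    # The bias nudges which experts are picked, e.g. to favor underused ones.
    biased = [s + b for s, b in zip(scores, expert_bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Gate weights ignore the bias and sum to 1 over the chosen experts.
    total = sum(scores[i] for i in chosen)
    gates = [scores[i] / total for i in chosen]
    return chosen, gates


logits = [2.0, -1.0, 0.5, 0.0]  # hypothetical router outputs for 4 experts
experts, gates = route_token(logits, [0.0, 0.0, 0.0, 0.0], top_k=2)
print(experts, gates)

# A negative bias on expert 0 and a positive bias on expert 3 changes the selection.
rebalanced, _ = route_token(logits, [-0.5, 0.0, 0.0, 0.3], top_k=2)
print(rebalanced)
```

Keeping the bias out of the gate weights means that load balancing changes which experts run without distorting how their outputs are mixed.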

Taxonomy

Topics: Quantum Chromodynamics and Particle Interactions · Computational Physics and Python Applications · High-Energy Particle Collisions Research