Apriel-H1: Towards Efficient Enterprise Reasoning Models

Oleksiy Ostapenko; Luke Kumar; Raymond Li; Denis Kocetkov; Joel Lamy-Poirier; Shruthan Radhakrishna; Soham Parikh; Shambhavi Mishra; Sebastien Paquet; Srinivas Sunkara; Valérie Bécaert; Sathwik Tejaswi Madhusudhan; Torsten Scholak

arXiv:2511.02651·cs.LG·November 5, 2025

TL;DR

This paper introduces Apriel-H1, a hybrid model combining transformers and state space models for efficient reasoning, achieving over 2x inference throughput with minimal performance loss.

Contribution

The paper presents a novel hybrid architecture that replaces parts of transformer attention with linear state space models, enabling scalable and efficient reasoning at 15B scale.

Findings

01. Hybrid models achieve over 2x inference throughput.

02. Replacing attention layers with SSMs minimally impacts reasoning quality.

03. Incremental distillation effectively combines transformer and SSM components.

Abstract

Large Language Models (LLMs) achieve remarkable reasoning capabilities through transformer architectures with attention mechanisms. However, transformers suffer from quadratic time and memory complexity in the multi-head attention (MHA) module and must cache key-value states during inference, which severely limits throughput and scalability. High inference throughput is critical for agentic tasks, long-context reasoning, efficient deployment under high request loads, and more efficient test-time compute scaling. State Space Models (SSMs) such as Mamba offer a promising alternative, with linear inference complexity and a constant memory footprint via recurrent computation with fixed-size hidden states. In this technical report, we introduce the Apriel-H1 family of hybrid LLMs that combine transformer attention and SSM sequence mixers for efficient reasoning at 15B model size. These models are…
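The memory contrast described in the abstract can be illustrated with a minimal sketch. It compares how many elements a key-value cache holds as the sequence grows against the fixed-size state of a recurrent SSM layer, and runs a toy diagonal linear recurrence to show that the state does not grow with input length. All function names and sizes here (`kv_cache_elements`, `ssm_state_elements`, `ssm_scan`, the layer counts and dimensions) are hypothetical and chosen for illustration; they are not the Apriel-H1 configuration, and the recurrence is a simplified stand-in rather than Mamba's selective scan.

```python
# Illustrative only: every size below is hypothetical, not Apriel-H1's configuration.

def kv_cache_elements(seq_len, n_layers, n_kv_heads, head_dim):
    """Elements cached by attention: one key and one value vector per token, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len


def ssm_state_elements(n_layers, d_inner, d_state):
    """Elements held by a recurrent SSM: one fixed-size state per layer, independent of length."""
    return n_layers * d_inner * d_state


def ssm_scan(xs, a, b, c):
    """Toy diagonal linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = sum(c * h_t)."""
    h = [0.0] * len(a)  # the state has len(a) entries no matter how long xs is
    ys = []
    for x in xs:
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys, h


if __name__ == "__main__":
    for seq_len in (1_000, 10_000, 100_000):
        kv = kv_cache_elements(seq_len, n_layers=4, n_kv_heads=8, head_dim=128)
        ssm = ssm_state_elements(n_layers=4, d_inner=4096, d_state=16)
        print(f"seq_len={seq_len}: kv_cache={kv} elements, ssm_state={ssm} elements")
```

In this sketch the cache count scales linearly with `seq_len`, while the SSM state count is the same for every sequence length, which is the property the abstract refers to as a constant memory footprint.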

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Machine Learning in Healthcare