KARL: Knowledge Agents via Reinforcement Learning

Jonathan D. Chang; Andrew Drozdov; Shubham Toshniwal; Owen Oertell; Alexander Trott; Jacob Portes; Abhay Gupta; Pallavi Koppol; Ashutosh Baheti; Sean Kulinski; Ivan Zhou; Irene Dea; Krista Opsahl-Ong; Simon Favreau-Lessard; Sean Owen; Jose Javier Gonzalez Ortiz; Arnav Singhvi; Xabi Andrade; Cindy Wang; Kartik Sreenivasan; Sam Havens; Jialu Liu; Peyton DeNiro; Wen Sun; Michael Bendersky; Jonathan Frankle

arXiv:2603.05218·cs.AI·March 6, 2026

TL;DR

KARL is a reinforcement learning system for training enterprise search agents that perform well across diverse search tasks. It combines a new evaluation suite (KARLBench), synthetic data generation, and multi-task training.

Contribution

The paper presents KARL, a reinforcement learning framework for enterprise search agents, together with a new evaluation suite, a synthetic data generation pipeline, and a sample-efficient training paradigm.

Findings

01. KARL outperforms Claude 4.6 and GPT 5.2 on KARLBench.

02. Models trained across heterogeneous search tasks generalize better than models optimized for a single benchmark.

03. Given sufficient compute, KARL surpasses strong closed models.

Abstract

We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work makes four core contributions. First, we introduce KARLBench, a multi-capability evaluation suite spanning six distinct search regimes, including constraint-driven entity search, cross-document report synthesis, tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over technical documentation, and fact aggregation over internal enterprise notes. Second, we show that models trained across heterogeneous search behaviors generalize substantially better than those optimized for any single benchmark. Third, we develop an agentic synthesis pipeline that employs long-horizon reasoning and tool use to generate diverse, grounded, and high-quality training data, with…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare