Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

Roger Creus Castanyer; Pablo Samuel Castro; Glen Berseth

arXiv:2605.06869 · cs.AI · May 14, 2026

TL;DR

Agentick is a benchmark for sequential decision-making agents. It offers 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, so that different kinds of agents can be compared on common ground.

Contribution

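The paper introduces a unified, multi-modal benchmark behind a single Gymnasium-compatible interface, enabling fair comparison of RL, LLM, VLM, hybrid, and human agents in sequential decision-making. It ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard.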

Findings

01 No single approach dominates across all tasks.

02 GPT-5 mini achieves the best overall performance among the evaluated agents.

03 A reasoning harness significantly boosts LLM performance.

Abstract

AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single…
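
To make the "single Gymnasium-compatible interface" concrete, here is a minimal sketch of an evaluation loop written against the Gymnasium calling convention, in which reset(seed=...) returns (observation, info) and step(action) returns (observation, reward, terminated, truncated, info). Everything named below (ToyCorridorTask, run_episode, evaluate, and the two policies) is hypothetical and defined only for illustration; none of it is taken from the Agentick codebase, and the toy task is a stand-in that uses only the standard library.

```python
import random


class ToyCorridorTask:
    """Hypothetical stand-in task (NOT part of Agentick).

    A 1-D corridor: the agent starts at cell 0 and must reach a goal cell
    whose position is generated from the seed passed to reset().
    """

    def __init__(self, size=5, max_steps=20):
        self.size = size
        self.max_steps = max_steps
        self._rng = random.Random()
        self._pos = 0
        self._goal = size - 1
        self._steps = 0

    def reset(self, seed=None):
        # Gymnasium convention: reset returns (observation, info).
        if seed is not None:
            self._rng.seed(seed)
        self._goal = self._rng.randrange(1, self.size)  # procedural layout
        self._pos = 0
        self._steps = 0
        return self._pos, {"goal": self._goal}

    def step(self, action):
        # Actions: 0 = move left, 1 = move right (clamped to the corridor).
        delta = 1 if action == 1 else -1
        self._pos = max(0, min(self.size - 1, self._pos + delta))
        self._steps += 1
        terminated = self._pos == self._goal
        truncated = (not terminated) and self._steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        # Gymnasium convention: (obs, reward, terminated, truncated, info).
        return self._pos, reward, terminated, truncated, {}


def run_episode(env, policy, seed):
    """Run one seeded episode and return the undiscounted return."""
    obs, _info = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _info = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total


def evaluate(env, policy, seeds):
    """Average return of a policy over a fixed list of seeds."""
    returns = [run_episode(env, policy, s) for s in seeds]
    return sum(returns) / len(returns)


def always_right(obs):
    return 1


def always_left(obs):
    return 0


if __name__ == "__main__":
    env = ToyCorridorTask()
    print("always_right:", evaluate(env, always_right, range(10)))
    print("always_left:", evaluate(env, always_left, range(10)))
```

Because the loop touches only reset and step, the same evaluate function could in principle score any agent that maps observations to actions, whether a learned policy or a wrapper around a language model. Fixing the seed list keeps the generated task instances identical across the agents being compared.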
