Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Roger Creus Castanyer, Pablo Samuel Castro, Glen Berseth

TL;DR
Agentick is a benchmark for evaluating RL, LLM, VLM, hybrid, and human agents on sequential decision-making through a single Gymnasium-compatible interface. It covers 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities.
Contribution
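It introduces a unified, multi-modal benchmark that ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard, enabling fair comparison of RL, LLM, VLM, hybrid, and human agents in sequential decision-making.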
Findings
No single approach dominates across all tasks.
GPT-5 mini achieves the best overall performance among the evaluated configurations.
The reasoning harness significantly boosts LLM performance.
Abstract
AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single…
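To make the "single Gymnasium-compatible interface" concrete, here is a minimal sketch of the reset/step episode loop that such an interface implies. Everything in it is an assumption for illustration: `ToyGridEnv`, `run_episode`, `oracle_policy`, and `stuck_policy` are hypothetical names, and the toy environment only mimics the Gymnasium call signatures (`reset` returning `(obs, info)`, `step` returning a 5-tuple). It is not an Agentick task and does not use the Agentick API.

```python
class ToyGridEnv:
    """Hypothetical 1-D corridor that mimics Gymnasium's reset/step signatures.

    This is an illustrative stand-in, not an Agentick environment.
    """

    def __init__(self, size: int = 5, max_steps: int = 20):
        self.size = size            # number of cells; the goal is the last cell
        self.max_steps = max_steps  # episode is truncated after this many steps
        self.pos = 0
        self.t = 0

    def reset(self, seed=None):
        # Gymnasium convention: reset returns (observation, info).
        # The seed is accepted for signature compatibility; this toy is deterministic.
        self.pos = 0
        self.t = 0
        return self.pos, {}

    def step(self, action):
        # Action 1 moves right, any other action moves left; position is clipped.
        delta = 1 if action == 1 else -1
        self.pos = max(0, min(self.size - 1, self.pos + delta))
        self.t += 1
        terminated = self.pos == self.size - 1
        truncated = (not terminated) and self.t >= self.max_steps
        reward = 1.0 if terminated else 0.0
        # Gymnasium convention: (obs, reward, terminated, truncated, info).
        return self.pos, reward, terminated, truncated, {}


def run_episode(env, policy, seed=None):
    """Roll out one episode; any agent type only needs to map obs -> action."""
    obs, info = env.reset(seed=seed)
    total_reward = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    return total_reward


def oracle_policy(obs):
    # Hypothetical reference policy for the toy task: always move right.
    return 1


def stuck_policy(obs):
    # Hypothetical weak baseline: always move left, never reaching the goal.
    return 0


if __name__ == "__main__":
    print(run_episode(ToyGridEnv(), oracle_policy, seed=0))
    print(run_episode(ToyGridEnv(), stuck_policy, seed=0))
```

Because the loop depends only on `reset` and `step`, the same `run_episode` function could in principle drive an RL policy, an LLM-backed policy, or a human-input policy, which is the kind of common ground the abstract describes.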
