AcademiClaw: When Students Set Challenges for AI Agents

Junjie Yu; Pengrui Lu; Weiye Si; Hongliang Lu; Jiabao Wu; Kaiwen Tao; Kun Wang; Lingyu Yang; Qiran Zhang; Xiuting Guo; Xuanyu Wang; Yang Wang; Yanjie Wang; Yi Yang; Zijian Hu; Ziyi Yang; Zonghan Zhou; Binghao Qiang; Borui Zhang; Chenning Li; Enchang Zhang; Feifan Chen; Feng Jian; Fengyin Sun; Hao Qiu; Hao Zheng; Haoran Zhu; Hongyu Liu; Jianbin Deng; Jiaxin Song; Jiaying Chi; Jiayou Shi; Jie Fang; Jinghui Zhong; Jingyu Zhou; Jinze Li; Junfeng Yi; Junyan Yu; Junzhi Xue; Ni Song; Pengyi Chen; Qi Chen; Quansheng Li; Rui Tao; Shenghai Gong; Shenhang Lu; Tianqi Shen; Tianxiang Zhu; Tiehan Kang; Tingyu Li; Wendi Wu; Xiao Shen; Xiao Zhou; Xiaotao Zhang; Xinrong Li; Xuankun Yang; Xun Zhang; Yan Li; Ye Lu; Yi Wang; Yibo Zhou; Yichi Zhang; Yihao Sun; Yijun Huang; Yixin Zhu; Yixuan Wu; Yuchen Sun; Yue Wu; Yuheng Sun; Yukun Li; Yutian Tu; Yuxuan Qin; Yuzhuo Wu; Zeyu Li; Zhengyu Lou; Zhenning Ran; Zizhu He; Pengfei Liu

arXiv:2605.02661·cs.AI·May 5, 2026

TL;DR

AcademiClaw is a new bilingual benchmark with 80 complex, real-world academic tasks from students, designed to evaluate and improve AI agents' capabilities across diverse domains.

Contribution

The paper introduces a comprehensive, expert-reviewed benchmark with detailed scoring and safety analysis, highlighting the limitations of current AI agents on academic tasks.

Findings

01. The best models achieve only a 55% pass rate.

02. The evaluation identifies sharp capability boundaries across domains.

03. Models exhibit divergent behavioral strategies.

Abstract

Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six…
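The abstract is cut off before it lists the rubric's components, so the paper's actual scoring scheme is not reproduced here. As a rough illustration of how a multi-dimensional rubric score and a pass rate could be computed in general, here is a minimal Python sketch. The dimension names, weights, weighted-mean aggregation, and 0.75 pass threshold are all hypothetical assumptions chosen for the example, not details taken from AcademiClaw.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    name: str      # hypothetical dimension label, not from the paper
    weight: float  # relative importance; weights need not sum to 1
    score: float   # grader-assigned score in [0, 1]


def task_score(items: list[RubricItem]) -> float:
    """Weighted mean of per-dimension scores (an assumed aggregation)."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return sum(item.weight * item.score for item in items) / total_weight


def pass_rate(scores: list[float], threshold: float) -> float:
    """Fraction of tasks whose score meets the (assumed) pass threshold."""
    if not scores:
        raise ValueError("no task scores given")
    return sum(score >= threshold for score in scores) / len(scores)


# Illustrative values only; these are not results from the paper.
example = [
    RubricItem("correctness", weight=2.0, score=0.5),
    RubricItem("completeness", weight=1.0, score=1.0),
    RubricItem("presentation", weight=1.0, score=0.0),
]
print(task_score(example))
print(pass_rate([0.9, 0.4, 0.75, 0.2], threshold=0.75))
```

Under these assumptions, a headline figure such as a pass rate is simply the share of tasks whose aggregated rubric score clears the threshold; a different aggregation or threshold would change which tasks count as passed.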

Code & Models

Repositories

GAIR-NLP/AcademiClaw (GitHub)
