The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios

Daocheng Fu; Jianbiao Mei; Rong Wu; Xuemeng Yang; Jia Xu; Ding Wang; Pinlong Cai; Yong Liu; Licheng Wen; Botian Shi

arXiv:2601.08173 · cs.AI · January 14, 2026

TL;DR

This paper introduces EvoEnv, a dynamic benchmarking environment for evaluating multi-modal large language models in realistic workplace scenarios, focusing on scheduling, exploration, and continual learning challenges.

Contribution

It presents a novel environment for assessing agent robustness in dynamic, real-world tasks, highlighting deficiencies of current models and promoting more reliable, adaptable AI systems.

Findings

1. Current agents perform poorly in dynamic, uncertain environments.

2. Active exploration reduces hallucinations and improves decision-making.

3. Continuous learning strategies enhance adaptability in evolving tasks.

Abstract

The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce EvoEnv, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, EvoEnv evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks.…
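
As a rough illustration of the first dimension only (scheduling streaming tasks with varying priorities), the idea can be sketched with a priority queue that accepts new tasks while earlier ones are still pending. This is not the benchmark's API or the authors' method: the class `StreamingScheduler`, its methods, the task names, and the "larger number means more urgent" convention are all hypothetical and chosen for the example.

```python
import heapq
from itertools import count


class StreamingScheduler:
    """Toy scheduler (hypothetical): always serves the most urgent pending task."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker: earlier arrivals go first

    def submit(self, name, priority):
        # heapq is a min-heap, so the priority is negated to pop the
        # largest priority first (assumption: larger = more urgent).
        heapq.heappush(self._heap, (-priority, next(self._arrival), name))

    def next_task(self):
        # Return the name of the most urgent pending task, or None if idle.
        if not self._heap:
            return None
        _, _, name = heapq.heappop(self._heap)
        return name


def run_demo():
    scheduler = StreamingScheduler()
    scheduler.submit("file expense report", 1)
    scheduler.submit("reply to manager", 3)

    order = [scheduler.next_task()]  # the agent starts on the most urgent task

    # New tasks stream in while older ones are still waiting.
    scheduler.submit("fix production outage", 5)
    scheduler.submit("book meeting room", 1)

    while (task := scheduler.next_task()) is not None:
        order.append(task)
    return order


if __name__ == "__main__":
    for step, task in enumerate(run_demo(), start=1):
        print(step, task)
```

In this sketch the late-arriving outage task overtakes the older low-priority work, and equal-priority tasks are served in arrival order. A context-aware scheduler of the kind the abstract describes would presumably weigh more than a single static priority number.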

Taxonomy

Topics: Multimodal Machine Learning Applications · Machine Learning and Algorithms · Explainable Artificial Intelligence (XAI)