Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge
Dhaval Patel, Chathurangi Shyalika, Suryanarayana Reddy Yarrabothula, Ling Yue, Shuxin Lin, Nianjun Zhou, James Rayfield

TL;DR
This paper analyzes the CODS 2025 AssetOpsBench challenge, revealing insights about leaderboard saturation, evaluation effects, scoring contributions, team dynamics, and successful strategies in industrial multi-agent orchestration competitions.
Contribution
It provides a comprehensive retrospective analysis of the challenge, highlighting key findings about evaluation metrics, team behavior, and scoring impacts to inform future competition design.
Findings
The public planning leaderboard saturates at 72.73%, and richer prompts do not improve that peak.
Hidden evaluation changes how performance should be interpreted: public and private scores correlate moderately in planning but negatively in execution.
The match{} term has minimal impact on overall scores but still affects the rankings of top teams.
Abstract
Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 AssetOpsBench-Live challenge, a privacy-aware Codabench competition on industrial multi-agent orchestration built on AssetOpsBench. We combine final rank sheets, a 300-submission server log, 149-team registrations, best-submission exports, the organizer winners report, the companion AssetOpsBench-Live system paper, and verified planning-track source trees. Five results stand out. First, the public planning leaderboard saturates at 72.73%, and richer prompts do not improve that peak. Second, hidden evaluation changes the story: public and private scores correlate moderately in planning but negatively in execution, with several 45.45% public execution systems reaching 63.64%…
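The public-versus-private comparison described in the abstract can be illustrated with a short sketch. The abstract does not say which correlation statistic the authors used, so Pearson correlation is an assumption here, and the score lists are hypothetical values chosen only to show a negative relationship. They are not the challenge data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Happens when every team ties on one leaderboard (full saturation).
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-team scores in percent (illustration only, NOT real results).
public_scores = [45.45, 45.45, 54.55, 63.64, 72.73]
private_scores = [63.64, 54.55, 54.55, 45.45, 36.36]

r = pearson(public_scores, private_scores)
print(f"public vs. private correlation: {r:.2f}")
```

A negative value of `r` means that teams ranked higher on the public leaderboard tended to rank lower on the hidden set, which is the pattern the abstract reports for the execution track. The constant-list guard matters for saturated leaderboards: if every team ties at the same public score, the correlation is undefined rather than zero.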
