Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge

Dhaval Patel; Chathurangi Shyalika; Suryanarayana Reddy Yarrabothula; Ling Yue; Shuxin Lin; Nianjun Zhou; James Rayfield

arXiv:2605.08518 · cs.AI · May 12, 2026

TL;DR

This paper analyzes the CODS 2025 AssetOpsBench challenge, revealing insights about leaderboard saturation, evaluation effects, scoring contributions, team dynamics, and successful strategies in industrial multi-agent orchestration competitions.

Contribution

It provides a comprehensive retrospective analysis of the challenge, highlighting key findings about evaluation metrics, team behavior, and scoring impacts to inform future competition design.

Findings

01

The public planning leaderboard saturates at 72.73%, and richer prompts do not improve that peak.

02

Hidden evaluation changes how results should be read: public and private scores correlate moderately in planning (r = 0.69) but negatively in execution (r = -0.13).

03

The match{} term has minimal impact on overall scores, affecting top team rankings.

Abstract

Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 \assetopslive{} challenge, a privacy-aware Codabench competition on industrial multi-agent orchestration built on \assetops{}. We combine final rank sheets, a 300-submission server log, 149-team registrations, best-submission exports, the organizer winners report, the companion \assetopslive{} system paper, and verified planning-track source trees. Five results stand out. First, the public planning leaderboard saturates at 72.73%, and richer prompts do not improve that peak. Second, hidden evaluation changes the story: public and private scores correlate moderately in planning ($r = 0.69$) but negatively in execution ($r = -0.13$), with several 45.45% public execution systems reaching 63.64%…
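The public-versus-private comparison in the abstract is reported as a correlation coefficient. The sketch below shows how such a figure could be computed, assuming r denotes Pearson's correlation (the abstract does not name the estimator). The function name `pearson_r` and the score lists are hypothetical and invented for illustration; they are not the challenge data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-team scores in percent (NOT taken from the paper).
public_scores = [72.73, 63.64, 54.55, 45.45, 45.45]
private_scores = [54.55, 45.45, 63.64, 63.64, 36.36]

print(f"public vs. private r = {pearson_r(public_scores, private_scores):.2f}")
```

A value near zero or below, as reported for the execution track, would mean that public rank carries little information about hidden-set rank, which is why the two tracks are worth analyzing separately.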
