Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence
Fiona Y. Wong, Markus J. Buehler

TL;DR
This paper introduces a cross-domain benchmark for evaluating when coordinated AI agents improve scientific inference from partial evidence, and it identifies the regimes in which coordination helps and those in which it does not.
Contribution
It presents a new benchmark spanning four scientific tasks to systematically assess the benefits of AI coordination in scientific inference.
Findings
Cross-channel composites improve over single-channel baselines when different disciplines each capture only part of the phenomenon.
Coordination mainly improves interpretation and traceability when one signal dominates.
In molecular sonification, the gain is in representation, not predictive performance.
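To make the first finding concrete, here is a minimal sketch of how a composite score can beat each single channel on AUROC when the channels capture different parts of the signal. The data below are a hand-made toy panel, not taken from the paper, and the names `auroc`, `channel_a`, `channel_b`, and the summed `composite` are hypothetical illustrations rather than the authors' scoring protocol.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores
    a random negative, with ties counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy panel (assumption, for illustration only): 4 positives, then 4 negatives.
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Each hypothetical channel flags a different half of the positives.
channel_a = [1, 1, 0, 0, 0, 0, 0, 0]
channel_b = [0, 0, 1, 1, 0, 0, 0, 0]

# A simple additive composite; the paper's composites may be built differently.
composite = [a + b for a, b in zip(channel_a, channel_b)]

print("channel A :", auroc(channel_a, labels))
print("channel B :", auroc(channel_b, labels))
print("composite :", auroc(composite, labels))
```

On this toy panel each channel alone scores 0.75 and the composite scores 1.0, because every positive is flagged by exactly one channel. When a single channel already separates the classes, the same comparison shows no gain, which matches the regime described in the second finding.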
Abstract
Scientific evidence often spans instruments, databases, and disciplines, so no single source records the full phenomenon. This makes it difficult to determine when coordinated AI agents add value over simpler scientific workflows. We evaluate this question with a cross-domain benchmark spanning four scientific tasks: mapping molecular structure into musical representations, detecting historical paradigm shifts in science, identifying vector-borne disease emergence, and vetting transiting-exoplanet candidates. Each case uses a frozen evaluation panel, predefined scoring protocols, explicit baselines, ablations or null controls, and stated limitations. The results define three operating regimes. When different disciplines each capture only part of the phenomenon, cross-channel composites improve over single-channel baselines: climate-vector emergence reaches AUROC 0.944 and exoplanet…
