Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows
Shivam Rawat, Lucie Flek

TL;DR
This study evaluates agentic AI in astrophysical workflows. Performance is strong on well-defined tasks, but silent failures and confidently incorrect results pose significant risks, especially on tasks that require complex reasoning.
Contribution
The paper provides a systematic evaluation of agentic AI systems in astrophysics, highlighting failure modes and releasing an evaluation framework for reliability analysis.
Findings
Performance improves ~6x with domain context in one-shot tasks.
Silent failures often produce plausible but incorrect results.
Performance degrades on complex reasoning tasks, where errors are often reported with confidence.
Abstract
Agentic AI systems are increasingly being integrated into scientific workflows, yet their behavior under realistic conditions remains insufficiently understood. We evaluate CMBAgent across two workflow paradigms and eighteen astrophysical tasks. In the One-Shot setting, access to domain-specific context yields an approximately 6x performance improvement (0.85 vs. ~0 without context), with the primary failure mode being silent incorrect computation: syntactically valid code that produces plausible but inaccurate results. In the Deep Research setting, the system frequently exhibits silent failures across stress tests, producing physically inconsistent posteriors without self-diagnosis. Overall, performance is strong on well-specified tasks but degrades on problems designed to probe reasoning limits, often without visible error signals. These findings highlight that the most concerning…
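To make the notion of a silent incorrect computation concrete, here is a minimal Python sketch. It is an illustration only, not the paper's evaluation framework or any part of CMBAgent: the function names, the 5% tolerance, and the example bug (using H0 = 100 km/s/Mpc where 70 was intended) are all assumptions chosen for the example. The point is that the buggy function runs without error and returns a finite, plausible-looking distance, so only a comparison against an independent reference reveals the problem.

```python
import math

C_KM_S = 299_792.458  # speed of light in km/s


def hubble_distance_mpc(z: float, h0: float = 70.0) -> float:
    """Low-redshift Hubble-law distance d = c * z / H0, in Mpc."""
    return C_KM_S * z / h0


def buggy_hubble_distance_mpc(z: float) -> float:
    """Hypothetical agent output: valid code, but H0 is silently set to 100."""
    return C_KM_S * z / 100.0


def is_silent_failure(value: float, reference: float, rtol: float = 0.05) -> bool:
    """True if value is finite (no visible error) yet deviates from the
    reference by more than rtol. The tolerance is an assumed threshold."""
    if not math.isfinite(value):
        return False  # NaN or inf is a visible failure, not a silent one
    return abs(value - reference) > rtol * abs(reference)


reference = hubble_distance_mpc(0.01)       # about 42.8 Mpc
candidate = buggy_hubble_distance_mpc(0.01)  # about 30.0 Mpc, looks plausible
print(is_silent_failure(candidate, reference))
print(is_silent_failure(reference, reference))
```

In this sketch a check of this kind needs a trusted reference value, which is exactly what is missing when a system is asked to self-diagnose its own output.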
