Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows

Shivam Rawat; Lucie Flek

arXiv:2604.25345·cs.AI·April 29, 2026

TL;DR

This study evaluates agentic AI in astrophysical workflows, revealing that while performance is strong on well-defined tasks, silent failures and confident incorrect results pose significant risks, especially under complex reasoning conditions.

Contribution

The paper provides a systematic evaluation of agentic AI systems in astrophysics, highlighting failure modes and releasing an evaluation framework for reliability analysis.

Findings

01

Performance improves ~6x with domain context in one-shot tasks.

02

Silent failures often produce plausible but incorrect results.

03

Performance degrades on complex reasoning tasks, with confident errors.
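
The silent-failure pattern in the findings above can be illustrated with a small guard that inspects results rather than trusting that the code ran. This is a minimal sketch, not part of the paper's released evaluation framework: the function name `find_unphysical`, the parameter names `omega_m` and `h0`, and the bounds are all assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical physical bounds, for illustration only (not taken from the paper).
PHYSICAL_BOUNDS: Dict[str, Tuple[float, float]] = {
    "omega_m": (0.0, 1.0),          # matter density fraction
    "h0": (0.0, float("inf")),      # Hubble constant in km/s/Mpc, non-negative
}


def find_unphysical(
    samples: Dict[str, Sequence[float]],
    bounds: Dict[str, Tuple[float, float]] = PHYSICAL_BOUNDS,
) -> List[str]:
    """Return a human-readable list of bound violations in posterior samples.

    An empty list means no violation was detected; it does not prove
    that the result is correct.
    """
    problems: List[str] = []
    for name, (lo, hi) in bounds.items():
        values = samples.get(name)
        if values is None:
            problems.append(f"{name}: missing")
            continue
        # NaN fails every comparison, so it is counted as out of bounds.
        bad = sum(1 for v in values if not (lo <= v <= hi))
        if bad:
            problems.append(
                f"{name}: {bad} of {len(values)} samples outside [{lo}, {hi}]"
            )
    return problems


if __name__ == "__main__":
    # The code that produced these samples ran without raising an error,
    # yet some of the values are physically inconsistent.
    posterior = {"omega_m": [0.31, 1.40], "h0": [67.4, -5.0]}
    for line in find_unphysical(posterior):
        print(line)
```

A check like this only catches results that leave the allowed range. A value that is wrong but still inside the bounds would pass, which is the harder case the findings describe.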

Abstract

Agentic AI systems are increasingly being integrated into scientific workflows, yet their behavior under realistic conditions remains insufficiently understood. We evaluate CMBAgent across two workflow paradigms and eighteen astrophysical tasks. In the One-Shot setting, access to domain-specific context yields an approximately 6x performance improvement (0.85 vs. ~0 without context), with the primary failure mode being silent incorrect computation: syntactically valid code that produces plausible but inaccurate results. In the Deep Research setting, the system frequently exhibits silent failures across stress tests, producing physically inconsistent posteriors without self-diagnosis. Overall, performance is strong on well-specified tasks but degrades on problems designed to probe reasoning limits, often without visible error signals. These findings highlight that the most concerning…