Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
Halley Young, Nikolaj Björner

TL;DR
This paper introduces Comet-H, an iterative system that orchestrates language model-driven research software development, addressing issues like hallucination and desynchronization to improve consistency and reliability.
Contribution
It presents a novel prompt automation framework with a transparent scoring mechanism, enabling coordinated ideation, implementation, and evaluation in research software projects.
Findings
Achieved an F1 score of 0.768 on a static analysis benchmark.
Developed 46 research-software repositories across multiple domains.
Found audit-and-contraction passes dominate successful development trajectories.
Abstract
Large language models can now generate substantial code and draft research text, but research-software projects require more than either artifact alone. The mathematical thesis, executable system, benchmark surface, and public claims must mature together, yet often drift apart. We identify two LM-specific failure modes: hallucination accumulation, in which claims exceed what code or theory supports and unsupported assertions propagate across sessions; and desynchronization, in which code, theory, or the model's own world model fall out of alignment. We propose Comet-H, an iterative prompt automaton that orchestrates ideation, implementation, evaluation, grounding, and paper-writing as coupled coordinates of a single workspace state. At each step, a controller selects the next prompt by scoring it against what the workspace currently lacks, carries unfinished follow-up work forward…
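The controller step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Workspace`, `Prompt`, `score`, and `select_next`, the per-coordinate maturity values in [0, 1], and the fixed follow-up bonus are all assumptions made for illustration. The only idea taken from the abstract is that the next prompt is chosen by scoring candidates against what the workspace currently lacks, with unfinished follow-up work carried forward.

```python
from dataclasses import dataclass, field

# Coordinates of the workspace state named in the abstract.
COORDINATES = ("ideation", "implementation", "evaluation", "grounding", "paper_writing")


@dataclass
class Prompt:
    name: str
    # Hypothetical: how strongly this prompt advances each coordinate.
    addresses: dict


@dataclass
class Workspace:
    # Hypothetical: 0.0 means nothing exists yet, 1.0 means fully mature.
    maturity: dict = field(default_factory=lambda: {c: 0.0 for c in COORDINATES})
    # Names of prompts whose follow-up work is still unfinished.
    follow_ups: list = field(default_factory=list)


FOLLOW_UP_BONUS = 0.5  # assumed constant, not taken from the paper


def score(prompt: Prompt, ws: Workspace) -> float:
    """Score a prompt by the deficit it would address, plus a carry-forward bonus."""
    deficit = sum(w * (1.0 - ws.maturity[c]) for c, w in prompt.addresses.items())
    bonus = FOLLOW_UP_BONUS if prompt.name in ws.follow_ups else 0.0
    return deficit + bonus


def select_next(prompts: list, ws: Workspace) -> Prompt:
    """Pick the prompt with the highest score for the current workspace."""
    return max(prompts, key=lambda p: score(p, ws))


if __name__ == "__main__":
    ws = Workspace()
    ws.maturity.update({"implementation": 0.9, "grounding": 0.1})
    candidates = [
        Prompt("implement", {"implementation": 1.0}),
        Prompt("audit", {"grounding": 1.0}),
    ]
    # Grounding lags far behind implementation, so the audit prompt wins.
    print(select_next(candidates, ws).name)
```

Because the score is transparent (a weighted sum of deficits), each selection can be explained by listing which coordinates lagged; a real controller would presumably use a richer state than a single number per coordinate.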
