SciCoQA: Quality Assurance for Scientific Paper–Code Alignment
Tim Baumgärtner, Iryna Gurevych

TL;DR
SciCoQA introduces a dataset and analysis of the challenges faced by large language models in detecting discrepancies between scientific papers and their associated code, highlighting significant gaps in automated reproducibility verification.
Contribution
The paper presents SciCoQA, a new dataset for paper-code discrepancy detection, and analyzes the limitations of current LLMs in this task across multiple scientific domains.
Findings
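Even the best-performing LLMs, Gemini 3.1 Pro and GPT-5 Mini, detect only 46.7% of real-world discrepancies.
Models struggle most when the discrepancy is a detail omitted from the paper and when the input context is long.
Discrepancies are harder to detect in papers that fall outside the models' pre-training data.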
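
To make the headline number concrete, here is a minimal sketch of how a detection rate over the real-world subset could be computed. It is an illustration only, not the authors' evaluation code: the `Discrepancy` record, its `origin` and `detected` fields, and the `detection_rate` helper are hypothetical names. The count of 43 detected items is an assumption, chosen because it is the whole number out of 92 that rounds to the reported 46.7%; the source does not state the count. The synthetic entries are placeholders that only reproduce the dataset composition (92 real, 543 synthetic).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discrepancy:
    origin: str     # "real" or "synthetic" (hypothetical field name)
    detected: bool  # whether a model flagged it (hypothetical field name)


def detection_rate(records, origin):
    """Fraction of discrepancies with the given origin that were detected."""
    subset = [r for r in records if r.origin == origin]
    if not subset:
        raise ValueError(f"no records with origin {origin!r}")
    return sum(r.detected for r in subset) / len(subset)


# Dataset composition as reported: 92 real + 543 synthetic = 635.
N_REAL, N_SYNTHETIC = 92, 543

# Assumption: 43 of the 92 real discrepancies detected (not stated in the source).
ASSUMED_DETECTED_REAL = 43

records = (
    [Discrepancy("real", i < ASSUMED_DETECTED_REAL) for i in range(N_REAL)]
    # Placeholder outcomes: no synthetic detection results are implied here.
    + [Discrepancy("synthetic", False) for _ in range(N_SYNTHETIC)]
)

real_rate = detection_rate(records, "real")
print(f"real-world detection rate: {real_rate:.1%}")  # prints 46.7%
```

Under this reading, the best models miss roughly half of the real-world cases, which is the gap the findings above point to.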
Abstract
Discrepancies between scientific papers and their code undermine reproducibility, a concern that grows as automated research agents scale scientific output beyond human review capacity. Whether LLMs can reliably detect such discrepancies has not been systematically measured. To this end, we present SciCoQA, a dataset of 635 paper-code discrepancies (92 real, 543 synthetic) for this cross-modal verification task. Across 22 evaluated models, even the best-performing LLMs, Gemini 3.1 Pro and GPT-5 Mini, detect only 46.7% of real-world discrepancies, revealing a critical gap in automated scientific quality assurance. We construct SciCoQA from GitHub issues and reproducibility papers, and propose a synthetic generation pipeline to scale beyond AI to Physics, Quantitative Biology, and other computational sciences. We further introduce a taxonomy of discrepancy types and categories to…
