SciCoQA: Quality Assurance for Scientific Paper-Code Alignment

Tim Baumgärtner; Iryna Gurevych

arXiv:2601.12910 · cs.CL · April 23, 2026

TL;DR

SciCoQA is a dataset for measuring how well large language models detect discrepancies between scientific papers and their associated code. The accompanying analysis shows significant gaps in automated reproducibility verification.

Contribution

The paper presents SciCoQA, a new dataset for paper-code discrepancy detection, and analyzes the limitations of current LLMs on this task across multiple scientific domains.

Findings

01. Even the best-performing LLMs, Gemini 3.1 Pro and GPT-5 Mini, detect only 46.7% of real-world discrepancies.

02. Models struggle with omitted details and long contexts.

03. Discrepancies are harder to detect in papers outside the models' pre-training data.

Abstract

Discrepancies between scientific papers and their code undermine reproducibility, a concern that grows as automated research agents scale scientific output beyond human review capacity. Whether LLMs can reliably detect such discrepancies has not been systematically measured. To this end, we present SciCoQA, a dataset of 635 paper-code discrepancies (92 real, 543 synthetic) for this cross-modal verification task. Across 22 evaluated models, even the best-performing LLMs, Gemini 3.1 Pro and GPT-5 Mini, detect only 46.7% of real-world discrepancies, revealing a critical gap in automated scientific quality assurance. We construct SciCoQA from GitHub issues and reproducibility papers, and propose a synthetic generation pipeline to scale beyond AI to Physics, Quantitative Biology, and other computational sciences. We further introduce a taxonomy of discrepancy types and categories to…
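
The 46.7% figure reads most naturally as recall over the 92 real discrepancies, that is, the fraction of known discrepancies a model flags. The Python sketch below illustrates that computation on toy data. It is not the authors' evaluation code: the `Discrepancy` record, its fields, and the `recall` helper are hypothetical names introduced here, and the per-item `detected` flag is assumed to come from an upstream matching step that the abstract does not describe.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Discrepancy:
    """Hypothetical record for one known paper-code discrepancy."""
    paper_id: str
    origin: str      # assumed labels: "real" or "synthetic"
    detected: bool   # assumed output of some upstream matching step


def recall(items: Iterable[Discrepancy], origin: Optional[str] = None) -> float:
    """Fraction of known discrepancies that were detected.

    If `origin` is given, only discrepancies with that origin are counted.
    """
    pool = [d for d in items if origin is None or d.origin == origin]
    if not pool:
        raise ValueError("no discrepancies match the requested origin")
    return sum(d.detected for d in pool) / len(pool)


# Toy data for illustration only; these are not entries from SciCoQA.
toy = [
    Discrepancy("paper-a", "real", True),
    Discrepancy("paper-b", "real", False),
    Discrepancy("paper-c", "synthetic", True),
    Discrepancy("paper-d", "synthetic", True),
]

real_recall = recall(toy, origin="real")   # 1 of 2 real items detected
overall_recall = recall(toy)               # 3 of 4 items detected
print(f"real: {real_recall:.1%}, overall: {overall_recall:.1%}")
```

Under the same assumption, 43 detected out of 92 real discrepancies would round to 46.7%. The abstract does not state the exact count, so treat that number as a back-of-the-envelope check rather than a reported result.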

Code & Models

Repositories

ukplab/scicoqa
github

Datasets

UKPLab/scicoqa
dataset · 227 downloads
