CrossTrace: A Cross-Domain Dataset of Grounded Scientific Reasoning Traces for Hypothesis Generation
Andrew Bouras, OMS-II Research Fellow

TL;DR
CrossTrace is a multi-domain dataset of 1,389 grounded scientific reasoning traces designed to improve hypothesis-generation models across biomedical, AI/ML, and cross-domain research.
Contribution
It introduces the first large-scale, cross-domain dataset with step-level grounded reasoning traces, enabling better training of hypothesis-generation models.
Findings
Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA substantially improves reasoning accuracy and structural compliance.
Balanced multi-domain training outperforms single-domain approaches.
Human validation found high grounding accuracy and no fabricated content.
Abstract
Scientific hypothesis generation is a critical bottleneck in accelerating research, yet existing datasets for training and evaluating hypothesis-generating models are limited to single domains and lack explicit reasoning traces connecting prior knowledge to novel contributions. I introduce CrossTrace, a dataset of 1,389 grounded scientific reasoning traces spanning biomedical research (518), AI/ML (605), and cross-domain work (266). Each trace captures the structured reasoning chain from established knowledge through intermediate logical steps to a novel hypothesis, with every step grounded in source paper text. I define an Input/Trace/Output schema that extends the Bit-Flip-Spark framework of HypoGen with step-level verification, a taxonomy of eight discovery patterns, and multi-domain coverage. Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA yields substantial improvements…
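The Input/Trace/Output schema and step-level grounding described in the abstract can be pictured with a small sketch. This is an illustration only, not the dataset's actual format: the class names, field names, domain labels, and the substring-based grounding check are all assumptions made here for clarity. Only the three domain counts are taken from the abstract.

```python
from dataclasses import dataclass
from typing import List

# Domain counts as reported in the abstract; the key names are assumed labels.
DOMAIN_COUNTS = {"biomedical": 518, "ai_ml": 605, "cross_domain": 266}


@dataclass
class TraceStep:
    claim: str     # one intermediate reasoning step (hypothetical field)
    evidence: str  # quote from the source paper meant to support the claim


@dataclass
class ReasoningTrace:
    domain: str             # one of the DOMAIN_COUNTS keys (assumed labels)
    prior_knowledge: str    # Input: the established knowledge
    steps: List[TraceStep]  # Trace: the intermediate logical steps
    hypothesis: str         # Output: the novel hypothesis
    pattern: str = "unspecified"  # discovery pattern; the eight names are not listed here


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so quotes match despite line wrapping."""
    return " ".join(text.lower().split())


def ungrounded_steps(trace: ReasoningTrace, source_text: str) -> List[int]:
    """Return indices of steps whose evidence quote is missing from the source.

    A simple stand-in for step-level verification: a step counts as grounded
    only if its non-empty evidence appears verbatim (after normalization).
    """
    source = _normalize(source_text)
    missing = []
    for index, step in enumerate(trace.steps):
        quote = _normalize(step.evidence)
        if not quote or quote not in source:
            missing.append(index)
    return missing


# Toy example with invented text, purely to exercise the check.
SOURCE = "Drug X inhibits kinase Y in vitro. Kinase Y drives tumor growth in mice."
example = ReasoningTrace(
    domain="biomedical",
    prior_knowledge="Kinase Y is implicated in tumor growth.",
    steps=[
        TraceStep("X blocks Y.", "Drug X inhibits  kinase Y"),
        TraceStep("X works in patients.", "Drug X cures tumors in humans"),
    ],
    hypothesis="Drug X may slow tumor growth by inhibiting kinase Y.",
)

print(sum(DOMAIN_COUNTS.values()))        # total number of traces
print(ungrounded_steps(example, SOURCE))  # indices of steps lacking support
```

Exact substring matching is a deliberately strict choice for the sketch. A real verification pass would likely need fuzzy or semantic matching to tolerate paraphrase.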
