TL;DR
This paper investigates how cross-domain seed retrieval influences LLM ideation. It finds that seed diversity helps but that semantic relevance is not yet reliably exploited, and it releases tools and datasets for further research.
Contribution
The paper introduces PaperGym, a three-stage pipeline for seed extraction, cross-domain seed retrieval, and method synthesis, and uses it to measure how exposure to diverse seeds affects LLM ideation.
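The three stages can be pictured as three functions chained together. The sketch below is illustrative only and is not the paper's implementation: the names `Seed`, `extract_seeds`, `retrieve_cross_domain`, and `synthesize` are hypothetical, the `METHOD:` line convention is an assumption made for the example, and simple string handling stands in for the tool-augmented extraction, paraphrase-based retrieval, and LLM synthesis the paper describes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Seed:
    """A mechanism extracted from a paper, tagged with its ML domain (hypothetical type)."""
    domain: str
    mechanism: str


def extract_seeds(paper_text: str, domain: str) -> List[Seed]:
    # Stage 1 stand-in: treat every line starting with "METHOD:" as a seed.
    # The paper instead uses tool-augmented extraction (read, grep, bash).
    prefix = "METHOD:"
    return [
        Seed(domain, line[len(prefix):].strip())
        for line in paper_text.splitlines()
        if line.startswith(prefix)
    ]


def retrieve_cross_domain(query_domain: str, pool: List[Seed], k: int) -> List[Seed]:
    # Stage 2 stand-in: keep the first k seeds from any other domain.
    # The paper instead retrieves via paraphrasing across seven ML domains.
    return [seed for seed in pool if seed.domain != query_domain][:k]


def synthesize(problem: str, seeds: List[Seed]) -> str:
    # Stage 3 stand-in: concatenate mechanisms into a method description.
    # The paper instead prompts an LLM and scores the result with rubric-based judges.
    return f"{problem} <- " + " + ".join(seed.mechanism for seed in seeds)
```

For example, seeds extracted from a vision paper and an NLP paper can be pooled, and a vision problem then retrieves only the NLP seed before synthesis.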
Findings
Cross-domain retrieval increases seed diversity and novelty.
Tool-augmented extraction improves seed specificity.
Diverse seeds enhance ideation, but semantic relevance is underutilized.
Abstract
The discovery of novel methodologies for emerging problems is a continuing cycle in ML, often driven by the migration of techniques across domains. Building on this observation, we ask whether current LLM ideation systems benefit from targeted cross-domain retrieval or simply from exposure to diverse mechanisms. We study this question through PaperGym, a three-stage pipeline: (1) tool-augmented seed extraction via read, grep, and bash over an isolated paper environment, (2) cross-domain seed retrieval via paraphrasing across seven ML domains, and (3) method synthesis from retrieved seeds, each scored by rubric-based judges. Tool-augmented extraction improves specificity, and paraphrase-based retrieval broadens domain coverage. In synthesis, cross-domain retrieval receives more pairwise novelty wins than no-retrieval and same-domain baselines, but shows no significant difference from a…
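Pairwise novelty comparisons of the kind mentioned in the abstract are commonly summarized as a win rate, with a significance check on the decided comparisons. The sketch below is one way to do that under stated assumptions: the outcome labels `"A"`, `"B"`, and `"tie"`, the function names, and the choice of an exact two-sided sign test are all assumptions for illustration, since the abstract does not say which statistical test the authors used.

```python
from math import comb
from typing import Sequence


def win_rate(outcomes: Sequence[str]) -> float:
    """Fraction of decided comparisons won by condition A; ties are ignored."""
    wins = sum(1 for outcome in outcomes if outcome == "A")
    losses = sum(1 for outcome in outcomes if outcome == "B")
    decided = wins + losses
    return wins / decided if decided else 0.5


def sign_test_p(wins: int, losses: int) -> float:
    """Exact two-sided sign test p-value under the null of equal win probability."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least as lopsided as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A large p-value from such a test would correspond to the "no significant difference" wording in the abstract, while a win rate above 0.5 with a small p-value would correspond to one condition receiving more novelty wins.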
