Enhancing Scientific Reproducibility Through Automated BioCompute Object Creation Using Retrieval-Augmented Generation from Publications
Sean Kim, Raja Mazumder

TL;DR
This paper introduces an automated method using Retrieval-Augmented Generation and Large Language Models to create BioCompute Objects from scientific publications, aiming to improve reproducibility and reduce documentation effort.
Contribution
It presents a novel BCO assistant tool that automates BCO creation from papers using RAG and LLMs, addressing challenges like hallucination and long-context understanding.
Findings
Significantly reduces documentation time for bioinformatics research.
Maintains compliance with BCO standards through optimized retrieval and prompting.
Demonstrates potential to enhance scientific reproducibility and knowledge extraction.
Abstract
The exponential growth in computational power and accessibility has transformed the complexity and scale of bioinformatics research, necessitating standardized documentation for transparency, reproducibility, and regulatory compliance. The IEEE BioCompute Object (BCO) standard addresses this need but faces adoption challenges due to the overhead of creating compliant documentation, especially for legacy research. This paper presents a novel approach to automate the creation of BCOs from scientific papers using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). We describe the development of the BCO assistant tool that leverages RAG to extract relevant information from source papers and associated code repositories, addressing key challenges such as LLM hallucination and long-context understanding. The implementation incorporates optimized retrieval processes,…
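As a rough illustration of the retrieval step described in the abstract, the sketch below picks the paper passages most relevant to one BCO domain and wraps them in a grounded prompt. It is not the authors' implementation: the query strings, the function names, and the bag-of-words similarity are assumptions made for illustration (a real system would more likely use embedding-based retrieval and pass the prompt to an LLM). Only the domain names come from the BCO standard.

```python
import math
import re
from collections import Counter

# Domain names follow the BCO standard; the query wording is a hypothetical
# choice for this sketch, not taken from the paper.
DOMAIN_QUERIES = {
    "usability_domain": "purpose objective aim of the analysis",
    "execution_domain": "software environment script run command dependencies",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(chunks, query, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = Counter(tokenize(query))
    ranked = sorted(
        chunks, key=lambda c: cosine(Counter(tokenize(c)), q), reverse=True
    )
    return ranked[:k]


def build_prompt(domain, chunks, k=2):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"- {c}" for c in retrieve(chunks, DOMAIN_QUERIES[domain], k))
    return (
        f"Using only the context below, draft the BCO {domain}.\n"
        "If the context lacks a detail, say so instead of guessing.\n"
        f"Context:\n{context}"
    )
```

Retrieving a small, domain-specific context for each BCO domain, rather than sending the whole paper at once, is one common way to limit both long-context problems and hallucination; the instruction to flag missing details serves the same goal.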
Taxonomy
Topics: Scientific Computing and Data Management · Biomedical Text Mining and Ontologies · Research Data Management Practices
Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Dense Connections · Multi-Head Attention · Linear Warmup With Linear Decay · Weight Decay · Adam · WordPiece
