Towards LLM-Powered Task-Aware Retrieval of Scientific Workflows for Galaxy
Shamse Tasnim Cynthia, Banani Roy

TL;DR
This paper introduces a task-aware retrieval system for Galaxy workflows that combines dense vector search with LLM-based reranking; the authors report significantly improved relevance and accuracy for semantic queries.
Contribution
The paper presents a novel two-stage retrieval framework that integrates embedding models with instruction-tuned LLMs, along with a benchmark dataset for evaluating retrieval within the Galaxy ecosystem.
Findings
Significant improvement in top-k accuracy and relevance for complex queries
Effective use of LLM reranking to enhance workflow retrieval quality
First systematic evaluation of retrieval methods in Galaxy workflows
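The findings refer to top-k accuracy without defining it. A minimal sketch of one common reading follows: a query counts as a hit when at least one relevant workflow appears among its first k results. This definition, the function name `top_k_accuracy`, and the input shapes are assumptions for illustration; the paper's exact metric may differ.

```python
from typing import Dict, List, Set


def top_k_accuracy(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant workflow in the top k.

    rankings: query id -> workflow ids, ordered best first (hypothetical shape).
    relevant: query id -> set of workflow ids judged relevant (hypothetical shape).
    """
    if not rankings:
        return 0.0
    hits = 0
    for query_id, ranked in rankings.items():
        gold = relevant.get(query_id, set())
        # A hit means any of the first k results is in the relevant set.
        if any(workflow_id in gold for workflow_id in ranked[:k]):
            hits += 1
    return hits / len(rankings)
```

Under this reading, a reranker improves the score for small k when it moves relevant workflows toward the top of an unchanged candidate list.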
Abstract
Scientific Workflow Management Systems (SWfMSs) such as Galaxy have become essential infrastructure in bioinformatics, supporting the design, execution, and sharing of complex multi-step analyses. Despite hosting hundreds of reusable workflows across domains, Galaxy's current keyword-based retrieval system offers limited support for semantic query interpretation and often fails to surface relevant workflows when exact term matches are absent. To address this gap, we propose a task-aware, two-stage retrieval framework that integrates dense vector search with large language model (LLM)-based reranking. Our system first retrieves candidate workflows using state-of-the-art embedding models and then reranks them using instruction-tuned generative LLMs (GPT-4o, Mistral-7B) based on semantic task alignment. To support robust evaluation, we construct a benchmark dataset of Galaxy workflows…
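The two-stage design described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `embed` is a toy bag-of-words stand-in for a real dense embedding model, `score_fn` is a placeholder where a call to an instruction-tuned LLM would go, and the workflow ids and descriptions are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def embed(text: str) -> Counter:
    # Toy stand-in for a dense embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, workflows: Dict[str, str], k: int) -> List[str]:
    # Stage 1: rank every workflow by vector similarity and keep the top k.
    q = embed(query)
    ranked = sorted(
        workflows,
        key=lambda wid: cosine(q, embed(workflows[wid])),
        reverse=True,
    )
    return ranked[:k]


def rerank(
    query: str,
    candidates: List[str],
    workflows: Dict[str, str],
    score_fn: Callable[[str, str], float],
) -> List[str]:
    # Stage 2: reorder only the candidates using a task-alignment score.
    # In a real system score_fn would prompt an LLM; here it is injected.
    return sorted(
        candidates,
        key=lambda wid: score_fn(query, workflows[wid]),
        reverse=True,
    )


# Hypothetical workflow descriptions, for illustration only.
WORKFLOWS = {
    "wf_rnaseq": "RNA-seq differential expression analysis",
    "wf_variant": "variant calling from whole genome sequencing reads",
    "wf_qc": "quality control of sequencing reads",
}


def toy_task_score(query: str, description: str) -> float:
    # Placeholder for an LLM judgment: treats "mutations" and "variant" as
    # the same task, a link that pure token overlap cannot see.
    same_task = "mutations" in query.lower() and "variant" in description.lower()
    return 1.0 if same_task else 0.0
```

For the query "find mutations in sequencing reads", the first stage ranks the quality-control workflow above variant calling because its shorter description yields a higher cosine score, while the second stage promotes variant calling on task alignment. Keeping the reranker as an injected function means the costly model is called only on the k candidates, not on the whole collection.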
Taxonomy
Topics: Scientific Computing and Data Management · Biomedical Text Mining and Ontologies · Machine Learning in Materials Science
