BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics
Dionizije Fa, Marko Culjak, Bruno Pandza, Mateo Cupic

TL;DR
BioAgent Bench is a new benchmark dataset and evaluation suite for assessing AI agents' performance and robustness in bioinformatics tasks, highlighting strengths and failure modes of current models.
Contribution
The paper introduces BioAgent Bench, a benchmark of curated end-to-end bioinformatics tasks for evaluating AI agents, with stress tests under controlled perturbations and an LLM-based grader that scores pipeline progress and outcome validity.
Findings
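Frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts reliably.
Robustness tests reveal failure modes under controlled perturbations: corrupted inputs, decoy files, and prompt bloat.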
Open-weight models may be preferable in privacy-sensitive settings.
Abstract
This paper introduces BioAgent Bench, a benchmark dataset and an evaluation suite designed for measuring the performance and robustness of AI agents in common bioinformatics tasks. The benchmark contains curated end-to-end tasks (e.g., RNA-seq, variant calling, metagenomics) with prompts that specify concrete output artifacts to support automated assessment, including stress testing under controlled perturbations. We evaluate frontier closed-source and open-weight models across multiple agent harnesses, and use an LLM-based grader to score pipeline progress and outcome validity. We find that frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts reliably. However, robustness tests reveal failure modes under controlled perturbations (corrupted inputs, decoy files, and prompt bloat), indicating…
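The three perturbation types named in the abstract (corrupted inputs, decoy files, and prompt bloat) can be illustrated with a minimal Python sketch. This is not the benchmark's implementation: the function names, the byte-flipping corruption strategy, and the in-memory `files` mapping are all assumptions made for illustration only.

```python
import random
from typing import Dict


def corrupt_bytes(data: bytes, fraction: float = 0.01, seed: int = 0) -> bytes:
    """Hypothetical 'corrupted input': flip bits at a random subset of positions."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    out = bytearray(data)
    n = min(len(out), max(1, int(len(out) * fraction))) if out else 0
    for i in rng.sample(range(len(out)), n):
        # XOR with a non-zero value guarantees the byte actually changes.
        out[i] ^= rng.randrange(1, 256)
    return bytes(out)


def with_decoys(files: Dict[str, bytes], decoys: Dict[str, bytes]) -> Dict[str, bytes]:
    """Hypothetical 'decoy files': add plausible but irrelevant files to the task inputs."""
    clash = set(files) & set(decoys)
    if clash:
        raise ValueError(f"decoy names collide with real inputs: {sorted(clash)}")
    merged = dict(files)
    merged.update(decoys)
    return merged


def bloat_prompt(prompt: str, filler: str, repeats: int) -> str:
    """Hypothetical 'prompt bloat': prepend irrelevant text before the real task."""
    return (filler + "\n") * repeats + prompt
```

Each function returns a perturbed copy and leaves its input unchanged, so a clean run and a perturbed run of the same task can be compared side by side.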
