BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

Dionizije Fa; Marko Culjak; Bruno Pandza; Mateo Cupic

arXiv:2601.21800·cs.AI·May 8, 2026

TL;DR

BioAgent Bench is a new benchmark dataset and evaluation suite for assessing AI agents' performance and robustness in bioinformatics tasks, highlighting strengths and failure modes of current models.

Contribution

The paper introduces BioAgent Bench, a benchmark of curated end-to-end bioinformatics tasks for evaluating AI agents, together with stress tests under controlled perturbations and an LLM-based grader that scores pipeline progress and outcome validity.

Findings

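01. Frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts.

02. Robustness tests reveal failure modes under controlled perturbations (corrupted inputs, decoy files, and prompt bloat).

03. Open-weight models may be preferable in privacy-sensitive settings.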

Abstract

This paper introduces BioAgent Bench, a benchmark dataset and an evaluation suite designed for measuring the performance and robustness of AI agents in common bioinformatics tasks. The benchmark contains curated end-to-end tasks (e.g., RNA-seq, variant calling, metagenomics) with prompts that specify concrete output artifacts to support automated assessment, including stress testing under controlled perturbations. We evaluate frontier closed-source and open-weight models across multiple agent harnesses, and use an LLM-based grader to score pipeline progress and outcome validity. We find that frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts reliably. However, robustness tests reveal failure modes under controlled perturbations (corrupted inputs, decoy files, and prompt bloat), indicating…
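
To make the three perturbation types named in the abstract concrete, the following is a minimal sketch of how such stress tests and an artifact check might look. It is an illustration under stated assumptions, not the benchmark's actual implementation: every function name, file name, and parameter here (`corrupt_input`, `add_decoy`, `bloat_prompt`, `missing_artifacts`, `keep_fraction`) is hypothetical, and the paper's LLM-based grader is not modeled.

```python
from pathlib import Path


def corrupt_input(path: Path, keep_fraction: float = 0.5) -> None:
    """Truncate a file in place to simulate a corrupted input (hypothetical helper)."""
    data = path.read_bytes()
    path.write_bytes(data[: int(len(data) * keep_fraction)])


def add_decoy(workdir: Path, real_input: Path, suffix: str = "_decoy") -> Path:
    """Write a similarly named file with irrelevant contents next to the real input."""
    decoy = workdir / f"{real_input.stem}{suffix}{real_input.suffix}"
    decoy.write_text("@decoy\nNNNN\n+\n!!!!\n")
    return decoy


def bloat_prompt(prompt: str, filler: str, repeats: int) -> str:
    """Prepend irrelevant filler lines to the task prompt."""
    return (filler + "\n") * repeats + prompt


def missing_artifacts(workdir: Path, expected: list[str]) -> list[str]:
    """Return the requested output files that the agent did not produce."""
    return [name for name in expected if not (workdir / name).is_file()]
```

In a setup like this, an agent run would be scored first by calling `missing_artifacts` on the task directory with the artifact names the prompt requested, and the same task would then be rerun after applying one perturbation at a time so that any drop in outcome can be attributed to that perturbation.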
