BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs

Nourah M Salem; Elizabeth White; Michael Bada; Lawrence Hunter

arXiv:2510.25087·cs.CL·October 30, 2025

TL;DR

This paper evaluates large language models for biomedical coreference resolution, revealing their strengths and limitations, and compares generative approaches with traditional discriminative models using domain-specific prompts.

Contribution

It provides a comprehensive benchmark of LLMs on biomedical coreference resolution, introducing prompt-based techniques and comparing them with SpanBERT.

Findings

01

LLMs perform well on surface-level coreference tasks with domain prompts.

02

Long-range context and ambiguity remain challenging for LLMs.

03

Entity-augmented prompts improve LLM precision and F1 scores.
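Finding 03 refers to precision and F1. As a minimal sketch of how such scores can be computed, the snippet below scores predicted coreference clusters against gold clusters at the level of mention-pair links. This is an assumption made for illustration: the page does not state which scorer the paper uses, and the function names `cluster_links` and `link_prf` and the example mentions are hypothetical.

```python
from itertools import combinations


def cluster_links(clusters):
    """Expand clusters of mentions into a set of unordered mention pairs."""
    links = set()
    for cluster in clusters:
        # Sort so each pair has one canonical orientation.
        for a, b in combinations(sorted(set(cluster)), 2):
            links.add((a, b))
    return links


def link_prf(gold_clusters, pred_clusters):
    """Link-level precision, recall, and F1 (illustrative, not the paper's scorer)."""
    gold = cluster_links(gold_clusters)
    pred = cluster_links(pred_clusters)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: one gold chain of three mentions,
# and a prediction that recovers only one of its three links.
gold = [["the Brca1 gene", "it", "this gene"]]
pred = [["the Brca1 gene", "it"], ["this gene", "mice"]]
precision, recall, f1 = link_prf(gold, pred)
```

Here one of the two predicted links is correct and one of the three gold links is recovered, so precision is 1/2 and recall is 1/3. A real evaluation would identify mentions by character or token spans rather than surface strings, since the same string can occur many times in a full-text article.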

Abstract

Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs' performance with four prompting experiments that vary in their use of local context, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 3

Strengths

Timely and Relevant Research Question: The evaluation of LLMs on biomedical coreference resolution addresses a critical gap as these models become increasingly important in healthcare and life sciences applications.

Comprehensive Prompting Strategy Evaluation: The four-tier evaluation framework (local context, contextual enrichment, domain cues, entity augmentation) provides systematic insights into how different types of information affect LLM performance in specialized domains.

Practical Utili

Weaknesses

w1: Limited Evaluation Scope. The paper's evaluation is restricted to a single corpus (CRAFT), which significantly undermines the generalizability of findings. While CRAFT provides rich annotations for 67 full-text biomedical articles, this narrow scope fails to capture the diversity of biomedical text types, writing styles, and domain-specific challenges across different subfields (clinical notes, molecular biology, pharmacology). The lack of cross-corpus validation makes it impossible to determ

Reviewer 02·Rating 2·Confidence 3

Strengths

The paper takes on an interesting idea in benchmarking zero-shot LLMs on NLP tasks that previously required bespoke finetuning. While not wholly original, the idea has merit in showing the capabilities of LLMs on domain-specific tasks that are not traditionally used to benchmark performance (e.g., logic/mathematical reasoning). The paper very clearly defines each of the four experimental settings and presents the model performances in a digestible format.

Weaknesses

The paper requires more justification of each of the task settings. A one-sentence justification for each experimental setting, specifying which aspect of coreference resolution it evaluates, would improve clarity. In addition, the authors frame the contributions of the paper as benchmarking LLMs against previous approaches by comparing several sizes of Llama against a 340-million-parameter model that receives no training. It is unclear what t

Reviewer 03·Rating 2·Confidence 4

Strengths

- Cleanly written paper. Easy to read and understand.

Weaknesses

- Incomplete Metrics
  - Only coreference-level precision/recall reported; mention-level metrics missing.
  - SpanBERT baseline lacks precision/recall breakdown.
  - If these are already included, clear definitions are needed.
- Insufficient Dataset Statistics
  - No details on average document length, chunk size, or total chunks per document.
  - Missing counts of mentions and coreference links detected by each model.
- Limited Dataset Coverage
  - Evaluation restricted to CRAFT; ot
