TL;DR
This paper evaluates large language models for biomedical coreference resolution, revealing their strengths and limitations, and compares generative approaches with traditional discriminative models using domain-specific prompts.
Contribution
It provides a comprehensive benchmark of LLMs on biomedical coreference resolution, introducing prompt-based techniques and comparing them with SpanBERT.
Findings
LLMs perform well on surface-level coreference tasks with domain prompts.
Long-range context and ambiguity remain challenging for LLMs.
Entity-augmented prompts improve LLM precision and F1 scores.
Abstract
Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs' performance with four prompting experiments that vary in their use of local context, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding…
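To make the idea of domain-specific cues concrete, here is a minimal sketch of how a prompt enriched with an abbreviation list and an entity dictionary might be assembled. The function name `build_coref_prompt`, the prompt wording, and the example passage are all hypothetical; the paper's actual prompt templates are not shown in this summary.

```python
from typing import Dict, List


def build_coref_prompt(
    passage: str,
    abbreviations: Dict[str, str],
    entities: List[str],
) -> str:
    """Assemble a coreference prompt with optional domain cues.

    The layout below is an illustrative assumption, not the paper's template.
    """
    lines = [
        "Task: list every group of mentions in the passage "
        "that refer to the same entity."
    ]
    if abbreviations:
        lines.append("Abbreviations:")
        # Sort so the prompt is deterministic for a given dictionary.
        lines.extend(
            f"- {short} = {full}" for short, full in sorted(abbreviations.items())
        )
    if entities:
        lines.append("Known entities:")
        lines.extend(f"- {name}" for name in sorted(entities))
    lines.append("Passage:")
    lines.append(passage)
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative input only; not drawn from the CRAFT corpus.
    print(
        build_coref_prompt(
            "BRCA1 is a tumor suppressor. It helps repair DNA.",
            {"DNA": "deoxyribonucleic acid"},
            ["BRCA1"],
        )
    )
```

Passing empty `abbreviations` and `entities` yields a prompt with the task and passage only, which loosely corresponds to a setting without domain cues.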
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Timely and Relevant Research Question: The evaluation of LLMs on biomedical coreference resolution addresses a critical gap as these models become increasingly important in healthcare and life sciences applications.
Comprehensive Prompting Strategy Evaluation: The four-tier evaluation framework (local context, contextual enrichment, domain cues, entity augmentation) provides systematic insights into how different types of information affect LLM performance in specialized domains.
Practical Utili
w1: Limited Evaluation Scope. The paper's evaluation is restricted to a single corpus (CRAFT), which significantly undermines the generalizability of findings. While CRAFT provides rich annotations for 67 full-text biomedical articles, this narrow scope fails to capture the diversity of biomedical text types, writing styles, and domain-specific challenges across different subfields (clinical notes, molecular biology, pharmacology). The lack of cross-corpus validation makes it impossible to determ
The paper takes on an interesting idea: benchmarking zero-shot LLMs on NLP tasks that previously required bespoke fine-tuning. While not wholly original, the idea has merit in showing the capabilities of LLMs on domain-specific tasks outside those traditionally used to benchmark performance (e.g., logic/mathematical reasoning). The paper clearly defines each of the four experimental settings and presents the model performances in a digestible format.
The paper requires more justification for each of the task settings. A one-sentence statement of each experimental setting's purpose, specifying which aspect of coreference resolution it evaluates, would improve clarity. In addition, the authors frame the contributions of the paper as benchmarking LLMs against previous approaches by comparing several sizes of Llama against a 340-million-parameter model that is given no training. It is unclear what t
Strengths
- Cleanly written paper. Easy to read and understand.
Weaknesses
- Incomplete Metrics
  - Only coreference-level precision/recall reported; mention-level metrics missing.
  - SpanBERT baseline lacks precision/recall breakdown.
  - If these are already included, clear definitions are needed.
- Insufficient Dataset Statistics
  - No details on average document length, chunk size, or total chunks per document.
  - Missing counts of mentions and coreference links detected by each model.
- Limited Dataset Coverage
  - Evaluation restricted to CRAFT; ot
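The distinction the reviewer draws between mention-level and coreference-level metrics can be illustrated with a small sketch. It assumes mentions are represented as `(start, end)` token offsets and scores coreference as pairwise links within each cluster; this pairwise scheme is an illustrative simplification and may differ from the metric the paper reports (standard coreference scorers include MUC, B³, and CEAF). All names and the toy clusters are hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple

Mention = Tuple[int, int]  # (start, end) token offsets; an assumed representation
Cluster = List[Mention]


def prf(gold: Set, pred: Set) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over two sets of items."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mentions(clusters: List[Cluster]) -> Set[Mention]:
    """Mention-level view: which spans were detected, ignoring clustering."""
    return {m for cluster in clusters for m in cluster}


def links(clusters: List[Cluster]) -> Set[Tuple[Mention, Mention]]:
    """Link-level view: every pair of mentions placed in the same cluster."""
    return {pair for cluster in clusters for pair in combinations(sorted(cluster), 2)}


if __name__ == "__main__":
    # Toy clusters for illustration; not results from the paper.
    gold = [[(0, 1), (5, 6), (9, 10)]]
    pred = [[(0, 1), (5, 6)], [(9, 10), (12, 13)]]
    print("mention P/R/F1:", prf(mentions(gold), mentions(pred)))
    print("link    P/R/F1:", prf(links(gold), links(pred)))
```

In this toy case every gold mention is found (mention recall 1.0), yet only one of three gold links is recovered, which shows why reporting both levels, with explicit definitions, gives a fuller picture.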
