Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025
Samy Ateia, Udo Kruschwitz

TL;DR
This paper investigates whether Large Language Models can improve biomedical search tasks through self-generated feedback, comparing reasoning and non-reasoning models in an expert question-answering challenge.
Contribution
It introduces a self-feedback mechanism for LLMs in biomedical retrieval tasks and evaluates its effectiveness across different models and question types.
Findings
The effect of self-feedback varies across models and tasks.
Reasoning models may generate more useful feedback than non-reasoning models.
The study offers insights into LLM self-correction for domain-specific professional search.
Abstract
Agentic Retrieval Augmented Generation (RAG) and 'deep research' systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and non-reasoning LLMs such as Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types…
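The generate, evaluate, refine loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `self_feedback`, the prompt wording, the `rounds` parameter, and the `stub_llm` stand-in are all assumptions made for illustration. In practice `llm` would wrap a call to a real model.

```python
from typing import Callable, Dict, List, Union

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def self_feedback(task_prompt: str, llm: LLM, rounds: int = 1) -> Dict[str, Union[str, List[str]]]:
    """Generate a draft, have the same model critique it, then revise.

    The prompt templates below are illustrative assumptions, not taken from the paper.
    """
    # Step 1: generate an initial output (e.g. an expanded query or an answer).
    draft = llm(f"TASK:\n{task_prompt}")
    history = [draft]

    for _ in range(rounds):
        # Step 2: the model evaluates its own draft.
        feedback = llm(
            f"CRITIQUE the draft.\nTASK:\n{task_prompt}\nDRAFT:\n{draft}"
        )
        # Step 3: the model refines the draft using that feedback.
        draft = llm(
            "REVISE the draft using the feedback.\n"
            f"TASK:\n{task_prompt}\nDRAFT:\n{draft}\nFEEDBACK:\n{feedback}"
        )
        history.append(draft)

    return {"final": draft, "history": history}


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, used only to make the sketch runnable."""
    if prompt.startswith("CRITIQUE"):
        return "Add the synonym 'neoplasm'."
    if prompt.startswith("REVISE"):
        return "cancer OR neoplasm"
    return "cancer"


result = self_feedback("Expand the query: cancer", stub_llm)
print(result["final"])
```

Passing the model as a plain callable keeps the loop independent of any particular provider, so the same sketch could be run with a reasoning or a non-reasoning model to compare the feedback each produces.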
