PaperQA: Retrieval-Augmented Generative Agent for Scientific Research
Jakub Lála, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, Andrew D. White

TL;DR
PaperQA is a retrieval-augmented generative model that retrieves and synthesizes information from scientific literature, outperforming existing models and matching expert researchers on complex science question-answering benchmarks.
Contribution
The paper introduces PaperQA, a novel retrieval-augmented question answering agent for scientific literature, and a new benchmark LitQA for evaluating scientific research capabilities.
Findings
PaperQA exceeds existing LLMs on science QA benchmarks.
PaperQA matches expert human researchers on LitQA.
LitQA is a new challenging benchmark for scientific literature comprehension.
Abstract
Large Language Models (LLMs) generalize well across language tasks, but suffer from hallucinations and uninterpretability, making it difficult to assess their accuracy without ground-truth. Retrieval-Augmented Generation (RAG) models have been proposed to reduce hallucinations and provide provenance for how an answer was generated. Applying such models to the scientific literature may enable large-scale, systematic processing of scientific knowledge. We present PaperQA, a RAG agent for answering questions over the scientific literature. PaperQA is an agent that performs information retrieval across full-text scientific articles, assesses the relevance of sources and passages, and uses RAG to provide answers. Viewing this agent as a question answering model, we find it exceeds performance of existing LLMs and LLM agents on current science QA benchmarks. To push the field closer to how…
Peer Reviews
Decision · Submitted to ICLR 2024
The concept introduced in the paper is promising, as it aims to develop a framework for retrieving literature to facilitate the answering of questions within scientific texts. The authors propose a novel approach that breaks down the QA task into three primary components: identifying relevant papers from online databases, extracting text from these papers, and synthesizing the information into a coherent final answer. The paper introduces a new dataset, LitQA, which necessitates the retrieval a
Lack of Novelty: The methodology presented in this paper follows the established pipeline of retrieval, reading, and answering, which has been extensively explored in prior literature. The paper does not adequately differentiate the proposed model from existing work in the field. For this approach to be considered a substantial contribution, it would require either a novel application of these methods or significant improvements over existing models, neither of which are sufficiently demonstrat
* The authors introduce new components to the standard RAG pipeline (e.g., search, map-reduce summarization, and repeated gathering of evidence).
* The framework is adaptive and modular, and is implemented with open-source libraries.
* The paper appears more product- or application-specific than focused on an underlying research problem; no research problem is stated in the text.
* The Ask LLM prompt assesses the model's parametric knowledge, which is fed into the evidence contexts. The authors found this knowledge helpful, but I do not agree with the reasons provided. For example, what would happen if the parametric knowledge conflicted with the retrieved knowledge? > “Surprisingly enou
1. PaperQA decomposes parts of a RAG and provides them as tools to an agent, and it can adjust the input to paper searches, gather evidence with different phrases, and assess if an answer is complete.
2. PaperQA makes use of a priori and a posteriori prompting, tapping into the latent knowledge in LLMs.
3. PaperQA outperforms all models tested and commercial tools, and is comparable to human experts on LitQA on performance and time.
1. The paper has some innovation, but it still feels limited. First, the dynamic use of the three tools is quite similar to the ReAct framework: both decide autonomously whether to retrieve again. Second, the constructed benchmark is relatively small, with only 50 questions in a multiple-choice format. Existing research has shown that multiple-choice questions have limitations for evaluating model performance, and the model is more oft
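The tool-based loop that the abstract and reviews describe (search for papers, gather evidence, then answer only when the evidence is sufficient) can be illustrated with a small sketch. This is not the authors' implementation: the corpus, the names `search`, `gather_evidence`, `answer`, and `run_agent`, the word-overlap score, the `min_score` threshold, and the fallback string are all hypothetical stand-ins chosen so the example runs without an LLM or an online database. In the real system an LLM would presumably choose the tools and judge relevance; here a fixed list of queries plays that role.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

# Toy corpus standing in for full-text papers (hypothetical data, not from the paper).
CORPUS: Dict[str, str] = {
    "paper_a": "Retrieval augmented generation reduces hallucinations by grounding answers in retrieved passages.",
    "paper_b": "Multiple choice benchmarks can overestimate model performance on open ended questions.",
    "paper_c": "Agents can call search tools repeatedly until enough evidence is gathered.",
}


def tokens(text: str) -> Set[str]:
    """Lowercased word set; a crude stand-in for an LLM or embedding relevance score."""
    return {word.strip(".,?").lower() for word in text.split()}


@dataclass
class AgentState:
    question: str
    papers: Set[str] = field(default_factory=set)  # paper ids found so far
    evidence: List[Tuple[float, str, str]] = field(default_factory=list)  # (score, id, passage)


def search(state: AgentState, query: str) -> None:
    """Tool 1 (assumed): add every paper that shares at least one word with the query."""
    query_words = tokens(query)
    for paper_id, text in CORPUS.items():
        if query_words & tokens(text):
            state.papers.add(paper_id)


def gather_evidence(state: AgentState, top_k: int = 2) -> None:
    """Tool 2 (assumed): score each found paper against the question, keep the top_k."""
    question_words = tokens(state.question)
    scored = []
    for paper_id in sorted(state.papers):
        text = CORPUS[paper_id]
        score = len(question_words & tokens(text)) / max(len(question_words), 1)
        scored.append((score, paper_id, text))
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first, ties by id
    state.evidence = scored[:top_k]


def answer(state: AgentState, min_score: float = 0.2) -> Optional[str]:
    """Tool 3 (assumed): return a cited answer, or None if the evidence is too weak."""
    good = [(pid, text) for score, pid, text in state.evidence if score >= min_score]
    if not good:
        return None
    return " ".join(f"{text} [{pid}]" for pid, text in good)


def run_agent(question: str, queries: List[str], max_steps: int = 3) -> str:
    """Fixed policy standing in for the LLM agent: try each query until an answer is complete."""
    state = AgentState(question)
    for query in queries[:max_steps]:
        search(state, query)
        gather_evidence(state)
        result = answer(state)
        if result is not None:
            return result
    return "I cannot answer."  # hypothetical fallback string


if __name__ == "__main__":
    print(run_agent("How does retrieval reduce hallucinations?",
                    ["protein folding", "retrieval hallucinations"]))
```

In this sketch the first query finds nothing, so the loop rephrases the search before answering, which mirrors the "repeat for more evidence" behaviour one reviewer lists. The bracketed paper id on each passage is a simple way to keep provenance attached to the answer.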
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
Methods: Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Linear Layer · Dense Connections · Attention Dropout · Adam · BART · Linear Warmup With Linear Decay
