PaSa: An LLM Agent for Comprehensive Academic Paper Search
Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E

TL;DR
PaSa is an LLM-powered agent designed for comprehensive academic paper search, outperforming existing methods in recall and precision on real-world queries despite being trained on synthetic data.
Contribution
Introduces PaSa, an LLM-based academic paper search agent that is optimized with reinforcement learning and evaluated on real-world queries, where it outperforms existing baselines.
Findings
PaSa outperforms existing baselines in recall and precision.
PaSa-7B surpasses Google with GPT-4o by over 37% in recall@20.
PaSa trained on synthetic data generalizes well to real-world queries.
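The recall@20 figure above can be read with a small worked example. The following is a minimal sketch of the standard recall@k definition, not the paper's evaluation code; the function name `recall_at_k` and the toy paper IDs are assumptions made for illustration.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of the relevant papers that appear in the top-k ranked results."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("relevant must contain at least one paper")
    top_k = set(ranked[:k])  # only the first k results count
    return len(top_k & relevant_set) / len(relevant_set)


# Hypothetical toy data: five ranked results, four papers judged relevant.
ranked = ["p1", "p7", "p3", "p9", "p2"]
relevant = {"p1", "p2", "p3", "p4"}

r3 = recall_at_k(ranked, relevant, 3)  # p1 and p3 found among the top 3
r5 = recall_at_k(ranked, relevant, 5)  # p1, p3 and p2 found among the top 5
```

In this toy case recall@3 is 0.5 and recall@5 is 0.75. Raising k can only keep recall the same or increase it, which is why search systems are compared at a fixed cutoff such as 20.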
Abstract
We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholar queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4o for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1,…
Taxonomy
Topics: Data Mining Algorithms and Applications · Semantic Web and Ontologies · Educational Technology and Assessment
Methods: Attention Is All You Need · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Absolute Position Encodings · Multi-Head Attention · Position-Wise Feed-Forward Layer
