DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval
Maojun Sun, Yue Wu, Yifei Xie, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, and Jian Huang

TL;DR
This paper introduces DARE, a distribution-aware retrieval model that enhances R package function retrieval for LLM agents by incorporating data distribution information, leading to improved accuracy and more reliable statistical analysis automation.
Contribution
The paper presents DARE, a novel embedding model that fuses data distribution features with function metadata, and introduces RPKB and RCodingAgent for improved R statistical workflow automation.
Findings
DARE achieves an NDCG@10 of 93.47%, outperforming existing models by up to 17%.
Integrating DARE into LLM agents significantly improves downstream statistical analysis tasks.
DARE uses fewer parameters while providing superior retrieval relevance.
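For readers unfamiliar with the headline metric, the following is a minimal sketch of how NDCG@10 is commonly computed for a single query. It is not taken from the paper: the function names are hypothetical, the gain is linear, and the ideal ranking is built only from the relevance labels of the retrieved list, which is a simplifying assumption.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking.

    Assumption: the ideal ranking is the same relevance labels sorted in
    descending order, i.e. every relevant item appears in `relevances`.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant items for this query
    return dcg_at_k(relevances, k) / ideal


# Relevance labels of the retrieved functions, in ranked order (toy data).
perfect = ndcg_at_k([1, 0, 0])   # relevant function ranked first
second = ndcg_at_k([0, 1, 0])    # relevant function ranked second
```

A reported score such as 93.47% would typically be this quantity averaged over all evaluation queries.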
Abstract
Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distribution, producing suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation and a suite of statistical…
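The idea of distribution-aware retrieval can be illustrated with a toy sketch. This is not DARE's architecture, which is a learned embedding model; here a bag-of-words cosine similarity stands in for the embedding, and the function names, descriptions, and data-profile strings are all hypothetical. The sketch only shows why appending data characteristics to both the query and the function metadata can separate candidates that look identical on function-level semantics alone.

```python
import math
from collections import Counter

# Hypothetical knowledge-base entries: same description, different data assumptions.
FUNCTIONS = [
    {"name": "regress_gaussian",
     "description": "fit a regression model",
     "data_assumptions": "continuous normal response"},
    {"name": "regress_counts",
     "description": "fit a regression model",
     "data_assumptions": "count overdispersed response"},
]


def bow(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query: str, data_profile: str, functions, use_profile: bool = True):
    """Rank function names by similarity, optionally including data characteristics."""
    q = bow(query + (" " + data_profile if use_profile else ""))
    scored = []
    for f in functions:
        doc = f["description"] + (" " + f["data_assumptions"] if use_profile else "")
        scored.append((cosine(q, bow(doc)), f["name"]))
    # Python's sort is stable, so ties keep knowledge-base order.
    return [name for _, name in sorted(scored, key=lambda s: -s[0])]


query = "fit a regression model"
profile = "count overdispersed response"  # hypothetical summary of the user's data
with_profile = rank(query, profile, FUNCTIONS)
without_profile = rank(query, profile, FUNCTIONS, use_profile=False)
```

With semantics alone the two candidates tie, so the order falls back to knowledge-base order; once the data profile is included, the candidate whose assumptions match the data ranks first.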
Taxonomy
Topics: Scientific Computing and Data Management · Computational and Text Analysis Methods · Machine Learning in Materials Science
