Towards Large Reasoning Models for Agriculture
Hossein Zaremehrjerdi, Shreyan Ganguly, Ashlyn Rairdin, Elizabeth Tranel, Benjamin Feuer, Juan Ignacio Di Salvo, Srikanth Panthulugiri, Hernan Torres Pacin, Victoria Moser, Sarah Jones, Joscif G Raigne, Yanben Shen, Heidi M. Dornath, Aditya Balu, Adarsh Krishnamurthy

TL;DR
This paper introduces AgReason, a new benchmark and dataset for agricultural reasoning, demonstrating that large reasoning models outperform traditional LLMs in complex, domain-specific agricultural decision-making tasks.
Contribution
The paper presents AgReason, an expert-curated benchmark, and AgThoughts, a large-scale question-answer dataset for agricultural reasoning, and develops AgThinker, a set of small models fine-tuned to improve reasoning in agriculture.
Findings
LRMs outperform conventional models in agricultural reasoning
Gemini-based baseline achieves 36% accuracy
AgThoughts dataset improves reasoning capabilities in LLMs
Abstract
Agricultural decision-making involves complex, context-specific reasoning, where choices about crops, practices, and interventions depend heavily on geographic, climatic, and economic conditions. Traditional large language models (LLMs) often fall short in navigating this nuanced problem due to limited reasoning capacity. We hypothesize that recent advances in large reasoning models (LRMs) can better handle such structured, domain-specific inference. To investigate this, we introduce AgReason, the first expert-curated open-ended science benchmark with 100 questions for agricultural reasoning. Evaluations across thirteen open-source and proprietary models reveal that LRMs outperform conventional ones, though notable challenges persist, with the strongest Gemini-based baseline achieving 36% accuracy. We also present AgThoughts, a large-scale dataset of 44.6K question-answer pairs…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The AGREASON benchmark is a key strength. Its 100 questions were carefully curated and refined by domain experts, ensuring a high-quality, challenging evaluation.
2. This work evaluates a wide range of 18 recent open-source and proprietary models, providing a broad and up-to-date snapshot of current capabilities in this domain. The detailed, statement-level evaluation is far more insightful than simple metrics like ROUGE.
3. The authors demonstrate the value of AGTHOUGHTS by fine-tuning the A
1. The quality of the AGTHOUGHTS dataset hinges on a GPT-4.1-based filter that was designed based on expert feedback from only 200 examples. There is no quantitative analysis of this filter's accuracy (e.g., its agreement with human experts). Without this, it is difficult to gauge the level of noise, error, or bias that may have been introduced into the training data, potentially limiting the ultimate performance of models trained on it.
2. The evaluation relies solely on an LLM-as-Judge for ope
This paper targets an important question. Guidance in agriculture, especially context-aware analysis and solutions customized to specific situations, is crucial. The generated dataset not only contains test sets to benchmark existing models, but also contains a large-scale training set with CoT traces.
The generation process does not seem fully valid. Specifically, the modifiers generate random instantiations independently and are therefore very likely to produce invalid combinations, like the example shown in Figure 1. It is unclear how these invalid questions are filtered, whether by humans or by an LLM. The introduced human validation is applied only to part of the data sampled from the entire set. Although refinement is performed based on the validation, it is unclear whether the quality of the resulting
How to apply Large Reasoning Models (LRMs) to the domain of agriculture seems to be a research question that has the potential to make significant economic impacts but has not yet received much attention; the open-endedness of the proposed benchmark seems an improvement compared to existing ones.
While the datasets and models introduced by the paper are valuable for the intended community, the paper makes few technical contributions otherwise and may not be of interest to the general community. That said, I fully support publication of the paper in a relevant workshop at ICLR, if any.
1. Introduces AGREASON, the first expert-curated open-ended benchmark for agricultural reasoning (100 questions), addressing the lack of domain-specific evaluation tools.
2. Provides AGTHOUGHTS, a large-scale dataset (44.6K QA pairs) with synthetic reasoning traces, validated by agronomy experts, enabling fine-grained model training and evaluation.
3. Demonstrates the effectiveness of domain-specific fine-tuning: The AGTHINKER models (e.g., Phi-3 14B) achieve 13% accuracy on AGREASON, outperfo
1. AGREASON’s 100-question benchmark, while aligned with similar works (e.g., GPQA), may lack statistical power for robust generalization. Questions primarily cover U.S. states, limiting applicability to global agricultural contexts.
2. Human review sampled only 200 QA pairs (0.45% of the dataset), potentially overlooking systemic errors.
3. The manuscript does not convincingly demonstrate novelty, either in its data synthesis methods or in the benchmark design.
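The two numerical concerns raised above can be checked with a short calculation. The sketch below is illustrative only and not part of the paper or the reviews: it assumes the reported 36% accuracy corresponds to 36 fully correct answers out of 100 questions (the paper's statement-level scoring may differ) and that "44.6K" means roughly 44,600 QA pairs. The helper name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: 36% accuracy is treated as 36 correct out of 100 questions.
low, high = wilson_interval(36, 100)

# Assumption: "44.6K" QA pairs is taken as 44,600.
sample_fraction = 200 / 44_600

print(f"Approx. 95% interval for 36/100: [{low:.3f}, {high:.3f}]")
print(f"Human-reviewed fraction of dataset: {sample_fraction:.2%}")
```

Under these assumptions the interval spans close to twenty percentage points, which is one way to read the reviewer's concern about statistical power: score differences of a few points between models on a 100-question benchmark may not be distinguishable from noise.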
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Sentiment Analysis and Opinion Mining
