TL;DR
This paper introduces cost-effective, open-source reward models designed for evaluating scientific writing, capable of generalizing across diverse tasks and criteria without task-specific retraining.
Contribution
The paper presents a two-stage training framework and a multi-aspect evaluation design that improve how LLM-based reward models assess scientific writing.
Findings
The proposed models outperform existing baselines on scientific writing evaluation.
The approach generalizes well to unseen evaluation settings.
The two-stage training improves the models' reasoning and multi-aspect assessment capabilities.
Abstract
Scientific writing is an expert-domain task that demands deep domain knowledge, task-specific requirements and reasoning capabilities that leverage the domain knowledge to satisfy the task specifications. While scientific text generation has been widely studied, its evaluation remains a challenging and open problem. It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks while adhering to their distinct requirements. However, existing LLM-based judges and reward models are primarily optimized for general-purpose benchmarks with fixed scoring rubrics and evaluation criteria. Consequently, they often fail to reason over sparse knowledge of scientific domains when interpreting task-dependent and multi-faceted criteria. Moreover, fine-tuning for each individual task is costly and impractical for low-resource settings. To…
