Objective Metrics for Evaluating Large Language Models Using External Data Sources
Haoze Du, Richard Li, Edward Gehringer

TL;DR
This paper introduces an objective, automated framework for evaluating Large Language Models (LLMs) using external data sources, aiming to improve consistency and reproducibility and to reduce bias in performance assessments across domains.
Contribution
It presents a novel evaluation framework that leverages external datasets and structured pipelines to objectively assess LLMs, addressing limitations of subjective methods.
Findings
Framework ensures consistent and reproducible measurements
Reduces reliance on human judgment and bias
Applicable across educational, scientific, and high-stakes domains
Abstract
Evaluating the performance of Large Language Models (LLMs) is a critical yet challenging task, particularly when aiming to avoid subjective assessments. This paper proposes a framework for leveraging objective metrics derived from class textual materials across different semesters to assess LLM outputs across various tasks. By utilizing well-defined benchmarks, factual datasets, and structured evaluation pipelines, the approach ensures consistent, reproducible, and bias-minimized measurements. The framework emphasizes automation and transparency in scoring, reducing reliance on human interpretation while ensuring alignment with real-world applications. This method addresses the limitations of subjective evaluation methods, providing a scalable solution for performance assessment in educational, scientific, and other high-stakes domains.
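The abstract does not say which metrics or data formats the framework uses, so the following is only a minimal sketch of what automated scoring against an external reference set could look like. The metric choices (normalized exact match and token-level F1), the function names, and the dictionary-based data layout are all assumptions made for illustration, not the authors' pipeline.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(outputs: dict[str, str], references: dict[str, str]) -> dict[str, float]:
    """Average each metric over every item in the (hypothetical) reference set.

    Items missing from `outputs` are scored as empty predictions, so a model
    cannot raise its average by skipping questions.
    """
    if not references:
        raise ValueError("reference set is empty")
    ids = sorted(references)  # fixed order keeps runs reproducible
    em = [exact_match(outputs.get(i, ""), references[i]) for i in ids]
    f1 = [token_f1(outputs.get(i, ""), references[i]) for i in ids]
    return {"exact_match": sum(em) / len(ids), "token_f1": sum(f1) / len(ids)}
```

Because every score is computed deterministically from the model output and a fixed reference, rerunning the evaluation on the same inputs yields the same numbers, which is the reproducibility property the abstract emphasizes.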
