Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making
Oluyemi Enoch Amujo, Shanchieh Jay Yang

TL;DR
This paper evaluates large language models across multiple domains using a comprehensive framework, introducing a novel outlier detection method to improve benchmarking and inform fine-tuning decisions.
Contribution
It presents a new evaluation methodology and the ThroughCut outlier detection technique, enhancing the reliability of LLM benchmarking across diverse domains.
Findings
Model size and prompt type significantly affect response quality.
Domain-specific prompts produce more concise and consistent responses.
Common prompts lead to diverse and irregular responses.
Abstract
Recently, large language models (LLMs) have expanded into various domains. However, there remains a need to evaluate how these models perform when prompted with commonplace queries compared to domain-specific queries, which may be useful for benchmarking prior to fine-tuning for domain-specific downstream tasks. This study evaluates LLMs, specifically Gemma-2B and Gemma-7B, across diverse domains, including cybersecurity, medicine, and finance, compared to common knowledge queries. This study utilizes a comprehensive methodology to assess foundational models, which includes problem formulation, data analysis, and the development of ThroughCut, a novel outlier detection technique that automatically identifies response throughput outliers based on their conciseness. This methodological rigor enhances the credibility of the presented evaluation frameworks. This study focused on assessing…
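The abstract names ThroughCut but does not describe how it works, so the sketch below does not reproduce it. As a stand-in, it shows a generic interquartile-range (IQR) rule for flagging unusual response throughput values. The function name `iqr_outliers`, the multiplier `k=1.5`, the reading of "throughput" as tokens generated per second, and the sample numbers are all assumptions made for illustration.

```python
from statistics import quantiles


def iqr_outliers(throughputs, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR].

    This is a textbook IQR rule, NOT the paper's ThroughCut method.
    """
    # Quartiles with linear interpolation over the observed range.
    q1, _, q3 = quantiles(throughputs, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, t in enumerate(throughputs) if t < low or t > high]


# Hypothetical tokens-per-second values, one per model response.
sample = [21.0, 22.5, 20.8, 23.1, 21.7, 22.0, 58.4, 21.2]
flagged = iqr_outliers(sample)
print(flagged)
```

A rule like this could be run separately on the responses to common and domain-specific prompts, so that each group is judged against its own spread rather than a shared threshold.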
Taxonomy
Topics: Complex Systems and Decision Making
