Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making

Oluyemi Enoch Amujo; Shanchieh Jay Yang

arXiv:2407.11006·cs.CL·August 22, 2024

Open Access

TL;DR

This paper evaluates large language models across multiple domains using a comprehensive framework, introducing a novel outlier detection method to improve benchmarking and inform fine-tuning decisions.

Contribution

It presents a new evaluation methodology and the ThroughCut outlier detection technique, enhancing the reliability of LLM benchmarking across diverse domains.

Findings

1. Model size and prompt type significantly affect response quality.

2. Domain-specific prompts produce more concise and consistent responses.

3. Common prompts lead to diverse and irregular responses.

Abstract

Recently, large language models (LLMs) have expanded into various domains. However, there remains a need to evaluate how these models perform when prompted with commonplace queries compared to domain-specific queries, which may be useful for benchmarking prior to fine-tuning for domain-specific downstream tasks. This study evaluates LLMs, specifically Gemma-2B and Gemma-7B, across diverse domains, including cybersecurity, medicine, and finance, compared to common knowledge queries. This study utilizes a comprehensive methodology to assess foundational models, which includes problem formulation, data analysis, and the development of ThroughCut, a novel outlier detection technique that automatically identifies response throughput outliers based on their conciseness. This methodological rigor enhances the credibility of the presented evaluation frameworks. This study focused on assessing…
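The abstract describes ThroughCut only as a technique that flags response throughput outliers, and the details of that method are not given here. As a rough illustration of the general idea of flagging throughput outliers, the sketch below uses Tukey's interquartile-range fences, a standard technique that is swapped in for illustration and is not ThroughCut itself. The function name `iqr_outliers`, the multiplier `k`, the interpretation of throughput as tokens per second, and the sample values are all assumptions made up for this example.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return indices of values outside Tukey's fences.

    This is a generic IQR rule, NOT the paper's ThroughCut method.
    A value is flagged when it lies below Q1 - k*IQR or above Q3 + k*IQR.
    """
    # Quartiles computed with linear interpolation over the observed range.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical throughput values (assumed unit: tokens per second),
# invented for illustration only; they are not results from the paper.
throughputs = [21.0, 22.5, 20.8, 23.1, 21.7, 22.0, 58.4, 21.3]
print(iqr_outliers(throughputs))
```

A fixed-multiplier rule like this one needs a hand-chosen `k`; the abstract suggests ThroughCut identifies outliers automatically, so the two should not be treated as equivalent.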

Taxonomy

Topics: Complex Systems and Decision Making