Large Language Model Evaluation Via Multi AI Agents: Preliminary results

Zeeshan Rasheed; Muhammad Waseem; Kari Systä; Pekka Abrahamsson

arXiv:2404.01023·cs.SE·April 2, 2024·3 cites

Open Access

TL;DR

This paper introduces a multi-agent AI framework for evaluating and comparing the performance of various large language models using code retrieval and verification, providing initial benchmark results.

Contribution

The paper presents a novel multi-agent AI system designed specifically for evaluating LLMs, including a verification agent and initial benchmarking results.

Findings

01 GPT-3.5 Turbo outperforms other models in initial tests

02 The multi-agent framework effectively compares different LLMs

03 Preliminary results establish a baseline for future evaluations

Abstract

As Large Language Models (LLMs) have become integral to both research and daily operations, rigorous evaluation of these models is crucial. Such evaluation matters not only for individual tasks but also for understanding the societal impact and potential risks of LLMs. Despite extensive efforts to examine LLMs from various perspectives, there is a noticeable lack of multi-agent AI models designed specifically to evaluate the performance of different LLMs. To address this gap, we introduce a novel multi-agent AI model that aims to assess and compare the performance of various LLMs. Our model consists of eight distinct AI agents, each responsible for retrieving code based on a common description from different advanced language models, including GPT-3.5, GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, Google Bard, LLAMA, and Hugging Face. The model uses the API of each language model to retrieve code for…
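The retrieve-then-verify pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `RetrievalAgent`, `verify`, and `evaluate` are hypothetical, the backends are local stubs standing in for real model API calls, and the pass/fail check is an assumed stand-in for whatever criterion the paper's verification agent applies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: a backend maps a task description to generated source code.
# In a real system this would wrap a call to one language model's API.
Backend = Callable[[str], str]


@dataclass
class RetrievalAgent:
    """One agent per language model; it only forwards the shared description."""
    name: str
    backend: Backend

    def retrieve(self, description: str) -> str:
        return self.backend(description)


def verify(code: str, check: Callable[[dict], bool]) -> bool:
    """Execute generated code in a fresh namespace and apply a check to it.

    Any exception (including a syntax error) counts as a failure.
    Assumption: exec() is used only for illustration; untrusted model
    output should be run in a sandbox in practice.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        return bool(check(namespace))
    except Exception:
        return False


def evaluate(agents: List[RetrievalAgent], description: str,
             check: Callable[[dict], bool]) -> Dict[str, bool]:
    """Send the same description to every agent and verify each result."""
    return {a.name: verify(a.retrieve(description), check) for a in agents}


# Stub backends standing in for real models (hypothetical outputs).
def stub_correct(description: str) -> str:
    return "def add(a, b):\n    return a + b\n"


def stub_buggy(description: str) -> str:
    return "def add(a, b):\n    return a - b\n"


agents = [
    RetrievalAgent("model_a", stub_correct),
    RetrievalAgent("model_b", stub_buggy),
]
results = evaluate(
    agents,
    "Write add(a, b) that returns the sum of a and b.",
    lambda ns: ns["add"](2, 3) == 5,
)
print(results)
```

Keeping retrieval and verification separate means every model receives an identical description and is judged by the same check, which is what makes the per-model results comparable.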

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling