Large Language Model Evaluation Via Multi AI Agents: Preliminary results
Zeeshan Rasheed, Muhammad Waseem, Kari Systä, Pekka Abrahamsson

TL;DR
This paper introduces a multi-agent AI framework for evaluating and comparing the performance of various large language models using code retrieval and verification, providing initial benchmark results.
Contribution
The paper presents a novel multi-agent AI system designed specifically for evaluating LLMs, including a verification agent and initial benchmarking results.
Findings
GPT-3.5 Turbo outperforms other models in initial tests
The multi-agent framework effectively compares different LLMs
Preliminary results establish a baseline for future evaluations
Abstract
As Large Language Models (LLMs) have become integral to both research and daily operations, rigorous evaluation is crucial. This assessment is important not only for individual tasks but also for understanding their societal impact and potential risks. Despite extensive efforts to examine LLMs from various perspectives, there is a noticeable lack of multi-agent AI models specifically designed to evaluate the performance of different LLMs. To address this gap, we introduce a novel multi-agent AI model that aims to assess and compare the performance of various LLMs. Our model consists of eight distinct AI agents, each responsible for retrieving code based on a common description from different advanced language models, including GPT-3.5, GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, Google Bard, LLAMA, and Hugging Face. Our developed model utilizes the API of each language model to retrieve code for…
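The architecture described in the abstract (one retrieval agent per language model, all given the same task description, plus a verification agent) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `RetrievalAgent` and `VerificationAgent`, the `evaluate` helper, and the two stub backends are all hypothetical, and the stubs return canned code instead of calling any real model API. The verification step shown here (run the code and check a few input/output cases) is likewise an assumption, since the visible abstract does not specify how verification is performed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a backend maps a task description to generated source code.
# In a real system this would wrap a call to a model's API.
Backend = Callable[[str], str]


@dataclass
class RetrievalAgent:
    """One agent per language model; all agents receive the same description."""
    model_name: str
    backend: Backend

    def retrieve(self, description: str) -> str:
        return self.backend(description)


@dataclass
class VerificationAgent:
    """Assumed verification strategy: execute the code and test one function."""
    func_name: str
    cases: List[Tuple[tuple, object]]  # (arguments, expected result) pairs

    def verify(self, code: str) -> bool:
        namespace: dict = {}
        try:
            # Caution: exec runs arbitrary code; sandbox this in any real use.
            exec(code, namespace)
            func = namespace[self.func_name]
            return all(func(*args) == expected for args, expected in self.cases)
        except Exception:
            # Syntax errors, missing functions, and runtime errors all count as failures.
            return False


def evaluate(agents: List[RetrievalAgent],
             verifier: VerificationAgent,
             description: str) -> Dict[str, bool]:
    """Send the common description to every agent and verify each result."""
    return {a.model_name: verifier.verify(a.retrieve(description)) for a in agents}


# Stub backends standing in for real model APIs (illustrative only).
def stub_correct(description: str) -> str:
    return "def add(a, b):\n    return a + b\n"


def stub_buggy(description: str) -> str:
    return "def add(a, b):\n    return a - b\n"


agents = [
    RetrievalAgent("model-a", stub_correct),
    RetrievalAgent("model-b", stub_buggy),
]
verifier = VerificationAgent("add", [((1, 2), 3), ((0, 0), 0)])
results = evaluate(agents, verifier, "Write a function add(a, b) that returns the sum.")
print(results)
```

Keeping the backend as a plain callable means each model's API wrapper can be swapped in without changing the comparison logic, which is what makes a side-by-side evaluation of several models straightforward.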
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
